Note: this page is a translated copy of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original source: http://stackoverflow.com/questions/36413314/
Filling missing data by random choosing from non missing values in pandas dataframe
Asked by Donald Gedeon
I have a pandas DataFrame with several missing values. I noticed that the non-missing values are close to each other, so I would like to impute each missing value by randomly choosing one of the non-missing values.
For instance:
import pandas as pd
import random
import numpy as np
foo = pd.DataFrame({'A': [2, 3, np.nan, 5, np.nan], 'B':[np.nan, 4, 2, np.nan, 5]})
foo
     A    B
0    2  NaN
1    3    4
2  NaN    2
3    5  NaN
4  NaN    5
For instance, I would like foo['A'][2] = 2 and foo['A'][4] = 3.
The shape of my pandas DataFrame is (6940, 154).
I tried this:
foo['A'] = foo['A'].fillna(random.choice(foo['A'].values.tolist()))
But it is not working. Could you help me achieve that? Best regards.
Answered by bamdan
You can combine the pandas fillna method with Python's random module to fill the missing values of a particular column with random draws from that column's non-missing values.
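import random
import pandas as pd
# One random draw per row from the non-missing values (`x != np.nan` is always True, so use dropna()).
fill = pd.Series(random.choices(df["column"].dropna().tolist(), k=len(df)), index=df.index)
# fillna does not call a function for each NaN; pass the index-aligned draws instead.
df["column"] = df["column"].fillna(fill)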
Here "column" is the name of the column whose missing values you want to fill with randomly chosen non-NaN values.
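To make the idea above concrete, here is a minimal sketch run on the foo frame from the question. It relies on two facts: fillna accepts an index-aligned Series as its value, and a single scalar passed to fillna (as in the question's attempt) fills every gap with the same value. The helper name fill_na_with_random_draws and the seed parameter are my own, not part of pandas, and the sketch assumes the index has no duplicate labels.

```python
import numpy as np
import pandas as pd


def fill_na_with_random_draws(series, seed=None):
    """Return a copy of `series` with each NaN replaced by a random observed value.

    Hypothetical helper for illustration; not part of pandas.
    """
    rng = np.random.default_rng(seed)
    observed = series.dropna().to_numpy()
    if observed.size == 0:
        # Nothing to sample from, so leave the column untouched.
        return series.copy()
    # One independent draw per row, aligned on the index; fillna only
    # uses the draws that sit at missing positions.
    draws = pd.Series(rng.choice(observed, size=len(series)), index=series.index)
    return series.fillna(draws)


foo = pd.DataFrame({'A': [2, 3, np.nan, 5, np.nan], 'B': [np.nan, 4, 2, np.nan, 5]})
foo['A'] = fill_na_with_random_draws(foo['A'], seed=0)
print(foo)
```

Each missing entry gets its own draw, so two gaps in the same column can receive different values. Passing a seed only makes the result repeatable; leave it as None for fresh draws on every call.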
Answered by Esptheitroad Murhabazi
Here is another approach to this question. It improves on the first answer and follows the advice on how to check whether a NumPy value is NaN, found in the NumPy documentation.
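# Replace each NaN with a random whole-number step in [min, max) of the observed values; assumes integer-valued data.
foo['A'] = foo['A'].apply(lambda x: np.random.choice(np.arange(foo['A'].min(), foo['A'].max())) if np.isnan(x) else x)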
Answered by Karolis
This works well for me on a pandas DataFrame:
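import random

def randomiseMissingData(df2):
    "randomise missing data for DataFrame (within a column)"
    df = df2.copy()
    for col in df.columns:
        data = df[col]
        mask = data.isnull()
        # Draw one observed value (with replacement) for every missing entry in this column.
        samples = random.choices(data[~mask].tolist(), k=int(mask.sum()))
        # Assign through df.loc so the copy itself is updated, not a temporary Series.
        df.loc[mask, col] = samples
    return df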
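With 154 columns, some of them may have no observed values at all, and you may want repeatable results. Below is a hedged variation on the function above, not a replacement for it: the name fill_missing_by_sampling and the seed parameter are my own, it uses NumPy's Generator instead of the random module, and it assumes that leaving a fully empty column untouched is acceptable.

```python
import numpy as np
import pandas as pd


def fill_missing_by_sampling(df, seed=None):
    """Return a copy of `df` where each NaN is replaced by a value sampled
    (with replacement) from the observed values of the same column.

    Hypothetical helper for illustration; columns with no observed values
    are left unchanged.
    """
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in out.columns:
        mask = out[col].isna()
        n_missing = int(mask.sum())
        observed = out.loc[~mask, col].to_numpy()
        if n_missing == 0 or observed.size == 0:
            continue  # nothing to fill, or nothing to sample from
        out.loc[mask, col] = rng.choice(observed, size=n_missing)
    return out


foo = pd.DataFrame({'A': [2, 3, np.nan, 5, np.nan], 'B': [np.nan, 4, 2, np.nan, 5]})
filled = fill_missing_by_sampling(foo, seed=42)
print(filled)
```

The input frame is not modified, so you can compare foo and filled side by side.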
Answered by peralmq
Here is another pandas DataFrame approach:
import numpy as np

def fill_with_random(df2, column):
    '''Fill `df2`'s column with name `column` with random data based on non-NaN data from `column`'''
    df = df2.copy()
    df[column] = df[column].apply(lambda x: np.random.choice(df[column].dropna().values) if np.isnan(x) else x)
    return df
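Whichever answer you pick, it may be worth checking the result, since several things can go wrong silently (gaps left unfilled, original values changed, values invented that were never observed). Here is a small sketch of such a check; check_random_fill is a hypothetical name of my own, and it assumes the original and filled frames share the same index and columns.

```python
import numpy as np
import pandas as pd


def check_random_fill(original, filled):
    """Raise AssertionError if `filled` is not a valid random fill of `original`.

    Hypothetical helper for illustration.
    """
    for col in original.columns:
        was_missing = original[col].isna()
        observed = original.loc[~was_missing, col]
        # Values that were present must be unchanged.
        assert filled.loc[~was_missing, col].equals(observed), col
        if observed.empty:
            continue  # nothing could have been sampled for this column
        # No gaps may remain, and every filled value must be an observed one.
        assert not filled[col].isna().any(), col
        assert filled.loc[was_missing, col].isin(observed).all(), col


foo = pd.DataFrame({'A': [2, 3, np.nan, 5, np.nan], 'B': [np.nan, 4, 2, np.nan, 5]})
filled = foo.copy()
filled.loc[[2, 4], 'A'] = [2, 3]   # one possible fill of column A
filled.loc[[0, 3], 'B'] = [4, 5]   # one possible fill of column B
check_random_fill(foo, filled)     # passes silently
```

The check says nothing about how random the fill is; it only confirms that the filled frame is consistent with sampling from each column's own observed values.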