Notice: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/36413314/

Date: 2020-09-14 00:59:50  Source: igfitidea

Filling missing data by randomly choosing from non-missing values in a pandas DataFrame

Tags: python, pandas, missing-data

Asked by Donald Gedeon

I have a pandas DataFrame with several missing values. I noticed that the non-missing values are close to each other, so I would like to impute each missing value by randomly choosing one of the non-missing values.

For instance:

import pandas as pd
import random
import numpy as np

foo = pd.DataFrame({'A': [2, 3, np.nan, 5, np.nan], 'B':[np.nan, 4, 2, np.nan, 5]})
foo
    A   B
0   2 NaN
1   3   4
2 NaN   2   
3   5 NaN
4 NaN   5

For instance, I would like foo['A'][2] = 2 and foo['A'][4] = 3. The shape of my pandas DataFrame is (6940, 154). I tried this:

foo['A'] = foo['A'].fillna(random.choice(foo['A'].values.tolist()))

But it is not working. Could you help me achieve that? Best regards.

Answered by bamdan

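You can combine the pandas fillna method with the random module to fill the missing values of a particular column with random selections from that column's non-missing values. fillna does not call a function for each missing value, and a comparison with np.nan never filters anything out, so draw one value per missing row and pass the draws to fillna as a Series indexed by those rows:

import random
import pandas as pd

# Non-missing values to sample from, and the index labels of the rows to fill.
observed = df["column"].dropna().tolist()
missing = df.index[df["column"].isna()]
# fillna aligns on the index, so each missing row receives its own draw.
df["column"] = df["column"].fillna(pd.Series(random.choices(observed, k=len(missing)), index=missing))

Where "column" is the column whose NaN values you want to fill with randomly chosen non-NaN values.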
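
The attempt in the question calls random.choice only once, so fillna receives a single scalar (which may itself be NaN) and every gap gets the same value. Below is a minimal sketch of drawing one value per gap instead, run on the question's foo. The helper name fill_column_random is made up for illustration and is not part of pandas.

```python
import random

import numpy as np
import pandas as pd

foo = pd.DataFrame({'A': [2, 3, np.nan, 5, np.nan], 'B': [np.nan, 4, 2, np.nan, 5]})


def fill_column_random(s):
    """Hypothetical helper: fill NaNs in Series `s` with random observed values."""
    observed = s.dropna().tolist()
    missing = s.index[s.isna()]
    # One independent draw (with replacement) per missing row.
    draws = pd.Series(random.choices(observed, k=len(missing)), index=missing, dtype=s.dtype)
    # fillna aligns the replacement Series on the index, so only the gaps change.
    return s.fillna(draws)


filled_a = fill_column_random(foo['A'])
print(filled_a)
```

The rows that already had a value keep it, and each gap is filled with one of 2, 3 or 5; which one varies from run to run because the draw is not seeded.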

Answered by Esptheitroad Murhabazi

This is another approach, an improvement on the first answer that uses np.isnan to check whether a value is NaN, as described in the NumPy documentation. Note that it fills each gap with a random integer from the column's minimum to its maximum rather than with one of the observed values.

foo['A'] = foo['A'].apply(lambda x: np.random.choice(range(int(foo['A'].min()), int(foo['A'].max()) + 1)) if np.isnan(x) else x)

Answered by Karolis

This works well for me on a pandas DataFrame:

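import random

def randomiseMissingData(df2):
    "randomise missing data for DataFrame (within a column)"
    df = df2.copy()
    for col in df.columns:
        data = df[col]
        mask = data.isnull()
        # one draw per missing entry, sampled with replacement from the observed values
        samples = random.choices(data[~mask].tolist(), k=int(mask.sum()))
        # assign through .loc so the copy itself is updated, not a temporary Series
        df.loc[mask, col] = samples
    return df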
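
For reproducible results, the same per-column idea can be written with a seeded NumPy generator. This is a sketch rather than the answer's own code: the name fill_missing_random and the seed parameter are assumptions, and leaving columns without any observed values untouched is a design choice made here, not something the answer specifies.

```python
import numpy as np
import pandas as pd


def fill_missing_random(df, seed=None):
    """Hypothetical helper: return a copy of `df` with each NaN replaced by a
    random observed value from the same column."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in out.columns:
        mask = out[col].isna()
        observed = out.loc[~mask, col].to_numpy()
        # Nothing to fill, or nothing to sample from: leave the column as is.
        if not mask.any() or observed.size == 0:
            continue
        # Sample with replacement, one value per missing entry.
        out.loc[mask, col] = rng.choice(observed, size=int(mask.sum()))
    return out


foo = pd.DataFrame({'A': [2, 3, np.nan, 5, np.nan], 'B': [np.nan, 4, 2, np.nan, 5]})
filled = fill_missing_random(foo, seed=0)
print(filled)
```

Passing the same seed gives the same filled frame on every call, which helps when the imputation feeds into a later analysis that has to be repeatable.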

Answered by peralmq

Here is another pandas DataFrame approach:

import numpy as np
def fill_with_random(df2, column):
    '''Fill `df2`'s column with name `column` with random data based on non-NaN data from `column`'''
    df = df2.copy()
    df[column] = df[column].apply(lambda x: np.random.choice(df[column].dropna().values) if np.isnan(x) else x)
    return df
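
One caveat with the function above: np.isnan raises a TypeError on strings, so it only works for numeric columns. The variant below is a sketch that swaps in pd.isna and builds the pool of observed values once instead of on every row. The name fill_with_random_any_dtype and the colours frame are hypothetical, chosen only for this example.

```python
import numpy as np
import pandas as pd


def fill_with_random_any_dtype(df, column):
    """Hypothetical variant of fill_with_random that also handles object columns."""
    out = df.copy()
    # Build the pool of observed values once instead of on every row.
    # If the pool is empty and there are gaps, np.random.choice will raise.
    pool = out[column].dropna().to_numpy()
    out[column] = out[column].apply(lambda x: np.random.choice(pool) if pd.isna(x) else x)
    return out


colours = pd.DataFrame({'colour': ['red', None, 'blue', np.nan, 'red']})
result = fill_with_random_any_dtype(colours, 'colour')
print(result)
```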