Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/32617811/

Time: 2020-08-19 11:56:19  Source: igfitidea

Imputation of missing values for categories in pandas

Tags: python, r, pandas

Asked by Igor Barinov

The question is how to fill NaNs with the most frequent level of a categorical column in a pandas DataFrame.

In R's randomForest package there is the na.roughfix option: "A completed data matrix or data frame. For numeric variables, NAs are replaced with column medians. For factor variables, NAs are replaced with the most frequent levels (breaking ties at random). If object contains no NAs, it is returned unaltered."

In pandas, for numeric variables I can fill NaN values with:

df = df.fillna(df.median())
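
A hedged note on the line above: in recent pandas versions, calling df.median() on a frame that also holds non-numeric columns may raise a TypeError unless numeric_only=True is passed. The following is a minimal sketch on made-up data (the column names age, score, and label are illustrative assumptions, not from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical example data; column names are illustrative only.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 30.0],
    "score": [1.0, 2.0, np.nan, 6.0],
    "label": ["a", "b", None, "b"],
})

# Restrict the median to numeric columns; passing a Series to fillna
# fills each column named in its index with the matching value.
medians = df.median(numeric_only=True)
df = df.fillna(medians)
```

The non-numeric label column is left untouched here, which is the gap the question asks about.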

Accepted answer by hellpanderr

You can use df = df.fillna(df['Label'].value_counts().index[0]) to fill NaNs with the most frequent value from one column.
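
Note that calling fillna with a scalar on the whole frame fills NaNs in every column with that one value. If only the one column should be filled, a minimal sketch could look like this (the data and the column Other are illustrative assumptions; Label is the column name used above):

```python
import numpy as np
import pandas as pd

# Hypothetical example data; 'Other' is an illustrative second column.
df = pd.DataFrame({
    "Label": ["a", "b", "b", np.nan],
    "Other": ["x", np.nan, "y", "y"],
})

# value_counts() ignores NaN and sorts by frequency (descending),
# so index[0] is the most frequent level.
most_frequent = df["Label"].value_counts().index[0]

# Fill only the 'Label' column; 'Other' keeps its NaN.
df["Label"] = df["Label"].fillna(most_frequent)
```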

If you want to fill every column with its own most frequent value, you can use

df = df.apply(lambda x:x.fillna(x.value_counts().index[0]))

UPDATE 2018-25-10

Starting from 0.13.1, pandas includes a mode method for Series and DataFrames. You can use it to fill missing values for each column (using its own most frequent value) like this:

df = df.fillna(df.mode().iloc[0])
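
To illustrate the mode-based one-liner, here is a small sketch on made-up data (the column names size and color are assumptions for illustration). One caveat, stated as an assumption about pandas behavior: when several values tie, mode() lists them in sorted order, so iloc[0] takes the first listed mode rather than breaking ties at random as na.roughfix does.

```python
import numpy as np
import pandas as pd

# Hypothetical example data with one numeric and one categorical column.
df = pd.DataFrame({
    "size": [1.0, 1.0, 2.0, np.nan],
    "color": ["red", "blue", np.nan, "blue"],
})

# mode() returns a DataFrame with one row per modal value;
# row 0 holds the first mode of each column.
first_modes = df.mode().iloc[0]

# Fill each column with its own first mode.
filled = df.fillna(first_modes)
```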

Answer by Pratik Gohil

def fillna(col):
    # Return the column with NaNs replaced by its most frequent value
    return col.fillna(col.value_counts().index[0])

df = df.apply(fillna)

Answer by kevins_1

In more recent versions of scikit-learn you can use SimpleImputer to impute both numerics and categoricals:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
arr = [[1., 'x'], [np.nan, 'y'], [7., 'z'], [7., 'y'], [4., np.nan]]
df1 = pd.DataFrame({'x1': [x[0] for x in arr],
                    'x2': [x[1] for x in arr]},
                  index=[l for l in 'abcde'])
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
print(pd.DataFrame(imp.fit_transform(df1),
                   columns=df1.columns,
                   index=df1.index))
#   x1 x2
# a  1  x
# b  7  y
# c  7  z
# d  7  y
# e  4  y

Answer by Sarah

Most of the time, you wouldn't want the same imputing strategy for all the columns. For example, you may want column mode for categorical variables and column mean or median for numeric columns.

# numeric columns: fill each float column with its own mean
num_cols = df.select_dtypes(include='float').columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# categorical columns: fill each object column with its own mode
cat_cols = df.select_dtypes(include='object').columns
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])
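
Pulling the per-type strategies together, one way to approximate R's na.roughfix in pandas might look like the sketch below. The helper name rough_fix and the sample data are hypothetical, and unlike na.roughfix it resolves ties by taking the first mode instead of choosing at random.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def rough_fix(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with NaNs filled: median for numeric columns,
    most frequent value for all other columns (hypothetical helper)."""
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if not col.isna().any():
            continue  # nothing to fill in this column
        if is_numeric_dtype(col):
            fill_value = col.median()
        else:
            modes = col.mode()
            if modes.empty:
                continue  # column is entirely NaN; leave it alone
            fill_value = modes.iloc[0]  # first mode, not a random tie-break
        out[name] = col.fillna(fill_value)
    return out


# Hypothetical example data.
df = pd.DataFrame({
    "x1": [1.0, np.nan, 7.0, 7.0, 4.0],
    "x2": ["x", "y", "z", "y", np.nan],
})
fixed = rough_fix(df)
```

The helper works on a copy, so the original frame keeps its NaNs and the imputed result can be compared against it.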