Python: imputation of missing values for categories in pandas
Notice: this page is a Chinese-English translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/32617811/
Imputation of missing values for categories in pandas
Asked by Igor Barinov
The question is: how do I fill NaNs with the most frequent level for a category column in a pandas DataFrame?
In R's randomForest package there is the na.roughfix option: A completed data matrix or data frame. For numeric variables, NAs are replaced with column medians. For factor variables, NAs are replaced with the most frequent levels (breaking ties at random). If object contains no NAs, it is returned unaltered.
In pandas, for numeric variables I can fill NaN values with:
df = df.fillna(df.median())
Accepted answer by hellpanderr
You can use df = df.fillna(df['Label'].value_counts().index[0]) to fill NaNs with the most frequent value from one column.
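Note that passing a single scalar to DataFrame.fillna fills NaNs in every column with that value, not only in 'Label'. A minimal sketch of restricting the fill to the one column, assuming a hypothetical frame in which the second column name 'Other' and all the data are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical example data; column names and values are illustrative only
df = pd.DataFrame({
    "Label": ["a", "b", "a", np.nan],
    "Other": ["x", np.nan, "y", "y"],
})

# Most frequent non-null value of 'Label' (value_counts skips NaN by default)
most_frequent = df["Label"].value_counts().index[0]

# Fill only the 'Label' column, leaving NaNs in other columns untouched
df["Label"] = df["Label"].fillna(most_frequent)
```

Here the NaN in 'Other' is left alone, which is usually what you want when each column has its own most frequent value.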
If you want to fill every column with its own most frequent value, you can use
df = df.apply(lambda x: x.fillna(x.value_counts().index[0]))
UPDATE 2018-25-10
Starting from 0.13.1, pandas includes a mode method for Series and DataFrames. You can use it to fill missing values for each column (using its own most frequent value) like this:
df = df.fillna(df.mode().iloc[0])
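To see what .iloc[0] is doing here, it may help to look at a column with a tie. As far as I can tell, DataFrame.mode returns one row per tied value, sorted within each column, so .iloc[0] picks the first in sort order instead of breaking ties at random as na.roughfix does. A small sketch with made-up data (the column names 'c1' and 'c2' are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data: in 'c1', both 'a' and 'b' appear twice (a tie)
df = pd.DataFrame({
    "c1": ["b", "a", "b", "a", np.nan],
    "c2": ["x", "y", "y", np.nan, "y"],
})

modes = df.mode()            # one row per tied mode; shorter columns padded with NaN
first_mode = modes.iloc[0]   # first mode of every column, indexed by column name
filled = df.fillna(first_mode)  # each column is filled with its own value
```

Because first_mode is a Series indexed by column name, fillna matches each value to its column.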
Answer by Pratik Gohil
def fillna(col):
    # Fill NaNs in a column with that column's most frequent value
    return col.fillna(col.value_counts().index[0])

df = df.apply(fillna)
Answer by kevins_1
In more recent versions of scikit-learn you can use SimpleImputer to impute both numerics and categoricals:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

arr = [[1., 'x'], [np.nan, 'y'], [7., 'z'], [7., 'y'], [4., np.nan]]
df1 = pd.DataFrame({'x1': [x[0] for x in arr],
                    'x2': [x[1] for x in arr]},
                   index=[l for l in 'abcde'])

# Replace NaNs in every column with that column's most frequent value
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
print(pd.DataFrame(imp.fit_transform(df1),
                   columns=df1.columns,
                   index=df1.index))
#   x1 x2
# a  1  x
# b  7  y
# c  7  z
# d  7  y
# e  4  y
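The example above applies most_frequent to every column, including the numeric x1. If you would rather use the median for numeric columns, one option is to combine two imputers with ColumnTransformer from sklearn.compose. This is only a sketch under that assumption, reusing the same toy data; the step names 'num' and 'cat' are arbitrary labels:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

df1 = pd.DataFrame({"x1": [1., np.nan, 7., 7., 4.],
                    "x2": ["x", "y", "z", "y", np.nan]},
                   index=list("abcde"))

# One imputer per group of columns; 'num' and 'cat' are arbitrary step names
ct = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["x1"]),
    ("cat", SimpleImputer(strategy="most_frequent"), ["x2"]),
])

# Output columns follow the order of the transformers listed above
out = pd.DataFrame(ct.fit_transform(df1),
                   columns=["x1", "x2"],
                   index=df1.index)
```

The result is an object-dtype frame, so you may want to convert x1 back to a numeric dtype afterwards.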
Answer by Sarah
Most of the time, you wouldn't want the same imputing strategy for all the columns. For example, you may want column mode for categorical variables and column mean or median for numeric columns.
# numeric columns: fill each with its own mean
>>> num_cols = df.select_dtypes(include='float').columns
>>> df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
# categorical columns: fill each with its own most frequent value
>>> cat_cols = df.select_dtypes(include='object').columns
>>> df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])
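Putting the pieces together, a rough pandas analogue of R's na.roughfix might look like the sketch below. The helper name rough_fix and the sample data are hypothetical, and unlike na.roughfix it resolves ties by sort order, not at random:

```python
import numpy as np
import pandas as pd

def rough_fix(df):
    """Median for numeric columns, most frequent value for all others."""
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if col.isna().all():
            continue  # nothing to estimate a fill value from
        if pd.api.types.is_numeric_dtype(col):
            fill = col.median()
        else:
            fill = col.mode().iloc[0]  # ties resolved by sort order, not at random
        out[name] = col.fillna(fill)
    return out

# Hypothetical example data
df = pd.DataFrame({
    "num": [1.0, np.nan, 7.0, 7.0, 4.0],
    "cat": ["x", "y", "z", "y", np.nan],
})
fixed = rough_fix(df)
```

The function works on a copy, so the original frame keeps its NaNs, and columns that are entirely NaN are left as they are.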