Python pandas: imputing missing values in categorical columns

Question

Asked by Igor Barinov

How can I fill NaNs with the most frequent level for a categorical column in a pandas DataFrame?

The R randomForest package has the na.roughfix option, which returns: A completed data matrix or data frame. For numeric variables, NAs are replaced with column medians. For factor variables, NAs are replaced with the most frequent levels (breaking ties at random). If object contains no NAs, it is returned unaltered.

In pandas, for numeric variables, I can fill NaN values with:

df = df.fillna(df.median())
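One caveat on the line above, as a minimal sketch: when the frame mixes numeric and string columns, I believe recent pandas releases raise a TypeError on a bare df.median(), so restricting the medians to numeric columns is the safer form. The column names and data below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented example frame: one numeric column, one string column.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 30.0],
    "city": ["Oslo", "Rome", None, "Rome"],
})

# numeric_only=True limits the medians to numeric columns, so the
# string column is neither used nor filled by this step.
medians = df.median(numeric_only=True)

# fillna with a Series fills each column by matching label.
df_filled = df.fillna(medians)
```

Here only "age" is filled; the missing "city" entry is left for a separate categorical step.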

Answer 1

Accepted answer by hellpanderr

You can use df = df.fillna(df['Label'].value_counts().index[0]) to fill NaNs with the most frequent value from one column.
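Note that this expression replaces NaNs in every column of the frame with the most frequent value of 'Label'. If only that one column should be touched, a sketch like the following may be closer to the intent (the 'score' column and the data are invented for illustration):

```python
import numpy as np
import pandas as pd

# Invented data: 'Label' is categorical, 'score' is numeric.
df = pd.DataFrame({
    "Label": ["a", "b", np.nan, "b", np.nan],
    "score": [1.0, np.nan, 3.0, 4.0, 5.0],
})

# value_counts() skips NaN and sorts by frequency (descending),
# so index[0] is the most frequent level.
most_frequent = df["Label"].value_counts().index[0]

# Assign back to the single column so other columns keep their NaNs.
df["Label"] = df["Label"].fillna(most_frequent)
```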

If you want to fill every column with its own most frequent value, you can use:

df = df.apply(lambda x:x.fillna(x.value_counts().index[0]))

UPDATE 2018-25-10

Starting from version 0.13.1, pandas includes a mode method for Series and DataFrames. You can use it to fill missing values in each column (using that column's own most frequent value) like this:

df = df.fillna(df.mode().iloc[0])
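It may help to see what df.mode() returns when a column has tied modes. The sketch below uses invented data; the point is that iloc[0] takes the first tied value in sorted order, which is deterministic and so differs from the random tie-breaking described for R's na.roughfix.

```python
import numpy as np
import pandas as pd

# Invented data: 'color' has a tie, 'size' has a single clear mode.
df = pd.DataFrame({
    "color": ["red", "blue", np.nan, "blue", "red"],
    "size": ["S", "M", "M", np.nan, "L"],
})

# One row per tied mode; columns with fewer modes are padded with NaN.
modes = df.mode()

# First row: for ties, this is the smallest value in sorted order.
first_mode = modes.iloc[0]

# Fill each column with its own entry from first_mode.
df_filled = df.fillna(first_mode)
```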

Answer 2

Answer by Pratik Gohil

def fill_most_frequent(col):
    # Return the column with NaNs replaced by its most frequent value
    return col.fillna(col.value_counts().index[0])

df = df.apply(fill_most_frequent)

Answer 3

Answer by kevins_1

In recent versions of scikit-learn, you can use SimpleImputer to impute both numeric and categorical columns:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
arr = [[1., 'x'], [np.nan, 'y'], [7., 'z'], [7., 'y'], [4., np.nan]]
df1 = pd.DataFrame({'x1': [x[0] for x in arr],
                    'x2': [x[1] for x in arr]},
                  index=[l for l in 'abcde'])
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
print(pd.DataFrame(imp.fit_transform(df1),
                   columns=df1.columns,
                   index=df1.index))
#   x1 x2
# a  1  x
# b  7  y
# c  7  z
# d  7  y
# e  4  y

Answer 4

Answer by Sarah

Most of the time, you wouldn't want the same imputing strategy for all the columns. For example, you may want column mode for categorical variables and column mean or median for numeric columns.

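# numeric columns: fill each with its own mean
# (select_dtypes returns a copy, so assign the result back to df)
num_cols = df.select_dtypes(include='float').columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# categorical columns: fill each with its own mode
cat_cols = df.select_dtypes(include='object').columns
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])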
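Putting the per-dtype idea together with the R behaviour quoted in the question, here is a rough sketch of a na.roughfix-style helper. The name rough_fix is hypothetical (it is not a pandas or scikit-learn function), and ties are broken by sorted order rather than at random.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def rough_fix(df):
    """Return a copy with NaNs filled: median for numeric, mode otherwise."""
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if not col.isna().any():
            continue  # nothing to fill in this column
        if is_numeric_dtype(col):
            fill_value = col.median()
        else:
            modes = col.mode()  # NaN is excluded by default
            if modes.empty:
                continue  # column is entirely NaN; leave it alone
            fill_value = modes.iloc[0]  # ties: first in sorted order
        out[name] = col.fillna(fill_value)
    return out


# Same shape of data as the SimpleImputer example in this thread.
df = pd.DataFrame({
    "x1": [1.0, np.nan, 7.0, 7.0, 4.0],
    "x2": ["x", "y", "z", "y", np.nan],
})
fixed = rough_fix(df)
```

The helper works on a copy, so the original frame keeps its NaNs and the two can be compared afterwards.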
