Python: impute categorical missing values in scikit-learn

Disclaimer: This page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/25239958/


Impute categorical missing values in scikit-learn

python, pandas, scikit-learn, imputation

Asked by night_bat

I've got pandas data with some columns of text type. There are some NaN values in these text columns. What I'm trying to do is to impute those NaNs with sklearn.preprocessing.Imputer (replacing NaN with the most frequent value). The problem is in the implementation. Suppose there is a Pandas dataframe df with 30 columns, 10 of which are of categorical nature. Once I run:


from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
imp.fit(df) 

Python generates an error: 'could not convert string to float: 'run1'', where 'run1' is an ordinary (non-missing) value from the first column with categorical data.


Any help would be very welcome


Accepted answer by sveitser

To use mean values for numeric columns and the most frequent value for non-numeric columns you could do something like this. You could further distinguish between integers and floats. I guess it might make sense to use the median for integer columns instead.


import pandas as pd
import numpy as np

from sklearn.base import TransformerMixin

class DataFrameImputer(TransformerMixin):

    def __init__(self):
        """Impute missing values.

        Columns of dtype object are imputed with the most frequent value 
        in column.

        Columns of other types are imputed with mean of column.

        """
    def fit(self, X, y=None):

        self.fill = pd.Series([X[c].value_counts().index[0]
            if X[c].dtype == np.dtype('O') else X[c].mean() for c in X],
            index=X.columns)

        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

data = [
    ['a', 1, 2],
    ['b', 1, 1],
    ['b', 2, 2],
    [np.nan, np.nan, np.nan]
]

X = pd.DataFrame(data)
xt = DataFrameImputer().fit_transform(X)

print('before...')
print(X)
print('after...')
print(xt)

which prints,


before...
     0   1   2
0    a   1   2
1    b   1   1
2    b   2   2
3  NaN NaN NaN
after...
   0         1         2
0  a  1.000000  2.000000
1  b  1.000000  1.000000
2  b  2.000000  2.000000
3  b  1.333333  1.666667
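
As a sketch of the variation mentioned above (using the median for integer columns), the fit step could be adjusted like this. This is not part of the accepted answer, and note that a column that already contains NaN is usually stored as float in pandas, so the integer branch only fires for nullable integer dtypes or columns without missing values:

import numpy as np
import pandas as pd

from sklearn.base import TransformerMixin

class DataFrameImputerMedianInt(TransformerMixin):
    """Hypothetical variant of DataFrameImputer: most frequent value for
    object columns, median for integer columns, mean for everything else."""

    def fit(self, X, y=None):
        def fill_value(col):
            if col.dtype == np.dtype('O'):
                return col.value_counts().index[0]      # most frequent value
            if pd.api.types.is_integer_dtype(col):
                return col.median()                     # median for integer columns
            return col.mean()                           # mean for the rest
        self.fill = pd.Series([fill_value(X[c]) for c in X], index=X.columns)
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)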

Answer by scottlittle

This code fills in a series with the most frequent category:


import pandas as pd
import numpy as np

# create fake data 
m = pd.Series(list('abca'))
m.iloc[1] = np.nan #artificially introduce nan

print('m = ')
print(m)

#make dummy variables, count and sort descending:
most_common = pd.get_dummies(m).sum().sort_values(ascending=False).index[0] 

def replace_most_common(x):
    if pd.isnull(x):
        return most_common
    else:
        return x

new_m = m.map(replace_most_common) #apply function to original data

print('new_m = ')
print(new_m)

Outputs:


m =
0      a
1    NaN
2      c
3      a
dtype: object

new_m =
0    a
1    a
2    c
3    a
dtype: object
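
As a side note (not part of the original answer), a shorter equivalent sketch is to take the mode of the series directly and fill with it:

import numpy as np
import pandas as pd

m = pd.Series(list('abca'))
m.iloc[1] = np.nan  # artificially introduce nan

# Series.mode() returns the most frequent value(s); take the first and fill
new_m = m.fillna(m.mode()[0])
print(new_m)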

Answer by user1367204

Copying and modifying sveitser's answer, I made an imputer for a pandas.Series object.


import numpy
import pandas 

from sklearn.base import TransformerMixin

class SeriesImputer(TransformerMixin):

    def __init__(self):
        """Impute missing values.

        If the Series is of dtype Object, then impute with the most frequent object.
        If the Series is not of dtype Object, then impute with the mean.  

        """
    def fit(self, X, y=None):
        if   X.dtype == numpy.dtype('O'): self.fill = X.value_counts().index[0]
        else                            : self.fill = X.mean()
        return self

    def transform(self, X, y=None):
       return X.fillna(self.fill)

To use it you would do:


# Make a series
s1 = pandas.Series(['k', 'i', 't', 't', 'e', numpy.NaN])


a  = SeriesImputer()   # Initialize the imputer
a.fit(s1)              # Fit the imputer
s2 = a.transform(s1)   # Get a new series
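
Here s2 would contain ['k', 'i', 't', 't', 'e', 't'], since 't' is the most frequent value in s1.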

Answer by qAp

Similar. Modify Imputer for strategy='most_frequent':


import pandas as pd
from sklearn.preprocessing import Imputer  # legacy class, removed in scikit-learn 0.22

class GeneralImputer(Imputer):
    def __init__(self, **kwargs):
        Imputer.__init__(self, **kwargs)

    def fit(self, X, y=None):
        if self.strategy == 'most_frequent':
            self.fills = pd.DataFrame(X).mode(axis=0).squeeze()
            self.statistics_ = self.fills.values
            return self
        else:
            return Imputer.fit(self, X, y=y)

    def transform(self, X):
        if hasattr(self, 'fills'):
            return pd.DataFrame(X).fillna(self.fills).values.astype(str)
        else:
            return Imputer.transform(self, X)

where pandas.DataFrame.mode() finds the most frequent value for each column and then pandas.DataFrame.fillna() fills missing values with these. Other strategy values are still handled the same way by Imputer.

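A hypothetical usage sketch (assuming the legacy sklearn.preprocessing.Imputer, removed in scikit-learn 0.22, is still importable in your environment):

import numpy as np
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', np.nan, 'red'],
                   'size':  ['S', np.nan, 'M', 'S']})

imp = GeneralImputer(strategy='most_frequent')
# fit() stores the per-column modes; transform() fills the NaNs with them
print(imp.fit(df).transform(df))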

Answer by Gautham Kumaran

Inspired by the answers here, and wanting a go-to imputer for all use cases, I ended up writing this. It supports four imputation strategies, mean, mode, median, fill, and works on both pd.DataFrame and pd.Series.


mean and median work only for numeric data; mode and fill work for both numeric and categorical data.


import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class CustomImputer(BaseEstimator, TransformerMixin):
    def __init__(self, strategy='mean', filler='NA'):
        self.strategy = strategy
        self.fill = filler

    def fit(self, X, y=None):
        if self.strategy in ['mean', 'median']:
            if not all(X.dtypes == np.number):
                raise ValueError('dtypes mismatch: np.number dtype is '
                                 'required for ' + self.strategy)
        if self.strategy == 'mean':
            self.fill = X.mean()
        elif self.strategy == 'median':
            self.fill = X.median()
        elif self.strategy == 'mode':
            self.fill = X.mode().iloc[0]
        elif self.strategy == 'fill':
            if type(self.fill) is list and type(X) is pd.DataFrame:
                self.fill = dict([(cname, v) for cname, v in zip(X.columns, self.fill)])
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

usage


>> df   
    MasVnrArea  FireplaceQu
Id  
1   196.0   NaN
974 196.0   NaN
21  380.0   Gd
5   350.0   TA
651 NaN     Gd


>> CustomImputer(strategy='mode').fit_transform(df)
MasVnrArea  FireplaceQu
Id      
1   196.0   Gd
974 196.0   Gd
21  380.0   Gd
5   350.0   TA
651 196.0   Gd

>> CustomImputer(strategy='fill', filler=[0, 'NA']).fit_transform(df)
MasVnrArea  FireplaceQu
Id      
1   196.0   NA
974 196.0   NA
21  380.0   Gd
5   350.0   TA
651 0.0     Gd 

Answer by Austin

You can use sklearn_pandas.CategoricalImputer for the categorical columns. Details:


First (from the book Hands-On Machine Learning with Scikit-Learn and TensorFlow), you can have sub-pipelines for numerical and string/categorical features, where each sub-pipeline's first transformer is a selector that takes a list of column names (and the full_pipeline.fit_transform() takes a pandas DataFrame):


from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

You can then combine these sub pipelines with sklearn.pipeline.FeatureUnion, for example:


from sklearn.pipeline import FeatureUnion

full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline)
])

Now, in the num_pipeline you can simply use sklearn.preprocessing.Imputer(), but in the cat_pipeline you can use CategoricalImputer() from the sklearn_pandas package.

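A rough, hypothetical sketch of what the two sub-pipelines might look like (num_attribs and cat_attrib are placeholder column names, and this assumes the legacy sklearn.preprocessing.Imputer together with sklearn_pandas.CategoricalImputer, which works on one 1-D categorical column at a time):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer      # legacy API, removed in scikit-learn 0.22
from sklearn_pandas import CategoricalImputer

num_attribs = ['MasVnrArea']   # placeholder: your numeric column names
cat_attrib = 'FireplaceQu'     # placeholder: a single categorical column name,
                               # so the selector returns a 1-D array

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', Imputer(strategy='median')),
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attrib)),
    ('imputer', CategoricalImputer()),
])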

Note: the sklearn-pandas package can be installed with pip install sklearn-pandas, but it is imported as import sklearn_pandas.


Answer by Piyush

  • strategy='most_frequent' can be used only with quantitative features, not with qualitative ones. This custom imputer can be used for both qualitative and quantitative data. Also, with the scikit-learn imputer we can either use it for the whole data frame (if all features are quantitative) or use a 'for loop' with a list of features of a similar type (see the example below). The custom imputer, however, can be used with any combination.

        from sklearn.preprocessing import Imputer
        impute = Imputer(strategy='mean')
        for cols in ['quantitative_column', 'quant']:  # here both are quantitative features.
              xx[cols] = impute.fit_transform(xx[[cols]])
    
  • Custom Imputer :

       import numpy as np
       import pandas as pd
       from sklearn.preprocessing import Imputer
       from sklearn.base import TransformerMixin
    
       class CustomImputer(TransformerMixin):
             def __init__(self, cols=None, strategy='mean'):
                   self.cols = cols
                   self.strategy = strategy
    
             def transform(self, df):
                   X = df.copy()
                   impute = Imputer(strategy=self.strategy)
                   if self.cols == None:
                          self.cols = list(X.columns)
                   for col in self.cols:
                          if X[col].dtype == np.dtype('O') : 
                                 X[col].fillna(X[col].value_counts().index[0], inplace=True)
                          else : X[col] = impute.fit_transform(X[[col]])
    
                   return X
    
             def fit(self, *_):
                   return self
    
  • Dataframe:

          X = pd.DataFrame({'city':['tokyo', np.NaN, 'london', 'seattle',
                                    'san francisco', 'tokyo'],
              'boolean':['yes', 'no', np.NaN, 'no', 'no', 'yes'],
              'ordinal_column':['somewhat like', 'like', 'somewhat like', 'like',
                                'somewhat like', 'dislike'],
              'quantitative_column':[1, 11, -.5, 10, np.NaN, 20]})
    
    
                city              boolean   ordinal_column  quantitative_column
            0   tokyo             yes       somewhat like   1.0
            1   NaN               no        like            11.0
            2   london            NaN       somewhat like   -0.5
            3   seattle           no        like            10.0
            4   san francisco     no        somewhat like   NaN
            5   tokyo             yes       dislike         20.0
    
  • 1) Can be used with a list of features of a similar type.

     cci = CustomImputer(cols=['city', 'boolean']) # here default strategy = mean
     cci.fit_transform(X)
    
  • 2) Can be used with strategy='median'.

     sd = CustomImputer(['quantitative_column'], strategy = 'median')
     sd.fit_transform(X)
    
  • 3) Can be used with the whole data frame; it will use the default mean (or we can also change it to median). For qualitative features it uses strategy='most_frequent', and for quantitative features mean/median.

     call = CustomImputer()
     call.fit_transform(X)   
    

Answer by prashanth

There is a package, sklearn-pandas, which has an option for imputing categorical variables: https://github.com/scikit-learn-contrib/sklearn-pandas#categoricalimputer


>>> from sklearn_pandas import CategoricalImputer
>>> data = np.array(['a', 'b', 'b', np.nan], dtype=object)
>>> imputer = CategoricalImputer()
>>> imputer.fit_transform(data)
array(['a', 'b', 'b', 'b'], dtype=object)

Answer by sunnyspain1

You could try the following:


# idxmax() returns the label of the most frequent value (argmax() would give
# its integer position in recent pandas versions)
replace = df['<yourcolumn>'].value_counts().idxmax()

df['<yourcolumn>'].fillna(replace, inplace=True)
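
For instance, with a hypothetical column named 'color':

import numpy as np
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', np.nan, 'red']})

replace = df['color'].value_counts().idxmax()   # 'red', the most frequent value
df['color'] = df['color'].fillna(replace)       # assignment avoids chained-assignment warnings
print(df)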