Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/51207491/

Time: 2020-09-14 05:46:40  Source: igfitidea

Function to replace NaN values in a dataframe with mean of the related column

Tags: python, pandas, numpy, dataframe

Asked by Marco G. de Pinto

EDIT: This question is not a duplicate of "pandas dataframe replace nan values with average of columns", because I want to replace the NaN values of each column with the mean of that column and not with the mean of all the dataframe values.

QUESTION

I have a pandas dataframe (train) with a hundred columns to which I have to apply Machine Learning techniques.

I usually do feature engineering by hand, but in this case I have too many columns to deal with.

I would like to build a Python function that:

1) Finds the NaN values in each column (I thought of using df.isnull().any())

2) Replaces each NaN value with the mean of the column in which it was found.

My idea was something like this:

def replace(value):
    for value in train:
        if train['value'].isnull():
           train['value'] = train['value'].fillna(train['value'].mean())

train = train.apply(replace,axis=1)

But I receive the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/opt/conda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3063             try:
-> 3064                 return self._engine.get_loc(key)
   3065             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'value'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-25-003b3eb2463c> in <module>()
----> 1 train = train.apply(replace,axis=1)

/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
   6012                          args=args,
   6013                          kwds=kwds)
-> 6014         return op.get_result()
   6015 
   6016     def applymap(self, func):

/opt/conda/lib/python3.6/site-packages/pandas/core/apply.py in get_result(self)
    140             return self.apply_raw()
    141 
--> 142         return self.apply_standard()
    143 
    144     def apply_empty_result(self):

/opt/conda/lib/python3.6/site-packages/pandas/core/apply.py in apply_standard(self)
    246 
    247         # compute the result using the series generator
--> 248         self.apply_series_generator()
    249 
    250         # wrap results

/opt/conda/lib/python3.6/site-packages/pandas/core/apply.py in apply_series_generator(self)
    275             try:
    276                 for i, v in enumerate(series_gen):
--> 277                     results[i] = self.f(v)
    278                     keys.append(v.name)
    279             except Exception as e:

<ipython-input-22-2e7fa654e765> in replace(value)
      1 def replace(value):
      2     for value in train:
----> 3         if train['value'].isnull():
      4            train['value'] = train['value'].fillna(df['value'].mean())

/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2686             return self._getitem_multilevel(key)
   2687         else:
-> 2688             return self._getitem_column(key)
   2689 
   2690     def _getitem_column(self, key):

/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in _getitem_column(self, key)
   2693         # get column
   2694         if self.columns.is_unique:
-> 2695             return self._get_item_cache(key)
   2696 
   2697         # duplicate columns & possible reduce dimensionality

/opt/conda/lib/python3.6/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
   2484         res = cache.get(item)
   2485         if res is None:
-> 2486             values = self._data.get(item)
   2487             res = self._box_item_values(item, values)
   2488             cache[item] = res

/opt/conda/lib/python3.6/site-packages/pandas/core/internals.py in get(self, item, fastpath)
   4113 
   4114             if not isna(item):
-> 4115                 loc = self.items.get_loc(item)
   4116             else:
   4117                 indexer = np.arange(len(self.items))[isna(self.items)]

/opt/conda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3064                 return self._engine.get_loc(key)
   3065             except KeyError:
-> 3066                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   3067 
   3068         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: ('value', 'occurred at index 0')

While searching for solutions, I found:

  • This, but it works with a txt file (not a pandas dataframe)

  • This question about the df.isnull().any() method.

Answered by Quickbeam2k1

You can also use fillna

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [2, np.nan, np.nan]})
df.fillna(df.mean(axis=0))
     A    B
0  1.0  2.0
1  2.0  2.0
2  1.5  2.0

df.mean(axis=0) computes the mean of every column, and the resulting Series is passed to the fillna method, which fills each column with its own mean.
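
As a minimal sketch of the same idea on a frame that also contains a non-numeric column: the column 'C' below is hypothetical (it is not part of the original answer), and numeric_only=True is used on the assumption that text columns should be left alone. fillna only fills the columns whose names appear in the Series it receives.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan],
    'B': [2, np.nan, np.nan],
    'C': ['x', None, 'z'],  # hypothetical non-numeric column, for illustration only
})

# Series of per-column means, indexed by column name; 'C' is skipped
col_means = df.mean(axis=0, numeric_only=True)

# Each numeric column is filled with its own mean; 'C' is not in col_means,
# so its missing value is left untouched
filled = df.fillna(col_means)
print(filled)
```

The original df is not modified; assign the result back (df = df.fillna(col_means)) if you want to keep it.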

On my machine, this solution is twice as fast as the solution using apply for the data set shown above.
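
If you want to check the timing on your own machine, a rough sketch using the standard timeit module could look like the following. The helper names with_fillna and with_apply and the repetition count are assumptions made for illustration, and the numbers will vary with hardware, pandas version, and data size.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [2, np.nan, np.nan]})

def with_fillna():
    # Column means computed once, then passed to a single fillna call
    return df.fillna(df.mean(axis=0))

def with_apply():
    # fillna called once per column through apply
    return df.apply(lambda x: x.fillna(x.mean()))

# Sanity check: both approaches should give the same frame
assert with_fillna().equals(with_apply())

n = 200  # arbitrary number of repetitions
t_fillna = timeit.timeit(with_fillna, number=n)
t_apply = timeit.timeit(with_apply, number=n)
print(f"fillna: {t_fillna / n * 1e6:.0f} microseconds per call")
print(f"apply:  {t_apply / n * 1e6:.0f} microseconds per call")
```

On a three-row frame the measurement is dominated by call overhead, so it is worth repeating it on data of realistic size before drawing conclusions.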

Answered by zipa

To fill the NaN values of each column with its respective mean, use:

df.apply(lambda x: x.fillna(x.mean())) 

Answered by Paul-Darius

You can try something like:

[df[col].fillna(df[col].mean(), inplace=True) for col in df.columns]

But that is just one way to do it. Your code is almost correct. Your mistake is that you should write

train[value]

instead of:

train['value']

everywhere in your code, because the latter looks for a column literally named "value", whereas value is the loop variable that holds each column name as you iterate.
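
Putting the train[value] fix together with the loop from the question, a working version might look like the sketch below. The function name fill_nan_with_column_mean and the sample data are assumptions for illustration. Two further changes beyond the indexing fix are needed: isnull() returns a Series, so it has to be reduced with .any() before it can be used in an if, and the function is called directly on the frame rather than through apply(axis=1). The result is assigned back to the column instead of using inplace=True on a selected column, since recent pandas versions warn that the inplace form may not update the original frame.

```python
import numpy as np
import pandas as pd

def fill_nan_with_column_mean(frame):
    """Return a copy of frame with NaNs in numeric columns replaced by the column mean."""
    result = frame.copy()
    for col in result.columns:
        # Skip non-numeric columns, and columns that have nothing to fill
        if pd.api.types.is_numeric_dtype(result[col]) and result[col].isnull().any():
            result[col] = result[col].fillna(result[col].mean())
    return result

# Sample data standing in for the real train frame
train = pd.DataFrame({'A': [1, 2, np.nan], 'B': [2, np.nan, np.nan]})
train = fill_nan_with_column_mean(train)
print(train)
```

Returning a copy keeps the original frame intact, which makes it easy to compare the data before and after imputation.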
