Function to replace NaN values in a dataframe with the mean of the related column
Source: http://stackoverflow.com/questions/51207491/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Asked by Marco G. de Pinto
EDIT: This question is not a duplicate of "pandas dataframe replace nan values with average of columns", because I want to replace the NaN values in each column with the mean of that column, not with the mean of all the dataframe values.
QUESTION
I have a pandas dataframe (train) with a hundred columns, to which I have to apply Machine Learning techniques.
Usually I do feature engineering by hand, but in this case I have a lot of columns to deal with.
I would like to build a Python function that:
1) Finds the NaN values in each column (I have thought of using df.isnull().any())
2) Replaces each NaN value with the mean of the column in which it was found.
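As a minimal sketch of step 1 only, the check below shows what df.isnull().any() returns and how it can be turned into a list of affected columns. The small frame and its column names are hypothetical stand-ins for the real train data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real `train` frame.
train = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, 6.0],
    "c": [np.nan, np.nan, 9.0],
})

# Boolean Series, True for every column holding at least one NaN.
has_nan = train.isnull().any()
cols_with_nan = has_nan[has_nan].index.tolist()

# Number of NaN values per column.
nan_counts = train.isnull().sum()

print(cols_with_nan)
print(nan_counts.to_dict())
```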
My idea was something like this:
def replace(value):
    for value in train:
        if train['value'].isnull():
            train['value'] = train['value'].fillna(train['value'].mean())
train = train.apply(replace,axis=1)
But I receive the following error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/opt/conda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3063 try:
-> 3064 return self._engine.get_loc(key)
3065 except KeyError:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'value'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-25-003b3eb2463c> in <module>()
----> 1 train = train.apply(replace,axis=1)
/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
6012 args=args,
6013 kwds=kwds)
-> 6014 return op.get_result()
6015
6016 def applymap(self, func):
/opt/conda/lib/python3.6/site-packages/pandas/core/apply.py in get_result(self)
140 return self.apply_raw()
141
--> 142 return self.apply_standard()
143
144 def apply_empty_result(self):
/opt/conda/lib/python3.6/site-packages/pandas/core/apply.py in apply_standard(self)
246
247 # compute the result using the series generator
--> 248 self.apply_series_generator()
249
250 # wrap results
/opt/conda/lib/python3.6/site-packages/pandas/core/apply.py in apply_series_generator(self)
275 try:
276 for i, v in enumerate(series_gen):
--> 277 results[i] = self.f(v)
278 keys.append(v.name)
279 except Exception as e:
<ipython-input-22-2e7fa654e765> in replace(value)
1 def replace(value):
2 for value in train:
----> 3 if train['value'].isnull():
4 train['value'] = train['value'].fillna(df['value'].mean())
/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
2686 return self._getitem_multilevel(key)
2687 else:
-> 2688 return self._getitem_column(key)
2689
2690 def _getitem_column(self, key):
/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in _getitem_column(self, key)
2693 # get column
2694 if self.columns.is_unique:
-> 2695 return self._get_item_cache(key)
2696
2697 # duplicate columns & possible reduce dimensionality
/opt/conda/lib/python3.6/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
2484 res = cache.get(item)
2485 if res is None:
-> 2486 values = self._data.get(item)
2487 res = self._box_item_values(item, values)
2488 cache[item] = res
/opt/conda/lib/python3.6/site-packages/pandas/core/internals.py in get(self, item, fastpath)
4113
4114 if not isna(item):
-> 4115 loc = self.items.get_loc(item)
4116 else:
4117 indexer = np.arange(len(self.items))[isna(self.items)]
/opt/conda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3064 return self._engine.get_loc(key)
3065 except KeyError:
-> 3066 return self._engine.get_loc(self._maybe_cast_indexer(key))
3067
3068 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: ('value', 'occurred at index 0')
While searching for solutions, I found:
Answered by Quickbeam2k1
You can also use fillna:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [2, np.nan, np.nan]})
df.fillna(df.mean(axis=0))

     A    B
0  1.0  2.0
1  2.0  2.0
2  1.5  2.0
df.mean(axis=0) computes the mean of every column, and the result is passed to the fillna method.
On my machine, this solution is twice as fast as the solution using apply for the data set shown above.
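To make the mechanism visible, the sketch below prints the intermediate Series of column means and then runs a rough timing of both approaches. The timeit setup and the repetition count are assumptions for illustration; the measured ratio depends on the machine, the pandas version, and the size of the frame.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [2, np.nan, np.nan]})

# df.mean(axis=0) is a Series indexed by column name; NaN is skipped by default.
col_means = df.mean(axis=0)
print(col_means.to_dict())

# fillna matches the Series index against the column labels,
# so each column is filled with its own mean.
filled = df.fillna(col_means)
print(filled)

# Rough timing of both approaches (200 repetitions is an arbitrary choice).
t_fillna = timeit.timeit(lambda: df.fillna(df.mean(axis=0)), number=200)
t_apply = timeit.timeit(lambda: df.apply(lambda x: x.fillna(x.mean())), number=200)
print(f"fillna: {t_fillna:.4f}s, apply: {t_apply:.4f}s")
```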
Answered by zipa
To fill the NaN values of each column with its respective mean, use:
df.apply(lambda x: x.fillna(x.mean()))
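With the default axis=0, apply calls the lambda once per column. If the frame also holds non-numeric columns, a mean is not defined for them, so one possible variation is to restrict the call to numeric columns. The sketch below assumes a hypothetical frame with a text column named label.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one non-numeric column.
df = pd.DataFrame({
    "A": [1.0, 2.0, np.nan],
    "B": [2.0, np.nan, np.nan],
    "label": ["x", None, "z"],
})

# Select only the numeric columns, since a mean is not defined for strings.
num_cols = df.select_dtypes(include="number").columns

# Each numeric column is filled with its own mean; "label" is left untouched.
df[num_cols] = df[num_cols].apply(lambda x: x.fillna(x.mean()))
print(df)
```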
Answered by Paul-Darius
You can try something like:
[df[col].fillna(df[col].mean(), inplace=True) for col in df.columns]
But that is just one way to do it. Your code is a priori almost correct. Your error is that you should call
train[value]
instead of:
train['value']
everywhere in your code, because the latter looks for a column literally named "value", whereas value is a variable that holds each column name you are iterating over.
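As a sketch of how the loop from the question might look once the column variable is used, the function below also makes two further changes that go beyond this answer. First, isnull() returns a boolean Series, and using a Series directly in an if statement raises a ValueError, so .any() is added. Second, the function is called directly on the frame instead of through apply(axis=1), which would pass rows. The function name replace_nan_with_mean and the sample frame are hypothetical. The result is assigned back to the column rather than using inplace=True on df[col], because in recent pandas versions that chained in-place call may not update the original frame.

```python
import numpy as np
import pandas as pd


def replace_nan_with_mean(frame):
    """Return a copy of `frame` with NaN in each numeric column replaced by that column's mean."""
    result = frame.copy()
    for col in result:  # iterating over a DataFrame yields its column labels
        is_numeric = pd.api.types.is_numeric_dtype(result[col])
        # isnull() returns a boolean Series, so .any() reduces it to a single bool
        if is_numeric and result[col].isnull().any():
            result[col] = result[col].fillna(result[col].mean())
    return result


# Hypothetical stand-in for the real `train` frame.
train = pd.DataFrame({'A': [1, 2, np.nan], 'B': [2, np.nan, np.nan]})
train = replace_nan_with_mean(train)
print(train)
```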