Pandas Dataframe: Replacing NaN with row average
Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors on StackOverflow.
Source: http://stackoverflow.com/questions/33058590/
Asked by Aenaon
I am trying to learn pandas but I have been puzzled by the following. I want to replace NaNs in a DataFrame with the row average. Hence something like df.fillna(df.mean(axis=1)) should work, but for some reason it fails for me. Am I missing anything, or doing something wrong? Is it because it's not implemented (see link here)?
import pandas as pd
import numpy as np
pd.__version__
Out[44]:
'0.15.2'
In [45]:
df = pd.DataFrame()
df['c1'] = [1, 2, 3]
df['c2'] = [4, 5, 6]
df['c3'] = [7, np.nan, 9]
df
Out[45]:
   c1  c2   c3
0   1   4    7
1   2   5  NaN
2   3   6    9
In [46]:
df.fillna(df.mean(axis=1))
Out[46]:
   c1  c2   c3
0   1   4    7
1   2   5  NaN
2   3   6    9
However, something like this seems to work fine:
df.fillna(df.mean(axis=0))
Out[47]:
   c1  c2  c3
0   1   4   7
1   2   5   8
2   3   6   9
Answer by Andy Hayden
As commented, the axis argument to fillna is NotImplemented, so the following does not work:
df.fillna(df.mean(axis=1), axis=1)
Note: the axis would be critical here, as you don't want to fill your nth column with the nth row average.
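To see why the axis matters: fillna with a Series matches the Series index against the column labels. The sketch below is a small illustration under that assumption; the frame `demo` is made up so that its column labels coincide with its row labels, which makes the mismatch visible.

```python
import numpy as np
import pandas as pd

# The question's frame: row means are indexed 0, 1, 2, which match none of
# the column labels c1, c2, c3, so nothing is filled.
df = pd.DataFrame({'c1': [1, 2, 3], 'c2': [4, 5, 6], 'c3': [7, np.nan, 9]})
unchanged = df.fillna(df.mean(axis=1))

# Hypothetical frame whose column labels 0, 1, 2 coincide with the row labels.
demo = pd.DataFrame([[1, 4, 7], [2, 5, np.nan], [3, 6, 9]])
row_means = demo.mean(axis=1)   # 0 -> 4.0, 1 -> 3.5, 2 -> 6.0

# fillna aligns the Series index with the COLUMN labels, so the NaN in
# column 2 is filled with row_means[2] (the mean of row 2), not with the
# mean of row 1, where the NaN actually sits.
filled = demo.fillna(row_means)
print(unchanged)
print(filled)
```

Here the missing value in `demo` ends up as 6.0 rather than the intended 3.5.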
For now you'll need to iterate through:
In [11]: m = df.mean(axis=1)
         for i, col in enumerate(df):
             # using i allows for duplicate columns
             # inplace *may* not always work here, so IMO the next line is preferred
             # df.iloc[:, i].fillna(m, inplace=True)
             df.iloc[:, i] = df.iloc[:, i].fillna(m)
In [12]: df
Out[12]:
   c1  c2   c3
0   1   4  7.0
1   2   5  3.5
2   3   6  9.0
An alternative is to fillna the transpose and then transpose back, which may be more efficient:
df.T.fillna(df.mean(axis=1)).T
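A small sketch of the transpose round trip on the question's frame (variable names are illustrative). One side effect to be aware of: transposing a frame with mixed integer and float columns upcasts the values to a common dtype, so the integer columns come back as floats.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'c1': [1, 2, 3], 'c2': [4, 5, 6], 'c3': [7, np.nan, 9]})

# After transposing, the original row labels become column labels, so the
# row means (indexed by row label) line up with the columns fillna fills.
result = df.T.fillna(df.mean(axis=1)).T

print(result)
print(result.dtypes)   # every column is float64 after the round trip
```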
Answer by Cleb
As an alternative, you could also use apply with a lambda expression like this:
df.apply(lambda row: row.fillna(row.mean()), axis=1)
which also yields
    c1   c2   c3
0  1.0  4.0  7.0
1  2.0  5.0  3.5
2  3.0  6.0  9.0
Answer by Troy
I'll propose an alternative that involves casting into numpy arrays. Performance-wise, I think this is more efficient and probably scales better than the other solutions proposed so far.
The idea is to use an indicator matrix (df.isna().values, which is 1 if the element is N/A and 0 otherwise) and broadcast-multiply it with the row averages.
Thus, we end up with a matrix (exactly the same shape as the original df) that contains the row-average value where the original element was N/A, and 0 otherwise.
We add this matrix to the original df, making sure to fillna with 0 first, so that, in effect, the N/A's are filled with the respective row averages.
# setup code
df = pd.DataFrame()
df['c1'] = [1, 2, 3]
df['c2'] = [4, 5, 6]
df['c3'] = [7, np.nan, 9]
# fillna row-wise
row_avgs = df.mean(axis=1).values.reshape(-1,1)
df = df.fillna(0) + df.isna().values * row_avgs
df
giving
    c1   c2   c3
0  1.0  4.0  7.0
1  2.0  5.0  3.5
2  3.0  6.0  9.0
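The same indicator-matrix idea can also be written with numpy.where, which selects the row average where the mask is true instead of multiplying by it. This is a variant sketch, not the code from the answer above, and the helper name `fill_with_row_mean` is made up for illustration.

```python
import numpy as np
import pandas as pd

def fill_with_row_mean(frame):
    """Hypothetical helper: fill each NaN with the mean of its row."""
    values = frame.to_numpy(dtype=float)
    mask = np.isnan(values)                                   # indicator matrix
    row_avgs = frame.mean(axis=1).to_numpy().reshape(-1, 1)   # shape (n_rows, 1)
    # Broadcasting stretches the (n_rows, 1) averages across every column;
    # np.where picks the average where mask is True, the original value otherwise.
    filled = np.where(mask, row_avgs, values)
    return pd.DataFrame(filled, index=frame.index, columns=frame.columns)

df = pd.DataFrame({'c1': [1, 2, 3], 'c2': [4, 5, 6], 'c3': [7, np.nan, 9]})
out = fill_with_row_mean(df)
print(out)
```

Because the averages are selected rather than added, there is no need to fill the frame with 0 first.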
Answer by LKho
I just had the same problem. I found that this workaround works:
df.transpose().fillna(df.mean(axis=1)).transpose()
I'm not sure about the efficiency of this solution, though.
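One way to check the efficiency question is to time the approaches from this thread on a larger frame. The sketch below uses made-up random data (1,000 rows by 20 columns, roughly 10% NaN) and hypothetical function names; the numbers it prints depend on the machine and the pandas version, so treat them as a rough comparison only.

```python
import timeit

import numpy as np
import pandas as pd

# Made-up test data: 1,000 rows x 20 columns with roughly 10% NaNs.
rng = np.random.default_rng(0)
values = rng.random((1000, 20))
values[rng.random(values.shape) < 0.1] = np.nan
big = pd.DataFrame(values)

def via_transpose(frame):
    # fillna on the transpose, then transpose back
    return frame.T.fillna(frame.mean(axis=1)).T

def via_apply(frame):
    # row-wise apply with a lambda
    return frame.apply(lambda row: row.fillna(row.mean()), axis=1)

def via_numpy(frame):
    # indicator matrix multiplied by the broadcast row averages
    row_avgs = frame.mean(axis=1).values.reshape(-1, 1)
    return frame.fillna(0) + frame.isna().values * row_avgs

for func in (via_transpose, via_apply, via_numpy):
    seconds = timeit.timeit(lambda: func(big), number=3)
    print(f"{func.__name__}: {seconds / 3:.4f} s per call")
```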
Answer by ALollz
You can broadcast the mean to a DataFrame with the same index as the original and then use update with overwrite=False to get the behavior of .fillna. Unlike .fillna, update allows for filling when the indices have duplicated labels. It should be faster than looping over .fillna for fewer than 50,000 rows or so.
fill = pd.DataFrame(np.broadcast_to(df.mean(1).to_numpy()[:, None], df.shape),
                    columns=df.columns,
                    index=df.index)
df.update(fill, overwrite=False)
print(df)
     1    1    1
0  1.0  4.0  7.0
0  2.0  5.0  3.5
0  3.0  6.0  9.0

