Specifying "skip NA" when calculating the mean of a column in a data frame created by Pandas
Original question: http://stackoverflow.com/questions/25039328/
This content is from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors.
Asked by lokheart
I am learning the Pandas package by replicating the output from some of the R vignettes. For now I am using the dplyr package from R as an example:
http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
R script
planes <- group_by(hflights_df, TailNum)
delay <- summarise(planes,
  count = n(),
  dist = mean(Distance, na.rm = TRUE))
delay <- filter(delay, count > 20, dist < 2000)
Python script
planes = hflights.groupby('TailNum')
planes['Distance'].agg({'count' : 'count',
                        'dist' : 'mean'})
How can I state explicitly in Python that NA values need to be skipped?
Accepted answer by FooBar
That's a trick question, since you don't need to do that. Pandas automatically excludes NaN values from aggregation functions. Consider my df:
     b    c    d  e
a
2    2    6    1  3
2    4    8  NaN  7
2    4    4    6  3
3    5  NaN    2  6
4  NaN  NaN    4  1
5    6    2    1  8
7    3    2    4  7
9    6    1  NaN  1
9  NaN  NaN    9  3
9    3    4    6  1
The built-in count() function ignores NaN values, and so does mean(). The only case where we get NaN is when every value in the group is NaN. Then we are taking the mean of an empty set, which turns out to be NaN:
In[335]: df.groupby('a').mean()
Out[333]:
          b    c    d         e
a
2  3.333333  6.0  3.5  4.333333
3  5.000000  NaN  2.0  6.000000
4       NaN  NaN  4.0  1.000000
5  6.000000  2.0  1.0  8.000000
7  3.000000  2.0  4.0  7.000000
9  4.500000  2.5  7.5  1.666667
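To reproduce the output above, here is a minimal sketch that rebuilds the example frame by hand. The variable names `means` and `counts` are only illustrative, and the column values are copied from the table shown earlier.

```python
import numpy as np
import pandas as pd

nan = np.nan

# Rebuild the example frame; "a" is the grouping key.
df = pd.DataFrame({
    "a": [2, 2, 2, 3, 4, 5, 7, 9, 9, 9],
    "b": [2, 4, 4, 5, nan, 6, 3, 6, nan, 3],
    "c": [6, 8, 4, nan, nan, 2, 2, 1, nan, 4],
    "d": [1, nan, 6, 2, 4, 1, 4, nan, 9, 6],
    "e": [3, 7, 3, 6, 1, 8, 7, 1, 3, 1],
})

# mean() skips NaN within each group; a group with only NaN yields NaN.
means = df.groupby("a").mean()

# count() reports how many non-NaN values each mean was computed from.
counts = df.groupby("a").count()

print(means)
print(counts)
```

Comparing `means` with `counts` makes the behaviour visible: wherever `counts` is 0, the corresponding mean is NaN, and everywhere else the mean is taken over the non-NaN values only.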
Aggregate functions work in the same way:
In[340]: df.groupby('a')['b'].agg({'foo': np.mean})
Out[338]:
        foo
a
2  3.333333
3  5.000000
4       NaN
5  6.000000
7  3.000000
9  4.500000
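A note for current pandas versions: as far as I know, passing a dict of new names to agg on a single column, as in the session above, stopped working in pandas 1.0, and named aggregation (keyword arguments) is the replacement. Below is a rough sketch of the whole dplyr pipeline from the question in that style. The tiny `hflights` frame is made up for illustration, and the count threshold is scaled down from 20 to 1 so the toy data survives the filter.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real hflights data set.
hflights = pd.DataFrame({
    "TailNum": ["N1", "N1", "N1", "N2", "N2", "N3"],
    "Distance": [100.0, np.nan, 300.0, 2500.0, 3500.0, 400.0],
})

# Named aggregation: keyword = output column name, value = aggregation.
# "count" tallies non-NaN distances; "mean" skips NaN by default.
summary = hflights.groupby("TailNum")["Distance"].agg(count="count", dist="mean")

# Equivalent of filter(delay, count > 20, dist < 2000), with the
# count threshold lowered to 1 for this toy example.
delay = summary[(summary["count"] > 1) & (summary["dist"] < 2000)]

print(delay)
```

Bracket access (`summary["count"]`) is used because `summary.count` would refer to the DataFrame method rather than the column.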
Addendum: Notice that the standard DataFrame.mean API lets you control whether NaN values are included via its skipna parameter; the default is to exclude them.
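A small sketch of that skipna switch on a plain Series and DataFrame, using made-up values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Default: NaN is skipped, so the mean is taken over 1.0 and 3.0.
default_mean = s.mean()

# skipna=False: any NaN in the data propagates to the result.
strict_mean = s.mean(skipna=False)

frame = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [2.0, 4.0, 6.0]})

# Column-wise means; only "x" contains NaN, so only "x" becomes NaN.
strict_cols = frame.mean(skipna=False)

print(default_mean, strict_mean)
print(strict_cols)
```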
Answer by c-a
What FooBar said is true regarding the default behavior, but there is a very easy way to specify skipna. Here is an example that speaks for itself:
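def custom_mean(df):
    # Called once per group with that group's column; NaN propagates.
    return df.mean(skipna=False)

# group is the result of a groupby() call
group.agg({"your_col_name_to_be_aggregated": custom_mean})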
That's it! You can customize your own aggregation the way you want, and I'd expect this to be fairly efficient, but I did not dig into it.
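For a version that runs end to end, here is a minimal sketch on a made-up two-group frame. The column names "a" and "b" and the result names `strict` and `default` are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 1, 2, 2],
    "b": [1.0, np.nan, 3.0, 5.0],
})

def custom_mean(series):
    # Each group's column arrives as a Series; skipna=False makes any
    # NaN in the group turn the group's mean into NaN.
    return series.mean(skipna=False)

# Custom aggregation: group 1 contains a NaN, so its mean is NaN.
strict = df.groupby("a").agg({"b": custom_mean})

# Built-in aggregation for comparison: NaN is skipped.
default = df.groupby("a").agg({"b": "mean"})

print(strict)
print(default)
```

The two results differ only for the group that contains a NaN, which is an easy way to check that the custom function is really being used.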
It was also discussed here, but I thought I'd help spread the good news! The answer was found in the official documentation.

