Python: specifying "skip NA" when calculating the mean of a column in a data frame created by Pandas

Question

Asked by lokheart

I am learning the Pandas package by replicating the output from some of the R vignettes. Now I am using the dplyr package from R as an example:

http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

R script

planes <- group_by(hflights_df, TailNum)
delay <- summarise(planes,
  count = n(),
  dist = mean(Distance, na.rm = TRUE))
delay <- filter(delay, count > 20, dist < 2000)

Python script

planes = hflights.groupby('TailNum')
planes['Distance'].agg({'count' : 'count',
                        'dist' : 'mean'})

How can I state explicitly in Python that NA needs to be skipped?

Answer 1

Accepted answer by FooBar

That's a trick question, since you don't do that. Pandas will automatically exclude NaN values from aggregation functions. Consider my df:

    b   c   d  e
a               
2   2   6   1  3
2   4   8 NaN  7
2   4   4   6  3
3   5 NaN   2  6
4 NaN NaN   4  1
5   6   2   1  8
7   3   2   4  7
9   6   1 NaN  1
9 NaN NaN   9  3
9   3   4   6  1

The internal count() function will ignore NaN values, and so will mean(). The only point where we get NaN is when every value in the group is NaN. Then we take the mean of an empty set, which turns out to be NaN:

In[335]: df.groupby('a').mean()
Out[333]: 
          b    c    d         e
a                              
2  3.333333  6.0  3.5  4.333333
3  5.000000  NaN  2.0  6.000000
4       NaN  NaN  4.0  1.000000
5  6.000000  2.0  1.0  8.000000
7  3.000000  2.0  4.0  7.000000
9  4.500000  2.5  7.5  1.666667
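
For readers who want to reproduce the result above, here is a minimal sketch that rebuilds the example frame by hand (the literal values are copied from the printed table; the variable names `counts` and `means` are my own) and checks that both count() and mean() skip NaN:

```python
import numpy as np
import pandas as pd

nan = np.nan

# Values copied from the table printed in this answer; "a" is the index.
df = pd.DataFrame(
    {
        "a": [2, 2, 2, 3, 4, 5, 7, 9, 9, 9],
        "b": [2, 4, 4, 5, nan, 6, 3, 6, nan, 3],
        "c": [6, 8, 4, nan, nan, 2, 2, 1, nan, 4],
        "d": [1, nan, 6, 2, 4, 1, 4, nan, 9, 6],
        "e": [3, 7, 3, 6, 1, 8, 7, 1, 3, 1],
    }
).set_index("a")

grouped = df.groupby("a")

# count() reports the number of non-NaN values per group and column.
counts = grouped.count()

# mean() averages only the non-NaN values; an all-NaN group yields NaN.
means = grouped.mean()

print(counts)
print(means)
```

In group 2, column d holds 1, NaN and 6, so its count is 2 and its mean is 3.5, which matches the output shown above.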

Aggregate functions work in the same way:

In[340]: df.groupby('a')['b'].agg({'foo': np.mean})
Out[338]: 
        foo
a          
2  3.333333
3  5.000000
4       NaN
5  6.000000
7  3.000000
9  4.500000
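
A note for current pandas versions: as far as I know, passing a dict of new names to agg on a single column, as in the session above, was deprecated and raises an error from pandas 1.0 onward. The sketch below uses named aggregation instead to mirror the R pipeline from the question. The `flights` frame is a tiny hypothetical stand-in for hflights, and `min_count` is an illustrative threshold replacing the question's 20. It also uses "size" rather than "count", on the assumption that you want the behavior of R's n(), which counts rows including those with missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the hflights data set used in the question.
flights = pd.DataFrame(
    {
        "TailNum": ["N1", "N1", "N1", "N2", "N2", "N3"],
        "Distance": [100.0, np.nan, 300.0, 2500.0, 2700.0, np.nan],
    }
)

# Named aggregation: keyword = (column, aggregation function).
delay = flights.groupby("TailNum").agg(
    count=("Distance", "size"),  # rows per group, NaN rows included (like n())
    dist=("Distance", "mean"),   # NaN skipped by default (like na.rm = TRUE)
)

# Illustrative threshold; the question filters on count > 20.
min_count = 1
delay = delay[(delay["count"] > min_count) & (delay["dist"] < 2000)]

print(delay)
```

With this sample data only N1 survives the filter: it has three rows and a mean distance of 200 computed from its two non-missing values.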

Addendum: Notice that the standard DataFrame.mean API lets you control the inclusion of NaN values through its skipna parameter; the default is to exclude them.
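
As a small illustration of that parameter, the following sketch compares the default behavior with skipna=False on a made-up two-column frame (the data and the names `default_mean` and `strict_mean` are assumptions for the example, not part of the original answer):

```python
import numpy as np
import pandas as pd

# Made-up data: column "b" contains one missing value, column "e" none.
df = pd.DataFrame({"b": [2.0, 4.0, np.nan], "e": [3.0, 7.0, 3.0]})

# Default (skipna=True): NaN is ignored, so "b" averages 2.0 and 4.0.
default_mean = df.mean()

# skipna=False: any NaN in a column makes that column's mean NaN.
strict_mean = df.mean(skipna=False)

print(default_mean)
print(strict_mean)
```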

Answer 2

Answered by c-a

What FooBar said is true with regard to the default behavior, but there is a very easy way to specify skipna. Here is an example that speaks for itself:

def custom_mean(df):
    return df.mean(skipna=False)

group.agg({"your_col_name_to_be_aggregated":custom_mean})

That's it! You can customize your own aggregation the way you want, and I'd expect this to be fairly efficient, but I did not dig into it.
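
To make the snippet above runnable end to end, here is a minimal sketch on a made-up frame (the data, the column names and the `result` variable are assumptions for illustration). Each group's values reach the custom function as a Series, so Series.mean(skipna=False) applies:

```python
import numpy as np
import pandas as pd

# Made-up data: group 2 contains a missing value, group 3 does not.
df = pd.DataFrame({"a": [2, 2, 3, 3], "b": [2.0, np.nan, 5.0, 7.0]})


def custom_mean(values):
    # values is the Series of one group's entries for the aggregated column.
    return values.mean(skipna=False)


result = df.groupby("a").agg({"b": custom_mean})

print(result)
```

Group 2 now yields NaN because one of its values is missing, while group 3 yields 6.0. Since this goes through a Python-level function per group, it is likely slower than the built-in "mean" on large data, though I have not measured it.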

It was also discussed here, but I thought I'd help spread the good news! The answer was found in the official documentation.
