Note: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/26145585/

Date: 2020-09-13 22:32:05  Source: igfitidea

Pandas aggregation ignoring NaN's

Tags: python, numpy, pandas, aggregate, nan

Asked by Zhubarb

I aggregate my Pandas dataframe, data. Specifically, I want to get the average and sum of amount by tuples of [origin and type]. For averaging and summing I tried the numpy functions below:

import numpy as np
import pandas as pd
result = data.groupby(groupbyvars).agg({'amount': [ pd.Series.sum, pd.Series.mean]}).reset_index() 

My issue is that the amount column includes NaNs, which causes the result of the above code to contain a lot of NaN averages and sums.

I know both pd.Series.sum and pd.Series.mean have skipna=True by default, so why am I still getting NaNs here?

I also tried this, which obviously did not work:

data.groupby(groupbyvars).agg({'amount': [ pd.Series.sum(skipna=True), pd.Series.mean(skipna=True)]}).reset_index() 

EDIT: Upon @Korem's suggestion, I also tried to use a partial as below:

from functools import partial
s_na_mean = partial(pd.Series.mean, skipna=True)
data.groupby(groupbyvars).agg({'amount': [np.nansum, s_na_mean]}).reset_index()

but I get this error:

error: 'functools.partial' object has no attribute '__name__'

Answered by Korem

Use numpy's nansum and nanmean:

from numpy import nansum
from numpy import nanmean
data.groupby(groupbyvars).agg({'amount': [ nansum, nanmean]}).reset_index() 
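
As a minimal sketch of the call above, the example below builds a small, made-up frame and aggregates it with nansum and nanmean. The column names origin, type, and amount follow the question; the values and the contents of groupbyvars are assumptions. Recent pandas versions may emit a FutureWarning when NumPy reducers are passed to agg, so this is worth checking against the version in use.

```python
import numpy as np
import pandas as pd

# Made-up sample data; only the column names come from the question.
data = pd.DataFrame({
    "origin": ["A", "A", "A", "B", "B"],
    "type": ["x", "x", "x", "y", "y"],
    "amount": [1.0, np.nan, 3.0, np.nan, 4.0],
})
groupbyvars = ["origin", "type"]  # assumed grouping columns

result = (
    data.groupby(groupbyvars)
    .agg({"amount": [np.nansum, np.nanmean]})
    .reset_index()
)

# Each group is reduced over its non-NaN amounts only.
print(result)
```

In this sample every group has at least one non-NaN amount; a group made up entirely of NaNs is a separate case that the sketch does not cover.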

As a workaround for older versions of numpy, and also a way to fix your last try:

When you write pd.Series.sum(skipna=True) you actually call the method instead of passing it. If you want to use it like this, you need to define a partial. So if you don't have nanmean, let's define s_na_mean and use that:

from functools import partial
s_na_mean = partial(pd.Series.mean, skipna = True)
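
The question reports that passing the partial to agg failed with "'functools.partial' object has no attribute '__name__'". One way around that, sketched here as an alternative rather than as the answerer's exact method, is to wrap the calls in ordinary named functions, which do carry a __name__. The helper names s_na_sum and s_na_mean and the sample data are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def s_na_sum(s):
    # Hypothetical helper: sum of a group's values, skipping NaNs.
    return s.sum(skipna=True)


def s_na_mean(s):
    # Hypothetical helper: mean of a group's values, skipping NaNs.
    return s.mean(skipna=True)


# Made-up sample data; only the column names come from the question.
data = pd.DataFrame({
    "origin": ["A", "A", "A", "B", "B"],
    "type": ["x", "x", "x", "y", "y"],
    "amount": [1.0, np.nan, 3.0, np.nan, 4.0],
})
groupbyvars = ["origin", "type"]  # assumed grouping columns

# Plain functions are passed uncalled, so agg can apply them per group.
result = (
    data.groupby(groupbyvars)
    .agg({"amount": [s_na_sum, s_na_mean]})
    .reset_index()
)
print(result)
```

Whether a functools.partial is accepted directly seems to depend on the pandas version, so a named function is the more conservative choice.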

Answered by Miros

It might be too late, but it may still be useful for others.

Try the apply function:

import numpy as np
import pandas as pd

def nan_agg(x):
    # Keep only the rows of this group whose amount is not NaN.
    # "not" on a whole Series raises ValueError, so use notnull() instead.
    amount = x.loc[x['amount'].notnull(), 'amount']

    res = {}
    res['nansum'] = amount.sum()
    res['nanmean'] = amount.mean()

    return pd.Series(res, index=['nansum', 'nanmean'])

result = data.groupby(groupbyvars).apply(nan_agg).reset_index()
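
To illustrate how the apply approach behaves, here is a compact, self-contained sketch on made-up data. It filters with notnull(), because applying Python's not to a whole Series raises a ValueError. The sample values, the contents of groupbyvars, and the [['amount']] column selection are assumptions added for the example.

```python
import numpy as np
import pandas as pd


def nan_agg(x):
    # Boolean mask via notnull() keeps only the non-NaN amounts of the group.
    amount = x.loc[x["amount"].notnull(), "amount"]
    return pd.Series({"nansum": amount.sum(), "nanmean": amount.mean()})


# Made-up sample data; only the column names come from the question.
data = pd.DataFrame({
    "origin": ["A", "A", "A", "B", "B"],
    "type": ["x", "x", "x", "y", "y"],
    "amount": [1.0, np.nan, 3.0, np.nan, 4.0],
})
groupbyvars = ["origin", "type"]  # assumed grouping columns

# Selecting [["amount"]] hands each group to nan_agg as a one-column DataFrame.
result = data.groupby(groupbyvars)[["amount"]].apply(nan_agg).reset_index()
print(result)
```

Because nan_agg returns a Series with the same labels for every group, the result comes back as one row per group with nansum and nanmean columns.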