Pandas aggregation ignoring NaNs
Original URL: http://stackoverflow.com/questions/26145585/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by Zhubarb
I aggregate my Pandas dataframe, data. Specifically, I want to get the average and the sum of amount by tuples of [origin and type]. For averaging and summing I tried the functions below:
import numpy as np
import pandas as pd
result = data.groupby(groupbyvars).agg({'amount': [ pd.Series.sum, pd.Series.mean]}).reset_index()
My issue is that the amount column includes NaNs, which causes the result of the above code to contain a lot of NaN averages and sums.
I know both pd.Series.sum and pd.Series.mean have skipna=True by default, so why am I still getting NaNs here?
I also tried this, which obviously did not work:
data.groupby(groupbyvars).agg({'amount': [ pd.Series.sum(skipna=True), pd.Series.mean(skipna=True)]}).reset_index()
EDIT: Upon @Korem's suggestion, I also tried to use a partial as below:
s_na_mean = partial(pd.Series.mean, skipna = True)
data.groupby(groupbyvars).agg({'amount': [ np.nansum, s_na_mean ]}).reset_index()
but I get this error:
error: 'functools.partial' object has no attribute '__name__'
Answered by Korem
Use numpy's nansum and nanmean:
from numpy import nansum
from numpy import nanmean
data.groupby(groupbyvars).agg({'amount': [ nansum, nanmean]}).reset_index()
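For comparison, here is a minimal sketch on made-up data. The frame contents and the value of groupbyvars are assumptions for illustration, not taken from the question. It uses the string aggregations 'sum' and 'mean', which also skip NaN within each group, and checks one group against np.nansum and np.nanmean applied to the raw values:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the question does not show the real frame.
data = pd.DataFrame({
    "origin": ["a", "a", "a", "a", "b", "b"],
    "type":   ["x", "x", "x", "y", "x", "x"],
    "amount": [1.0, np.nan, 3.0, 4.0, np.nan, 5.0],
})
groupbyvars = ["origin", "type"]  # assumed grouping columns

# The built-in 'sum' and 'mean' aggregations ignore NaN inside each group.
result = data.groupby(groupbyvars)["amount"].agg(["sum", "mean"]).reset_index()

# Cross-check the (a, x) group against numpy's NaN-aware functions.
ax_values = data.loc[(data["origin"] == "a") & (data["type"] == "x"), "amount"].to_numpy()
ax_sum = np.nansum(ax_values)
ax_mean = np.nanmean(ax_values)

print(result)
```

Note that a group whose amount values are all NaN has no values to average, so its mean is still NaN with either approach.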
As a workaround for older versions of numpy, and also a way to fix your last try:
When you write pd.Series.sum(skipna=True), you actually call the method instead of passing it as a function. To use it this way, you need to define a partial. So if you don't have nanmean, let's define s_na_mean and use that:
from functools import partial
s_na_mean = partial(pd.Series.mean, skipna = True)
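The error quoted in the question's EDIT suggests that the pandas version in use needed a `__name__` on each aggregation function, which a functools.partial object does not have. One possible way around that, sketched here under assumptions (the helper names na_sum and na_mean and the sample data are hypothetical), is to use ordinary named functions instead of a partial:

```python
import numpy as np
import pandas as pd

def na_sum(s):
    # Plain named function, so it carries a __name__ that agg can use as a label.
    return s.sum(skipna=True)

def na_mean(s):
    return s.mean(skipna=True)

# Hypothetical sample data; the question does not show the real frame.
data = pd.DataFrame({
    "origin": ["a", "a", "a", "a", "b", "b"],
    "type":   ["x", "x", "x", "y", "x", "x"],
    "amount": [1.0, np.nan, 3.0, 4.0, np.nan, 5.0],
})
groupbyvars = ["origin", "type"]  # assumed grouping columns

result = data.groupby(groupbyvars).agg({"amount": [na_sum, na_mean]}).reset_index()
print(result)
```

The resulting columns are two-level, for example ("amount", "na_sum"), because a list of functions was passed for the amount column.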
Answered by Miros
It might be too late, but it might still be useful for others.
Try the apply function:
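import numpy as np
import pandas as pd

def nan_agg(x):
    # Select the non-NaN amounts of this group. A boolean mask must be
    # negated with ~ or built with notnull(); `not` on a Series raises ValueError.
    valid = x.loc[x['amount'].notnull(), 'amount']
    res = {}
    res['nansum'] = valid.sum()
    res['nanmean'] = valid.mean()
    return pd.Series(res, index=['nansum', 'nanmean'])

# One row per group, with the nansum and nanmean columns.
result = data.groupby(groupbyvars).apply(nan_agg).reset_index()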
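A self-contained sketch of the same apply idea on made-up data follows. The sample frame and the value of groupbyvars are assumptions for illustration. It filters with a notnull() mask, and it selects the amount column before apply so that only that column is passed to the function:

```python
import numpy as np
import pandas as pd

def nan_agg(x):
    # Keep only the rows of this group whose amount is not NaN.
    valid = x.loc[x["amount"].notnull(), "amount"]
    return pd.Series({"nansum": valid.sum(), "nanmean": valid.mean()})

# Hypothetical sample data; the question does not show the real frame.
data = pd.DataFrame({
    "origin": ["a", "a", "a", "a", "b", "b"],
    "type":   ["x", "x", "x", "y", "x", "x"],
    "amount": [1.0, np.nan, 3.0, 4.0, np.nan, 5.0],
})
groupbyvars = ["origin", "type"]  # assumed grouping columns

# Each group is handed to nan_agg as a small DataFrame with an amount column.
result = data.groupby(groupbyvars)[["amount"]].apply(nan_agg).reset_index()
print(result)
```

Because apply calls a Python function once per group, this is likely slower on large frames than the built-in aggregations; it is mainly useful when the per-group logic does not fit a single aggregation.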

