Original source: http://stackoverflow.com/questions/14529838/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Apply multiple functions to multiple groupby columns
Asked by beardc
The docs show how to apply multiple functions on a groupby object at a time using a dict with the output column names as the keys:
In [563]: grouped['D'].agg({'result1' : np.sum,
   .....:                   'result2' : np.mean})
   .....:
Out[563]:
      result2   result1
A
bar -0.579846 -1.739537
foo -0.280588 -1.402938
However, this only works on a Series groupby object. And when a dict is similarly passed to a groupby DataFrame, it expects the keys to be the column names that the function will be applied to.
What I want to do is apply multiple functions to several columns (but certain columns will be operated on multiple times). Also, some functions will depend on other columns in the groupby object (like sumif functions). My current solution is to go column by column and do something like the code above, using lambdas for functions that depend on other rows. But this is taking a long time (I think it takes a long time to iterate through a groupby object). I'll have to change it so that I iterate through the whole groupby object in a single run, but I'm wondering if there's a built-in way in pandas to do this somewhat cleanly.
For example, I've tried something like
grouped.agg({'C_sum' : lambda x: x['C'].sum(),
             'C_std': lambda x: x['C'].std(),
             'D_sum' : lambda x: x['D'].sum(),
             'D_sumifC3': lambda x: x['D'][x['C'] == 3].sum(), ...})
but as expected I get a KeyError (since the keys have to be a column if agg is called from a DataFrame).
Is there any built in way to do what I'd like to do, or a possibility that this functionality may be added, or will I just need to iterate through the groupby manually?
Thanks
Accepted answer by Ted Petrou
The second half of the currently accepted answer is outdated and has two deprecations. First and most important, you can no longer pass a dictionary of dictionaries to the agg groupby method. Second, never use .ix.
If you want to work with two separate columns at the same time, I would suggest using the apply method, which implicitly passes a DataFrame to the applied function. Let's use a dataframe similar to the one from above:
df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
df
          a         b         c         d  group
0  0.418500  0.030955  0.874869  0.145641      0
1  0.446069  0.901153  0.095052  0.487040      0
2  0.843026  0.936169  0.926090  0.041722      1
3  0.635846  0.439175  0.828787  0.714123      1
A dictionary mapped from column names to aggregation functions is still a perfectly good way to perform an aggregation.
df.groupby('group').agg({'a':['sum', 'max'],
                         'b':'mean',
                         'c':'sum',
                         'd': lambda x: x.max() - x.min()})
              a                   b         c         d
            sum       max      mean       sum  <lambda>
group
0      0.864569  0.446069  0.466054  0.969921  0.341399
1      1.478872  0.843026  0.687672  1.754877  0.672401
If you don't like that ugly lambda column name, you can use a normal function and supply a custom name to the special __name__ attribute like this:
def max_min(x):
    return x.max() - x.min()

max_min.__name__ = 'Max minus Min'

df.groupby('group').agg({'a':['sum', 'max'],
                         'b':'mean',
                         'c':'sum',
                         'd': max_min})
              a                   b         c             d
            sum       max      mean       sum Max minus Min
group
0      0.864569  0.446069  0.466054  0.969921      0.341399
1      1.478872  0.843026  0.687672  1.754877      0.672401
Using apply and returning a Series
Now, if you have multiple columns that need to interact, you cannot use agg, which implicitly passes a Series to the aggregating function. When using apply, the entire group is passed to the function as a DataFrame.
I recommend making a single custom function that returns a Series of all the aggregations. Use the Series index as labels for the new columns:
def f(x):
    d = {}
    d['a_sum'] = x['a'].sum()
    d['a_max'] = x['a'].max()
    d['b_mean'] = x['b'].mean()
    d['c_d_prodsum'] = (x['c'] * x['d']).sum()
    return pd.Series(d, index=['a_sum', 'a_max', 'b_mean', 'c_d_prodsum'])

df.groupby('group').apply(f)
          a_sum     a_max    b_mean  c_d_prodsum
group
0      0.864569  0.446069  0.466054     0.173711
1      1.478872  0.843026  0.687672     0.630494
If you are in love with MultiIndexes, you can still return a Series with one like this:
def f_mi(x):
    d = []
    d.append(x['a'].sum())
    d.append(x['a'].max())
    d.append(x['b'].mean())
    d.append((x['c'] * x['d']).sum())
    return pd.Series(d, index=[['a', 'a', 'b', 'c_d'],
                               ['sum', 'max', 'mean', 'prodsum']])

df.groupby('group').apply(f_mi)
              a                   b       c_d
            sum       max      mean   prodsum
group
0      0.864569  0.446069  0.466054  0.173711
1      1.478872  0.843026  0.687672  0.630494
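The conditional sum from the question (D_sumifC3) fits the same apply-and-return-a-Series pattern. The following is a minimal sketch, not code from the thread: the data values are made up, the column names A, C and D only mirror the question, and the helper name summarize is hypothetical. Selecting the value columns after groupby keeps the grouping column out of the frame handed to the function.

```python
import pandas as pd

# Made-up data; column names A, C, D mirror the question.
df = pd.DataFrame({
    'A': ['bar', 'bar', 'foo', 'foo', 'foo'],
    'C': [3, 1, 3, 3, 2],
    'D': [1.0, 2.0, 3.0, 4.0, 5.0],
})

def summarize(g):
    """Hypothetical helper: g is the sub-DataFrame for one group."""
    return pd.Series({
        'C_sum': g['C'].sum(),
        'D_sum': g['D'].sum(),
        # "sumif": sum D only over the rows of this group where C == 3
        'D_sumifC3': g.loc[g['C'] == 3, 'D'].sum(),
    })

# One pass over the groups; each key of the Series becomes a column.
result = df.groupby('A')[['C', 'D']].apply(summarize)
print(result)
```

Because the function receives the whole group, any number of columns can take part in a single aggregation, at the cost of running Python code once per group.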
Answer by Zelazny7
For the first part you can pass a dict of column names for keys and a list of functions for the values:
In [28]: df
Out[28]:
          A         B         C         D         E  GRP
0  0.395670  0.219560  0.600644  0.613445  0.242893    0
1  0.323911  0.464584  0.107215  0.204072  0.927325    0
2  0.321358  0.076037  0.166946  0.439661  0.914612    1
3  0.133466  0.447946  0.014815  0.130781  0.268290    1

In [26]: f = {'A':['sum','mean'], 'B':['prod']}

In [27]: df.groupby('GRP').agg(f)
Out[27]:
            A                   B
          sum      mean      prod
GRP
0    0.719580  0.359790  0.102004
1    0.454824  0.227412  0.034060
UPDATE 1:
Because the aggregate function works on Series, references to the other column names are lost. To get around this, you can reference the full dataframe and index it using the group indices within the lambda function.
Here's a hacky workaround:
In [67]: f = {'A':['sum','mean'], 'B':['prod'], 'D': lambda g: df.loc[g.index].E.sum()}

In [69]: df.groupby('GRP').agg(f)
Out[69]:
            A                   B         D
          sum      mean      prod  <lambda>
GRP
0    0.719580  0.359790  0.102004  1.170219
1    0.454824  0.227412  0.034060  1.182901
Here, the resultant 'D' column is made up of the summed 'E' values.
UPDATE 2:
Here's a method that I think will do everything you ask. First make a custom lambda function. Below, g references the group. When aggregating, g will be a Series. Passing g.index to df.loc[] selects the current group from df. I then test if column C is less than 0.5. The returned boolean series is passed to g[], which selects only those rows meeting the criteria.
In [95]: cust = lambda g: g[df.loc[g.index]['C'] < 0.5].sum()

In [96]: f = {'A':['sum','mean'], 'B':['prod'], 'D': {'my name': cust}}

In [97]: df.groupby('GRP').agg(f)
Out[97]:
            A                   B         D
          sum      mean      prod   my name
GRP
0    0.719580  0.359790  0.102004  0.204072
1    0.454824  0.227412  0.034060  0.570441
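On pandas versions that reject a nested dict used for renaming, a similar result can be sketched with keyword-style named aggregation (pandas >= 0.25). This is an illustration under assumptions rather than the answer author's code: the data values and the output names A_sum and D_sum_where_C_small are made up.

```python
import pandas as pd

# Made-up data with the same column roles as the example above.
df = pd.DataFrame({
    'GRP': [0, 0, 1, 1],
    'A': [1.0, 2.0, 3.0, 4.0],
    'C': [0.6, 0.1, 0.2, 0.01],
    'D': [1.0, 2.0, 3.0, 4.0],
})

# g is the 'D' Series of one group. Its index still points into df,
# so df.loc[g.index, 'C'] recovers the matching 'C' values.
cust = lambda g: g[df.loc[g.index, 'C'] < 0.5].sum()

result = df.groupby('GRP').agg(
    A_sum=('A', 'sum'),
    D_sum_where_C_small=('D', cust),  # output name chosen via keyword
)
print(result)
```

The lambda still reaches back into the outer df, so this keeps the same hacky character as the workaround above; only the renaming mechanism changes.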
Answer by campo
Ted's answer is amazing. I ended up using a smaller version of it, in case anyone is interested. It is useful when you are looking for one aggregation that depends on values from multiple columns:
create a dataframe
df=pd.DataFrame({'a': [1,2,3,4,5,6], 'b': [1,1,0,1,1,0], 'c': ['x','x','y','y','z','z']})
   a  b  c
0  1  1  x
1  2  1  x
2  3  0  y
3  4  1  y
4  5  1  z
5  6  0  z
grouping and aggregating with apply (using multiple columns)
df.groupby('c').apply(lambda x: x['a'][(x['a']>1) & (x['b']==1)].mean())
c
x    2.0
y    4.0
z    5.0
grouping and aggregating with aggregate (using multiple columns)
I like this approach since I can still use aggregate. Perhaps people will let me know why apply is needed for getting at multiple columns when doing aggregations on groups.
It seems obvious now, but as long as you don't select the column of interest directly after the groupby, you will have access to all the columns of the dataframe from within your aggregation function.
only access to the selected column
df.groupby('c')['a'].aggregate(lambda x: x[x>1].mean())
access to all columns since selection is after all the magic
df.groupby('c').aggregate(lambda x: x[(x['a']>1) & (x['b']==1)].mean())['a']
or similarly
df.groupby('c').aggregate(lambda x: x['a'][(x['a']>1) & (x['b']==1)].mean())
I hope this helps.
Answer by r2evans
As an alternative (mostly on aesthetics) to Ted Petrou's answer, I found I preferred a slightly more compact listing. Please don't consider accepting it; it's just a much-more-detailed comment on Ted's answer, plus code/data. Python/pandas is not my first/best, but I found this to read well:
df.groupby('group') \
  .apply(lambda x: pd.Series({
      'a_sum'       : x['a'].sum(),
      'a_max'       : x['a'].max(),
      'b_mean'      : x['b'].mean(),
      'c_d_prodsum' : (x['c'] * x['d']).sum()
  })
)
          a_sum     a_max    b_mean  c_d_prodsum
group
0      0.530559  0.374540  0.553354     0.488525
1      1.433558  0.832443  0.460206     0.053313
I find it more reminiscent of dplyr pipes and data.table chained commands. Not to say they're better, just more familiar to me. (I certainly recognize the power and, for many, the preference of using more formalized def functions for these types of operations. This is just an alternative, not necessarily better.)
I generated the data in the same manner as Ted, adding a seed for reproducibility.
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
df
          a         b         c         d  group
0  0.374540  0.950714  0.731994  0.598658      0
1  0.156019  0.155995  0.058084  0.866176      0
2  0.601115  0.708073  0.020584  0.969910      1
3  0.832443  0.212339  0.181825  0.183405      1
Answer by Erfan
Pandas >= 0.25.0, named aggregations
Since pandas version 0.25.0, we are moving away from the dictionary-based aggregation and renaming, and moving towards named aggregations, which accept a tuple. Now we can simultaneously aggregate and rename to a more informative column name:
Example:
df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
          a         b         c         d  group
0  0.521279  0.914988  0.054057  0.125668      0
1  0.426058  0.828890  0.784093  0.446211      0
2  0.363136  0.843751  0.184967  0.467351      1
3  0.241012  0.470053  0.358018  0.525032      1
Apply GroupBy.agg with named aggregation:
df.groupby('group').agg(
    a_sum=('a', 'sum'),
    a_mean=('a', 'mean'),
    b_mean=('b', 'mean'),
    c_sum=('c', 'sum'),
    d_range=('d', lambda x: x.max() - x.min())
)
          a_sum    a_mean    b_mean     c_sum   d_range
group
0      0.947337  0.473668  0.871939  0.838150  0.320543
1      0.604149  0.302074  0.656902  0.542985  0.057681
Answer by exan
To support column-specific aggregation with control over the output column names, pandas accepts a special syntax in GroupBy.agg(), known as “named aggregation”, where
- The keywords are the output column names
- The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column. Pandas provides the pandas.NamedAgg namedtuple with the fields ['column', 'aggfunc'] to make it clearer what the arguments are. As usual, the aggregation can be a callable or a string alias.
In [79]: animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
   ....:                         'height': [9.1, 6.0, 9.5, 34.0],
   ....:                         'weight': [7.9, 7.5, 9.9, 198.0]})
   ....:

In [80]: animals
Out[80]:
  kind  height  weight
0  cat     9.1     7.9
1  dog     6.0     7.5
2  cat     9.5     9.9
3  dog    34.0   198.0

In [81]: animals.groupby("kind").agg(
   ....:     min_height=pd.NamedAgg(column='height', aggfunc='min'),
   ....:     max_height=pd.NamedAgg(column='height', aggfunc='max'),
   ....:     average_weight=pd.NamedAgg(column='weight', aggfunc=np.mean),
   ....: )
   ....:
Out[81]:
      min_height  max_height  average_weight
kind
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75
pandas.NamedAgg is just a namedtuple. Plain tuples are allowed as well.
In [82]: animals.groupby("kind").agg(
   ....:     min_height=('height', 'min'),
   ....:     max_height=('height', 'max'),
   ....:     average_weight=('weight', np.mean),
   ....: )
   ....:
Out[82]:
      min_height  max_height  average_weight
kind
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75
Additional keyword arguments are not passed through to the aggregation functions. Only pairs of (column, aggfunc) should be passed as **kwargs. If your aggregation function requires additional arguments, partially apply them with functools.partial().
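As a minimal sketch of the functools.partial() suggestion, assuming the same animals frame: the helper quantile_of and the output name weight_q75 are hypothetical names introduced only for illustration. The extra argument q is bound before the function is handed to agg, so agg sees a one-argument callable.

```python
from functools import partial

import pandas as pd

animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
                        'height': [9.1, 6.0, 9.5, 34.0],
                        'weight': [7.9, 7.5, 9.9, 198.0]})

def quantile_of(series, q):
    """Hypothetical helper: q-th quantile of one group's values."""
    return series.quantile(q)

result = animals.groupby('kind').agg(
    min_height=('height', 'min'),
    # q cannot be passed as a keyword to agg here, so bind it up front.
    weight_q75=('weight', partial(quantile_of, q=0.75)),
)
print(result)
```

A lambda such as `lambda s: s.quantile(0.75)` would serve the same purpose; partial is simply the route the text above names.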
Named aggregation is also valid for Series groupby aggregations. In this case there's no column selection, so the values are just the functions.
In [84]: animals.groupby("kind").height.agg(
   ....:     min_height='min',
   ....:     max_height='max',
   ....: )
   ....:
Out[84]:
      min_height  max_height
kind
cat          9.1         9.5
dog          6.0        34.0

