Original source: http://stackoverflow.com/questions/19615760/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Groupby with User Defined Functions Pandas
Asked by Woody Pride
I understand that passing a function as a group key calls the function once per index value with the return values being used as the group names. What I can't figure out is how to call the function on column values.
So I can do this:
import numpy as np
import pandas as pd

people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

def GroupFunc(x):
    # x is a single index label, e.g. 'Joe'
    if len(x) > 3:
        return 'Group1'
    else:
        return 'Group2'

people.groupby(GroupFunc).sum()
This splits the data into two groups: one whose index values have length greater than 3, and the other whose index values have length 3 or less. But how can I pass one of the column values instead? For example, grouping on whether the value in column a is greater than 1 at each index label. I realise I could just do the following:
people.groupby(people.a > 1).sum()
But I want to know how to do this in a user defined function for future reference.
Something like:
def GroupColFunc(x):
    if x > 1:
        return 'Group1'
    else:
        return 'Group2'
But how do I call this? I tried
people.groupby(GroupColFunc(people.a))
and similar variants but this does not work.
How do I pass the column values to the function? And how would I pass multiple column values, for example to group on whether people.a > people.b?
Answered by Roman Pekar
To group by a > 1, you can define your function like:
>>> def GroupColFunc(df, ind, col):
...     if df[col].loc[ind] > 1:
...         return 'Group1'
...     else:
...         return 'Group2'
...
And then call it like this:
>>> people.groupby(lambda x: GroupColFunc(people, x, 'a')).sum()
               a         b         c         d        e
Group2 -2.384614 -0.762208  3.359299 -1.574938 -2.65963
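Only Group2 appears in the output above, which means none of the random values in column a exceeded 1. As a minimal sketch with fixed values, the same index-lookup approach can be made reproducible so that both groups show up. The numbers and the name `group_by_col` are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Hypothetical fixed data (not from the original post) so the result is reproducible
people = pd.DataFrame(
    {"a": [1.5, 0.5, 2.0, -0.25, 0.75], "b": [0.5, 1.0, 3.0, 0.25, -1.0]},
    index=["Joe", "Steve", "Wes", "Jim", "Travis"],
)

def group_by_col(df, label, col):
    """Return a group name based on the value of `col` at index `label`."""
    return "Group1" if df.loc[label, col] > 1 else "Group2"

# groupby calls the lambda once per index label ('Joe', 'Steve', ...),
# and the lambda looks up the column value for that label
result = people.groupby(lambda label: group_by_col(people, label, "a")).sum()
print(result)
```

Here Joe and Wes fall into Group1 because their a values are above 1, and the remaining rows fall into Group2.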
Or you can do it with just an anonymous function:
>>> people.groupby(lambda x: 'Group1' if people['b'].loc[x] > people['a'].loc[x] else 'Group2').sum()
               a         b         c         d         e
Group1 -3.280319 -0.007196  1.525356  0.324154 -1.002439
Group2  0.895705 -0.755012  1.833943 -1.899092 -1.657191
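An alternative sketch, not part of the original answer: the user-defined function can take a whole row, so that several columns are available at once. `DataFrame.apply` with `axis=1` builds a Series of group labels, which is then passed to `groupby`. The data and the name `compare_cols` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical fixed data (not from the original post)
people = pd.DataFrame(
    {"a": [1.5, 0.5, 2.0, -0.25, 0.75], "b": [0.5, 1.0, 3.0, 0.25, -1.0]},
    index=["Joe", "Steve", "Wes", "Jim", "Travis"],
)

def compare_cols(row):
    """Receive one row as a Series and return its group name."""
    return "Group1" if row["b"] > row["a"] else "Group2"

# axis=1 calls compare_cols once per row; the result is a Series
# indexed like `people`, mapping each label to a group name
labels = people.apply(compare_cols, axis=1)
result = people.groupby(labels).sum()
print(result)
```

Row-wise `apply` runs Python code for every row, so on large frames a vectorised comparison is usually faster.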
As noted in the documentation, you can also group by passing a Series that provides a label -> group name mapping:
>>> mapping = np.where(people['b'] > people['a'], 'Group1', 'Group2')
>>> mapping
Joe       Group2
Steve     Group1
Wes       Group2
Jim       Group1
Travis    Group1
dtype: string48
>>> people.groupby(mapping).sum()
               a         b         c         d         e
Group1 -3.280319 -0.007196  1.525356  0.324154 -1.002439
Group2  0.895705 -0.755012  1.833943 -1.899092 -1.657191
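A note for recent library versions, offered as a sketch rather than as the answer's own method: `np.where` now returns a plain NumPy array instead of a Series. `groupby` accepts such an array when its length matches the number of rows, and it can be wrapped in a Series to keep the index labels. For a single column, the scalar function from the question can be applied element-wise with `Series.map`. The data and the names `group_col_func`, `labels_a`, and `raw` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed data (not from the original post)
people = pd.DataFrame(
    {"a": [1.5, 0.5, 2.0, -0.25, 0.75], "b": [0.5, 1.0, 3.0, 0.25, -1.0]},
    index=["Joe", "Steve", "Wes", "Jim", "Travis"],
)

def group_col_func(x):
    """Scalar function: x is a single value from a column."""
    return "Group1" if x > 1 else "Group2"

# Element-wise application to column 'a'; rename avoids a clash with column 'a'
labels_a = people["a"].map(group_col_func).rename("group")
by_a = people.groupby(labels_a).sum()

# np.where yields an ndarray of group names, one per row
raw = np.where(people["b"] > people["a"], "Group1", "Group2")
by_array = people.groupby(raw).sum()

# Wrapping the array in a Series gives an explicit label -> group name mapping
mapping = pd.Series(raw, index=people.index)
by_series = people.groupby(mapping).sum()

print(by_a)
print(by_series)
```

Grouping by the array and grouping by the Series give the same sums here because both follow the row order of `people`.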

