pandas: How to use groupby transform across multiple columns
Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/34099684/
Asked by Hillary Sanders
I have a big dataframe, and I'm grouping by one to n columns, and want to apply a function on these groups across two columns (e.g. foo and bar).
Here's an example dataframe:
import numpy as np
import pandas as pd

# sum of a + b over all rows of a group
foo_function = lambda x: np.sum(x.a + x.b)
df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [1, 2, 3, 4, 5, 6],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w'],
                   'd': ['z', 'z', 'z', 'o', 'o', 'o']})
# works with apply, but I want transform:
df.groupby(['c', 'd'])[['a', 'b']].apply(foo_function)
# transform doesn't work!
df.groupby(['c', 'd'])[['a', 'b']].transform(foo_function)
TypeError: cannot concatenate a non-NDFrame object
But transform apparently isn't able to combine multiple columns together because it looks at each column separately (unlike apply). What is the next best alternative in terms of speed / elegance? E.g. I could use apply and then create df['new_col'] by using pd.match, but that would sometimes necessitate matching over multiple groupby columns (col1 and col2), which seems really hacky / would take a fair amount of code.
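For what it's worth, the apply-then-match route described above can stay fairly short if the per-group result is joined back on the group keys. A minimal sketch, assuming the example dataframe from the question; the names keys, totals and result are only illustrative, and DataFrame.join stands in here for the pd.match step:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [1, 2, 3, 4, 5, 6],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w'],
                   'd': ['z', 'z', 'z', 'o', 'o', 'o']})

keys = ['c', 'd']  # illustrative name for the groupby columns

# One value per group, indexed by the (c, d) group keys.
totals = (df.groupby(keys)[['a', 'b']]
            .apply(lambda g: (g['a'] + g['b']).sum())
            .rename('new_col'))

# Broadcast the per-group value back to every row by joining on the keys.
result = df.join(totals, on=keys)
print(result)
```

This handles any number of groupby columns without extra code, although it still pays the cost of apply on large frames.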
--> Is there a function that is like groupby().transform that can use functions that work over multiple columns? If this doesn't exist, what's the best hack?
Accepted answer by unutbu
Circa Pandas version 0.18, it appears the original answer (below) no longer works.
Instead, if you need to do a groupby computation across multiple columns, do the multi-column computation first, and then the groupby:
df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [1, 2, 3, 4, 5, 6],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w'],
                   'd': ['z', 'z', 'z', 'o', 'o', 'o']})
# multi-column computation first ...
df['e'] = df['a'] + df['b']
# ... then the groupby
df['e'] = df.groupby(['c', 'd'])['e'].transform('sum')
print(df)
yields
   a  b  c  d   e
0  1  1  q  z  12
1  2  2  q  z  12
2  3  3  q  z  12
3  4  4  q  o   8
4  5  5  w  o  22
5  6  6  w  o  22
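The same compute-first pattern seems to extend to other multi-column functions. As a sketch (not part of the original answer), here is a per-group weighted mean of a with weights b, built from a row-wise product and two transform('sum') calls. The column name wmean and the variables keys and weighted are assumptions made for illustration:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [1, 2, 3, 4, 5, 6],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w'],
                   'd': ['z', 'z', 'z', 'o', 'o', 'o']})

# Group keys passed as Series so that derived Series can be grouped too.
keys = [df['c'], df['d']]

# Step 1: row-wise multi-column computation (no groupby involved yet).
weighted = df['a'] * df['b']

# Step 2: group-wise sums, broadcast back to the original rows.
df['wmean'] = (weighted.groupby(keys).transform('sum')
               / df['b'].groupby(keys).transform('sum'))
print(df)
```

Because both steps use built-in vectorized operations, no Python function is called per group.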
Original answer:
The error message:
TypeError: cannot concatenate a non-NDFrame object
suggests that in order to concatenate, the foo_function should return an NDFrame (such as a Series or DataFrame). If you return a Series, then:
In [99]: df.groupby(['c', 'd']).transform(lambda x: pd.Series(np.sum(x['a']+x['b'])))
Out[99]:
    a   b
0  12  12
1  12  12
2  12  12
3   8   8
4  22  22
5  22  22
Answer by Victor Chubukov
The way I read the question, you want to be able to do something arbitrary with the individual values from both columns. You just need to make sure to return a dataframe of the same size as the one you get passed in. I think the best way is to just make a new column, like this:
df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [1, 2, 3, 4, 5, 6],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w'],
                   'd': ['z', 'z', 'z', 'o', 'o', 'o']})
df['e'] = 0
def f(x):
    # return a DataFrame with the same number of rows as x
    y = (x['a'] + x['b']) / sum(x['b'])
    return pd.DataFrame({'e': y, 'a': x['a'], 'b': x['b']})
df.groupby(['c', 'd']).transform(f)
   a  b         e
0  1  1  0.333333
1  2  2  0.666667
2  3  3  1.000000
3  4  4  2.000000
4  5  5  0.909091
5  6  6  1.090909
If you have a very complicated dataframe, you can pick your columns (e.g. df.groupby(['c'])[['a','b','e']].transform(f)).
This sure looks very inelegant to me, but it's still much faster than apply on large datasets.
Another alternative is to use set_index to capture all the columns you need and then pass just one column to transform.
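One way to read the set_index suggestion, sketched under my own assumptions since the answer gives no code for it: move column a into the index so that the single column b handed to transform still carries the a values with it. The names indexed, sum_a_plus_b and e are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [1, 2, 3, 4, 5, 6],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w'],
                   'd': ['z', 'z', 'z', 'o', 'o', 'o']})

# Put 'a' in the index; 'c' and 'd' stay as columns for grouping.
indexed = df.set_index('a')

def sum_a_plus_b(s):
    # s is column 'b' for one group; its index holds the matching 'a' values.
    return (s.index.to_numpy() + s.to_numpy()).sum()

# The scalar returned for each group is broadcast to that group's rows.
e = indexed.groupby(['c', 'd'])['b'].transform(sum_a_plus_b)

# transform keeps the original row order, so assign positionally.
df['e'] = e.to_numpy()
print(df)
```

With more than one extra column, set_index would take a list of columns and the function would read them from the resulting MultiIndex.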