Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must comply with the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/34099684/

Time: 2020-09-14 00:19:56  Source: igfitidea

How to use groupby transform across multiple columns

Tags: python, pandas

Asked by Hillary Sanders

I have a big dataframe, and I'm grouping by one to n columns, and want to apply a function on these groups across two columns (e.g. foo and bar).

Here's an example dataframe:

import numpy as np
import pandas as pd

# sums columns a and b over every row of a group
foo_function = lambda x: np.sum(x.a + x.b)

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [1, 2, 3, 4, 5, 6],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w'],
                   'd': ['z', 'z', 'z', 'o', 'o', 'o']})

# works with apply, but I want transform:
df.groupby(['c', 'd'])[['a', 'b']].apply(foo_function)
# transform doesn't work!
df.groupby(['c', 'd'])[['a', 'b']].transform(foo_function)
# TypeError: cannot concatenate a non-NDFrame object

But transform apparently isn't able to combine multiple columns together because it looks at each column separately (unlike apply). What is the next best alternative in terms of speed / elegance? E.g. I could use apply and then create df['new_col'] by using pd.match, but that would necessitate matching over sometimes multiple groupby columns (col1 and col2), which seems really hacky / would take a fair amount of code.

--> Is there a function that is like groupby().transform that can use functions that work over multiple columns? If this doesn't exist, what's the best hack?

Accepted answer by unutbu

Circa Pandas version 0.18, it appears the original answer (below) no longer works.

Instead, if you need to do a groupby computation across multiple columns, do the multi-column computation first, and then the groupby:

df = pd.DataFrame({'a':[1,2,3,4,5,6],
                   'b':[1,2,3,4,5,6],
                   'c':['q', 'q', 'q', 'q', 'w', 'w'],  
                   'd':['z','z','z','o','o','o']})
df['e'] = df['a'] + df['b']
df['e'] = (df.groupby(['c', 'd'])['e'].transform('sum'))
print(df)

yields

   a  b  c  d   e
0  1  1  q  z  12
1  2  2  q  z  12
2  3  3  q  z  12
3  4  4  q  o   8
4  5  5  w  o  22
5  6  6  w  o  22
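
When the per-group function cannot be split into a row-wise step followed by a single-column aggregation, one possible fallback is to compute one value per group with apply and then join it back onto the rows by the group keys. The following is only a sketch and not part of the original answer; the helper name group_total and the column name e are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [1, 2, 3, 4, 5, 6],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w'],
                   'd': ['z', 'z', 'z', 'o', 'o', 'o']})


def group_total(g):
    # Hypothetical helper: any scalar-valued function of several columns.
    return (g['a'] + g['b']).sum()


# One scalar per (c, d) group, indexed by the group keys.
per_group = df.groupby(['c', 'd'])[['a', 'b']].apply(group_total).rename('e')

# Broadcast the per-group value back to every original row.
result = df.join(per_group, on=['c', 'd'])
print(result)
```

Because join matches on the group key columns directly, this works the same way for one or several groupby columns, although it is likely slower than the transform('sum') approach on large data.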


Original answer:

The error message:

TypeError: cannot concatenate a non-NDFrame object

suggests that in order to concatenate, the foo_function should return an NDFrame (such as a Series or DataFrame). If you return a Series, then:

In [99]: df.groupby(['c', 'd']).transform(lambda x: pd.Series(np.sum(x['a']+x['b'])))
Out[99]: 
    a   b
0  12  12
1  12  12
2  12  12
3   8   8
4  22  22
5  22  22

Answer by Victor Chubukov

The way I read the question, you want to be able to do something arbitrary with the individual values from both columns. You just need to make sure to return a dataframe of the same size as the one you get passed in. I think the best way is to just make a new column, like this:

df = pd.DataFrame({'a':[1,2,3,4,5,6],
                   'b':[1,2,3,4,5,6],
                   'c':['q', 'q', 'q', 'q', 'w', 'w'],  
                   'd':['z','z','z','o','o','o']})
df['e']=0

def f(x):
    y=(x['a']+x['b'])/sum(x['b'])
    return pd.DataFrame({'e':y,'a':x['a'],'b':x['b']})

df.groupby(['c','d']).transform(f)

Output:

    a   b   e
0   1   1   0.333333
1   2   2   0.666667
2   3   3   1.000000
3   4   4   2.000000
4   5   5   0.909091
5   6   6   1.090909

If you have a very complicated dataframe, you can pick your columns (e.g. df.groupby(['c'])[['a','b','e']].transform(f)).

This sure looks very inelegant to me, but it's still much faster than apply on large datasets.

Another alternative is to use set_index to capture all the columns you need and then pass just one column to transform.
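
The answer gives no code for the set_index idea, so the following is only one possible reading of it: move column b into the index, so that the single Series handed to transform still carries b alongside the values of a. The helper name ratio and the use of append=True are assumptions made for this sketch.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [1, 2, 3, 4, 5, 6],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w'],
                   'd': ['z', 'z', 'z', 'o', 'o', 'o']})


def ratio(s):
    # Hypothetical helper: s holds column a for one group,
    # and column b travels along as an index level.
    b = s.index.get_level_values('b').to_numpy()
    return (s + b) / b.sum()


# Keep the original row index and add b as a second index level.
indexed = df.set_index('b', append=True)

# Only column a is passed to transform; b is read from the index.
out = indexed.groupby(['c', 'd'])['a'].transform(ratio)

# transform preserves the original row order, so the values line up with df.
df['e'] = out.to_numpy()
print(df)
```

With this data, the column e computed here should match the values produced by the function f above, since both compute (a + b) / sum(b) within each (c, d) group.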
