How do I create a new column from the output of pandas groupby().sum()?

Note: this page reproduces a popular Stack Overflow Q&A under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/30244952/

python, pandas, pandas-groupby

Asked by fe ner

I'm trying to create a new column from a groupby calculation. In the code below, I get the correct calculated values for each date (see group below), but when I try to create a new column (df['Data4']) from it, I get NaN. I want the new column to hold the sum of Data3 across all rows sharing a date, applied to each of those rows. For example, 2015-05-08 appears in 2 rows (total is 50 + 5 = 55), and I would like both of those rows to show 55 in the new column.

import pandas as pd
import numpy as np
from pandas import DataFrame

df = pd.DataFrame({
    'Date' : ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05', '2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'], 
    'Sym'  : ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'], 
    'Data2': [11, 8, 10, 15, 110, 60, 100, 40],
    'Data3': [5, 8, 6, 1, 50, 100, 60, 120]
})

group = df['Data3'].groupby(df['Date']).sum()

df['Data4'] = group
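
The group calculation itself comes out right (this is the group referenced above). Note that it is a Series indexed by Date rather than by df's integer index 0-7, which is why the assignment above fills Data4 with NaN:

group

Date
2015-05-05    121
2015-05-06     66
2015-05-07    108
2015-05-08     55
Name: Data3, dtype: int64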

Answered by EdChum

You want to use transform; this will return a Series with its index aligned to the df, so you can then add it as a new column:

In [74]:

df = pd.DataFrame({
    'Date' : ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05', '2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'],
    'Sym'  : ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'],
    'Data2': [11, 8, 10, 15, 110, 60, 100, 40],
    'Data3': [5, 8, 6, 1, 50, 100, 60, 120]
})
df['Data4'] = df['Data3'].groupby(df['Date']).transform('sum')
df
Out[74]:
   Data2  Data3        Date   Sym  Data4
0     11      5  2015-05-08  aapl     55
1      8      8  2015-05-07  aapl    108
2     10      6  2015-05-06  aapl     66
3     15      1  2015-05-05  aapl    121
4    110     50  2015-05-08  aaww     55
5     60    100  2015-05-07  aaww    108
6    100     60  2015-05-06  aaww     66
7     40    120  2015-05-05  aaww    121

Answered by cs95

How do I create a new column with groupby().sum()?

There are two ways - one straightforward and the other slightly more interesting.

Everybody's Favorite: GroupBy.transform() with 'sum'

@EdChum's answer can be simplified a bit: call DataFrame.groupby rather than Series.groupby. This results in simpler syntax.

# The setup.
df[['Date', 'Data3']]

         Date  Data3
0  2015-05-08      5
1  2015-05-07      8
2  2015-05-06      6
3  2015-05-05      1
4  2015-05-08     50
5  2015-05-07    100
6  2015-05-06     60
7  2015-05-05    120

df.groupby('Date')['Data3'].transform('sum')

0     55
1    108
2     66
3    121
4     55
5    108
6     66
7    121
Name: Data3, dtype: int64 

It's a tad faster, too:

df2 = pd.concat([df] * 12345)

%timeit df2['Data3'].groupby(df2['Date']).transform('sum')
%timeit df2.groupby('Date')['Data3'].transform('sum')

10.4 ms ± 367 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.58 ms ± 559 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Unconventional, but Worth your Consideration: GroupBy.sum() + Series.map()

I stumbled upon an interesting idiosyncrasy in the API. From what I can tell, you can reproduce this on any major version over 0.20 (I tested this on 0.23 and 0.24). It seems like you can consistently shave off a few milliseconds of the time taken by transform if you instead use a direct function of GroupBy and broadcast it using map:

df.Date.map(df.groupby('Date')['Data3'].sum())

0     55
1    108
2     66
3    121
4     55
5    108
6     66
7    121
Name: Date, dtype: int64

Compare with

与之比较

df.groupby('Date')['Data3'].transform('sum')

0     55
1    108
2     66
3    121
4     55
5    108
6     66
7    121
Name: Data3, dtype: int64
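
Either one can be assigned straight back to the frame:

df['Data4'] = df.Date.map(df.groupby('Date')['Data3'].sum())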

My tests show that map is a bit faster if you can afford to use the direct GroupBy function (such as mean, min, max, first, etc.). It is more or less faster for most general situations up to around ~200 thousand records. After that, the performance really depends on the data.

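The same broadcasting pattern works with any of those direct reductions; as an illustration (my example, with a made-up column name), a per-date maximum:

df['Data3Max'] = df.Date.map(df.groupby('Date')['Data3'].max())  # broadcasts each date's max Data3 onto its rows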

(Benchmark plots omitted here; left panel: v0.23, right panel: v0.24.)

A nice alternative to know, and better if you have smaller frames with smaller numbers of groups... but I would recommend transform as a first choice. Thought this was worth sharing anyway.

Benchmarking code, for reference:

import numpy as np
import pandas as pd
import perfplot

perfplot.show(
    setup=lambda n: pd.DataFrame({'A': np.random.choice(n//10, n), 'B': np.ones(n)}),
    kernels=[
        lambda df: df.groupby('A')['B'].transform('sum'),
        lambda df: df.A.map(df.groupby('A')['B'].sum()),
    ],
    labels=['GroupBy.transform', 'GroupBy.sum + map'],
    n_range=[2**k for k in range(5, 20)],
    xlabel='N',
    logy=True,
    logx=True
)
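
(perfplot is a third-party benchmarking package, installable with pip install perfplot; it times each kernel at every size in n_range and plots the results on log-log axes.)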