How do I create a new column from the output of pandas groupby().sum()?

Note: this page reproduces a popular Stack Overflow Q&A under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/30244952/

python, pandas, pandas-groupby

Asked by fe ner

I'm trying to create a new column from a groupby calculation. In the code below, I get the correct calculated values for each date (see group below), but when I try to create a new column (df['Data4']) from it, I get NaN. I want the new column to hold the sum of Data3 across all rows sharing a date, applied to each of those rows. For example, 2015-05-08 appears in 2 rows (total is 50 + 5 = 55), and I would like both of those rows to show 55 in the new column.

import pandas as pd
import numpy as np
from pandas import DataFrame

df = pd.DataFrame({
    'Date' : ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05', '2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'], 
    'Sym'  : ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'], 
    'Data2': [11, 8, 10, 15, 110, 60, 100, 40],
    'Data3': [5, 8, 6, 1, 50, 100, 60, 120]
})

group = df['Data3'].groupby(df['Date']).sum()

df['Data4'] = group
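
The group calculation itself comes out right (this is the group referenced above). Note that it is a Series indexed by Date rather than by df's integer index 0-7, which is why the assignment above fills Data4 with NaN:

group

Date
2015-05-05    121
2015-05-06     66
2015-05-07    108
2015-05-08     55
Name: Data3, dtype: int64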

Answered by EdChum

You want to use transform; this will return a Series with its index aligned to the df, so you can then add it as a new column:

In [74]:

df = pd.DataFrame({
    'Date' : ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05', '2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'],
    'Sym'  : ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'],
    'Data2': [11, 8, 10, 15, 110, 60, 100, 40],
    'Data3': [5, 8, 6, 1, 50, 100, 60, 120]
})
df['Data4'] = df['Data3'].groupby(df['Date']).transform('sum')
df
Out[74]:
   Data2  Data3        Date   Sym  Data4
0     11      5  2015-05-08  aapl     55
1      8      8  2015-05-07  aapl    108
2     10      6  2015-05-06  aapl     66
3     15      1  2015-05-05  aapl    121
4    110     50  2015-05-08  aaww     55
5     60    100  2015-05-07  aaww    108
6    100     60  2015-05-06  aaww     66
7     40    120  2015-05-05  aaww    121

Answered by cs95

How do I create a new column with groupby().sum()?

There are two ways - one straightforward and the other slightly more interesting.

Everybody's Favorite: GroupBy.transform() with 'sum'

@EdChum's answer can be simplified a bit: call DataFrame.groupby rather than Series.groupby. This results in simpler syntax.

# The setup.
df[['Date', 'Data3']]

         Date  Data3
0  2015-05-08      5
1  2015-05-07      8
2  2015-05-06      6
3  2015-05-05      1
4  2015-05-08     50
5  2015-05-07    100
6  2015-05-06     60
7  2015-05-05    120

df.groupby('Date')['Data3'].transform('sum')

0     55
1    108
2     66
3    121
4     55
5    108
6     66
7    121
Name: Data3, dtype: int64 

It's a tad faster, too:

df2 = pd.concat([df] * 12345)

%timeit df2['Data3'].groupby(df2['Date']).transform('sum')
%timeit df2.groupby('Date')['Data3'].transform('sum')

10.4 ms ± 367 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.58 ms ± 559 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Unconventional, but Worth your Consideration: GroupBy.sum() + Series.map()

I stumbled upon an interesting idiosyncrasy in the API. From what I can tell, you can reproduce this on any major version over 0.20 (I tested this on 0.23 and 0.24). It seems like you can consistently shave off a few milliseconds of the time taken by transform if you instead use a direct function of GroupBy and broadcast it using map:

df.Date.map(df.groupby('Date')['Data3'].sum())

0     55
1    108
2     66
3    121
4     55
5    108
6     66
7    121
Name: Date, dtype: int64

Compare with

与之比较

df.groupby('Date')['Data3'].transform('sum')

0     55
1    108
2     66
3    121
4     55
5    108
6     66
7    121
Name: Data3, dtype: int64
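
Either one can be assigned straight back to the frame:

df['Data4'] = df.Date.map(df.groupby('Date')['Data3'].sum())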

My tests show that map is a bit faster if you can afford to use the direct GroupBy function (such as mean, min, max, first, etc.). It is more or less faster for most general situations up to around ~200 thousand records. After that, the performance really depends on the data.

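The same broadcasting pattern works with any of those direct reductions; as an illustration (my example, with a made-up column name), a per-date maximum:

df['Data3Max'] = df.Date.map(df.groupby('Date')['Data3'].max())  # broadcasts each date's max Data3 onto its rows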

(Benchmark plots omitted here; left panel: v0.23, right panel: v0.24.)

A nice alternative to know, and better if you have smaller frames with smaller numbers of groups... but I would recommend transform as a first choice. Thought this was worth sharing anyway.

Benchmarking code, for reference:

import numpy as np
import pandas as pd
import perfplot

perfplot.show(
    setup=lambda n: pd.DataFrame({'A': np.random.choice(n//10, n), 'B': np.ones(n)}),
    kernels=[
        lambda df: df.groupby('A')['B'].transform('sum'),
        lambda df: df.A.map(df.groupby('A')['B'].sum()),
    ],
    labels=['GroupBy.transform', 'GroupBy.sum + map'],
    n_range=[2**k for k in range(5, 20)],
    xlabel='N',
    logy=True,
    logx=True
)
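
(perfplot is a third-party benchmarking package, installable with pip install perfplot; it times each kernel at every size in n_range and plots the results on log-log axes.)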