How do I create a new column from the output of pandas groupby().sum()?

Disclaimer: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): Stack Overflow.

Original URL: http://stackoverflow.com/questions/30244952/
Asked by fe ner
I'm trying to create a new column from a groupby calculation. In the code below, I get the correct calculated values for each date (see group below), but when I try to create a new column (df['Data4']) with it, I get NaN. So I am trying to create a new column in the dataframe containing the sum of Data3 for each date, applied to every row with that date. For example, 2015-05-08 appears in 2 rows (total is 50 + 5 = 55), and in this new column I would like to have 55 in both of those rows.
import pandas as pd

df = pd.DataFrame({
    'Date':  ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05',
              '2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'],
    'Sym':   ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'],
    'Data2': [11, 8, 10, 15, 110, 60, 100, 40],
    'Data3': [5, 8, 6, 1, 50, 100, 60, 120]
})

group = df['Data3'].groupby(df['Date']).sum()  # the per-date sums are correct
df['Data4'] = group                            # but this assignment yields all NaN
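For reference (an editorial aside, not part of the original question): group is a Series indexed by the Date strings rather than by the dataframe's 0-7 integer index. pandas aligns on index when assigning a Series to a column, so no labels match and Data4 comes out as all NaN. group itself looks like this:

Date
2015-05-05    121
2015-05-06     66
2015-05-07    108
2015-05-08     55
Name: Data3, dtype: int64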
Answered by EdChum
You want to use transform; this will return a Series with its index aligned to the df, so you can then add it as a new column:
In [74]:
df = pd.DataFrame({'Date': ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05', '2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'], 'Sym': ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'], 'Data2': [11, 8, 10, 15, 110, 60, 100, 40],'Data3': [5, 8, 6, 1, 50, 100, 60, 120]})
df['Data4'] = df['Data3'].groupby(df['Date']).transform('sum')
df
Out[74]:
Data2 Data3 Date Sym Data4
0 11 5 2015-05-08 aapl 55
1 8 8 2015-05-07 aapl 108
2 10 6 2015-05-06 aapl 66
3 15 1 2015-05-05 aapl 121
4 110 50 2015-05-08 aaww 55
5 60 100 2015-05-07 aaww 108
6 100 60 2015-05-06 aaww 66
7 40 120 2015-05-05 aaww 121
Answered by cs95
How do I create a new column with Groupby().Sum()?

There are two ways - one straightforward and the other slightly more interesting.
Everybody's Favorite: GroupBy.transform() with 'sum'

@EdChum's answer can be simplified a bit. Call DataFrame.groupby rather than Series.groupby. This results in simpler syntax.
# The setup.
df[['Date', 'Data3']]
Date Data3
0 2015-05-08 5
1 2015-05-07 8
2 2015-05-06 6
3 2015-05-05 1
4 2015-05-08 50
5 2015-05-07 100
6 2015-05-06 60
7 2015-05-05 120
df.groupby('Date')['Data3'].transform('sum')
0 55
1 108
2 66
3 121
4 55
5 108
6 66
7 121
Name: Data3, dtype: int64
It's a tad faster,
df2 = pd.concat([df] * 12345)
%timeit df2['Data3'].groupby(df['Date']).transform('sum')
%timeit df2.groupby('Date')['Data3'].transform('sum')
10.4 ms ± 367 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.58 ms ± 559 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Unconventional, but Worth your Consideration: GroupBy.sum() + Series.map()

I stumbled upon an interesting idiosyncrasy in the API. From what I can tell, you can reproduce this on any major version over 0.20 (I tested this on 0.23 and 0.24). It seems like you can consistently shave a few milliseconds off the time taken by transform if you instead use a direct function of GroupBy and broadcast it using map:
df.Date.map(df.groupby('Date')['Data3'].sum())
0 55
1 108
2 66
3 121
4 55
5 108
6 66
7 121
Name: Date, dtype: int64
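To actually create the new column the question asks for (a completeness note on my part; the original answer shows only the broadcast Series), assign the mapped result back:

df['Data4'] = df['Date'].map(df.groupby('Date')['Data3'].sum())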
Compare with
df.groupby('Date')['Data3'].transform('sum')
0 55
1 108
2 66
3 121
4 55
5 108
6 66
7 121
Name: Data3, dtype: int64
My tests show that map is a bit faster if you can afford to use a direct GroupBy function (such as mean, min, max, first, etc.). It is more or less faster for most general situations, up to around 200 thousand records. After that, the performance really depends on the data.
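As a quick illustration (my own sketch, not from the original answer), here is the same idiom with mean, one of the direct GroupBy functions mentioned above:

# Hypothetical extra column, for illustration only; 'Data3_mean' is my name, not the OP's
df['Data3_mean'] = df['Date'].map(df.groupby('Date')['Data3'].mean())
# e.g. both 2015-05-08 rows get (5 + 50) / 2 = 27.5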
(Benchmark plots omitted here. Left: v0.23, Right: v0.24.)
A nice alternative to know, and better if you have smaller frames with smaller numbers of groups, but I would recommend transform as a first choice. Thought this was worth sharing anyway.
Benchmarking code, for reference:
import numpy as np
import pandas as pd
import perfplot

perfplot.show(
    setup=lambda n: pd.DataFrame({'A': np.random.choice(n//10, n), 'B': np.ones(n)}),
    kernels=[
        lambda df: df.groupby('A')['B'].transform('sum'),
        lambda df: df.A.map(df.groupby('A')['B'].sum()),
    ],
    labels=['GroupBy.transform', 'GroupBy.sum + map'],
    n_range=[2**k for k in range(5, 20)],
    xlabel='N',
    logy=True,
    logx=True
)
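A note on the benchmark setup (mine, not the answerer's): perfplot is a third-party benchmarking package, installable with pip install perfplot. The setup lambda builds a frame whose 'A' column has up to n//10 distinct values, so the number of groups grows along with the row count N.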