Nested groupby in DataFrame and aggregate multiple columns
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must follow the same CC BY-SA terms and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/40493024/
Asked by Rahul
I am trying to do nested groupby as follows:
>>> df1 = pd.DataFrame({
...     'Date': {0: '2016-10-11', 1: '2016-10-11', 2: '2016-10-11', 3: '2016-10-11', 4: '2016-10-11', 5: '2016-10-12'},
...     'Stock': {0: 'ABC', 1: 'ABC', 2: 'ABC', 3: 'ABC', 4: 'ABC', 5: 'XYZ'},
...     'Quantity': {0: 60, 1: 50, 2: 40, 3: 30, 4: 20, 5: 10},
...     'UiD': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3},
...     'StartTime': {0: '08:00:00.241', 1: '08:00:00.243', 2: '12:34:23.563', 3: '08:14:05.908', 4: '18:54:50.100', 5: '10:08:36.657'},
...     'Sign': {0: 1, 1: 1, 2: 0, 3: -1, 4: 0, 5: -1},
...     'leg1': {0: 2, 1: 2, 2: 4, 3: 5, 4: 7, 5: 8}})
>>> df1
         Date  Quantity  Sign     StartTime Stock  UiD  leg1
0  2016-10-11        60     1  08:00:00.241   ABC    1     2
1  2016-10-11        50     1  08:00:00.243   ABC    1     2
2  2016-10-11        40     0  12:34:23.563   ABC    1     4
3  2016-10-11        30    -1  08:14:05.908   ABC    2     5
4  2016-10-11        20     0  18:54:50.100   ABC    2     7
5  2016-10-12        10    -1  10:08:36.657   XYZ    3     8
>>> dfg1=df1.groupby(['Date','Stock'])
>>> dfg1.apply(lambda x:x.groupby('UiD').first()).groupby(['Date','Stock']).apply(lambda x:np.sum(x['Quantity']))
Date        Stock
2016-10-11  ABC      90
2016-10-12  XYZ      10
dtype: int64
>>>
>>> dfg1['leg1'].sum()
Date        Stock
2016-10-11  ABC      20
2016-10-12  XYZ       8
Name: leg1, dtype: int64
So far so good. Now I am trying to concatenate the two results into a new DataFrame df2 as follows:
>>> df2 = pd.concat([dfg1['leg1'].sum(),
...                  dfg1.apply(lambda x: x.groupby('UiD').first()).groupby(['Date','Stock']).apply(lambda x: np.sum(x['Quantity']))],
...                 axis=1)
>>> df2
                   0   1
Date       Stock
2016-10-11 ABC    20  90
2016-10-12 XYZ     8  10
>>>
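As a side note that is not in the original post, the concatenated columns show up as 0 and 1 above; if named columns are preferred, they can be renamed afterwards, for example:

df2.columns = ['leg1', 'Quantity']  # hypothetical polish, not part of the original code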
I am wondering if there is a better way to rewrite the following line in order to avoid repeating groupby(['Date','Stock']):
dfg1.apply(lambda x:x.groupby('UiD').first()).groupby(['Date','Stock']).apply(lambda x:np.sum(x['Quantity']))
Also, this fails if ['Date','Stock'] contains 'UiD' as one of the keys, or if ['Date','Stock'] is replaced by just ['UiD'].
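For reference, here is a sketch that is not from the original thread: one way to sidestep both issues (the repeated groupby(['Date','Stock']) and the inner groupby('UiD')) is to reduce the frame to one row per UiD up front with drop_duplicates, then group only once. Note that drop_duplicates keeps the first row as-is, whereas groupby(...).first() takes the first non-null value per column, so the two only agree when there are no NaNs.

# Sketch only: assumes the df1 defined above.
first_per_uid = df1.drop_duplicates(subset=['Date', 'Stock', 'UiD'], keep='first')
df2 = pd.concat([df1.groupby(['Date', 'Stock'])['leg1'].sum(),
                 first_per_uid.groupby(['Date', 'Stock'])['Quantity'].sum()],
                axis=1)

With the sample data this reproduces the 20/90 and 8/10 values shown earlier.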
Answered by smci
Please restate your question to be clearer. You want to groupby(['Date','Stock']), then:
- take only the first record for each UiD and sum (aggregate) its Quantity, but also
- sum all leg1 values for that Date,Stock combination (not just the first-for-each-UiD). Is that right?
Anyway you want to perform an aggregation (sum) on multiple columns, and yeah the way to avoid repetition of groupby(['Date','Stock']) is to keep one dataframe, not try to stitch together two dataframes from two individual aggregate operations. Something like the following (I'll fix it once you confirm this is what you want):
def filter_first_UiD(g):
    # return g.groupby('UiD').first().agg(np.sum)
    return g.groupby('UiD').first().agg({'Quantity': 'sum', 'leg1': 'sum'})

df1.groupby(['Date','Stock']).apply(filter_first_UiD)
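If the intent matches the asker's own concat above (Quantity from the first row per UiD, but leg1 summed over every row of the group), a single-pass sketch along the following lines might do it. The helper name agg_group and the assumed semantics are not confirmed in the thread.

# Sketch only: assumes df1 from the question; leg1 is summed over all rows of the
# (Date, Stock) group, while Quantity uses only the first row of each UiD.
def agg_group(g):
    return pd.Series({
        'leg1': g['leg1'].sum(),
        'Quantity': g.groupby('UiD')['Quantity'].first().sum(),
    })

df1.groupby(['Date', 'Stock']).apply(agg_group)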
Answered by Rahul
The way I dealt with the last scenario, keeping the groupby from failing when ['Date','Stock'] contains 'UiD' as one of the keys or when ['Date','Stock'] is replaced by just ['UiD'], is as follows:
>>> df2 = pd.concat([dfg1['leg1'].sum(),
...                  dfg1['Quantity'].first() if 'UiD' in ['Date','Stock']
...                  else dfg1.apply(lambda x: x.groupby('UiD').first()).groupby(['Date','Stock']).apply(lambda x: np.sum(x['Quantity']))],
...                 axis=1)
But a more elegant solution is still an open question.
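One possibly more elegant direction (a sketch with a hypothetical keys variable, not something posted in the thread): group by the keys plus 'UiD' to take the first Quantity per UiD, then collapse that result back to the original keys with a level-based groupby. No special-casing is needed when 'UiD' is itself one of the keys.

keys = ['Date', 'Stock']  # hypothetical; could equally be ['UiD'] or include 'UiD'
uid_keys = keys if 'UiD' in keys else keys + ['UiD']
df2 = pd.concat([df1.groupby(keys)['leg1'].sum(),
                 df1.groupby(uid_keys)['Quantity'].first().groupby(level=keys).sum()],
                axis=1)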