pandas: nested groupby in a DataFrame and aggregating multiple columns

Note: this page is a Chinese-English translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not the translator). Original question: http://stackoverflow.com/questions/40493024/


Nested groupby in DataFrame and aggregate multiple columns

Tags: pandas, dataframe, group-by, nested, aggregate

Asked by Rahul

I am trying to do nested groupby as follows:


>>> import numpy as np
>>> import pandas as pd
>>> df1 = pd.DataFrame({'Date': {0: '2016-10-11', 1: '2016-10-11', 2: '2016-10-11', 3: '2016-10-11', 4: '2016-10-11', 5: '2016-10-12'}, 'Stock': {0: 'ABC', 1: 'ABC', 2: 'ABC', 3: 'ABC', 4: 'ABC', 5: 'XYZ'}, 'Quantity': {0: 60, 1: 50, 2: 40, 3: 30, 4: 20, 5: 10}, 'UiD':{0:1,1:1,2:1,3:2,4:2,5:3}, 'StartTime': {0: '08:00:00.241', 1: '08:00:00.243', 2: '12:34:23.563', 3: '08:14:05.908', 4: '18:54:50.100', 5: '10:08:36.657'}, 'Sign':{0:1,1:1,2:0,3:-1,4:0,5:-1}, 'leg1':{0:2,1:2,2:4,3:5,4:7,5:8}})
>>> df1
         Date  Quantity  Sign     StartTime Stock  UiD  leg1
0  2016-10-11        60     1  08:00:00.241   ABC    1     2
1  2016-10-11        50     1  08:00:00.243   ABC    1     2
2  2016-10-11        40     0  12:34:23.563   ABC    1     4
3  2016-10-11        30    -1  08:14:05.908   ABC    2     5
4  2016-10-11        20     0  18:54:50.100   ABC    2     7
5  2016-10-12        10    -1  10:08:36.657   XYZ    3     8
>>> dfg1=df1.groupby(['Date','Stock'])
>>> dfg1.apply(lambda x:x.groupby('UiD').first()).groupby(['Date','Stock']).apply(lambda x:np.sum(x['Quantity']))
Date        Stock
2016-10-11  ABC      90
2016-10-12  XYZ      10
dtype: int64
>>>
>>> dfg1['leg1'].sum()
Date        Stock
2016-10-11  ABC      20
2016-10-12  XYZ       8
Name: leg1, dtype: int64

So far so good. Now I am trying to concatenate the two results into a new DataFrame df2 as follows:


>>> df2 = pd.concat([dfg1['leg1'].sum(), dfg1.apply(lambda x:x.groupby('UiD').first()).groupby(['Date','Stock']).apply(lambda x:np.sum(x['Quantity']))],axis=1)
>>> df2
                   0   1
Date       Stock        
2016-10-11 ABC    20  90
2016-10-12 XYZ     8  10
>>>

I am wondering if there is a better way to rewrite the following line in order to avoid repeating groupby(['Date','Stock']):


dfg1.apply(lambda x:x.groupby('UiD').first()).groupby(['Date','Stock']).apply(lambda x:np.sum(x['Quantity']))

Also this fails if ['Date','Stock'] contains 'UiD' as one of the keys, or if ['Date','Stock'] is replaced by just ['UiD'].

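One candidate rewrite of that line, sketched here for reference rather than taken from the thread (it assumes only the per-UiD-first Quantity sum is wanted from that expression): since the outer lambda already receives a single (Date, Stock) group, the reduction can be finished inside it, and the second groupby(['Date','Stock']) disappears.

>>> dfg1.apply(lambda x: x.groupby('UiD')['Quantity'].first().sum())

With the sample df1 above this should give the same 90/10 Series as the original expression.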

Answered by smci

Please restate your question to be clearer. You want to groupby(['Date','Stock']), then:


  1. take only the first record for each UiD and sum (aggregate) its Quantity, but also
  2. sum all leg1 values for that Date,Stock combination (not just the first-for-each-UiD). Is that right?

Anyway you want to perform an aggregation (sum) on multiple columns, and yeah the way to avoid repetition of groupby(['Date','Stock']) is to keep one dataframe, not try to stitch together two dataframes from two individual aggregate operations. Something like the following (I'll fix it once you confirm this is what you want):


def filter_first_UiD(g):
    # Within one (Date, Stock) group: keep the first record of each UiD,
    # then sum Quantity and leg1 over those first records.
    #return g.groupby('UiD').first().agg(np.sum)
    return g.groupby('UiD').first().agg({'Quantity':'sum', 'leg1':'sum'})

df1.groupby(['Date','Stock']).apply(filter_first_UiD)
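If reading 2 above is the intent (leg1 summed over every row of the Date,Stock group, not only the first row per UiD), a variant along the same lines could be sketched as below. This is not from the original answer; the name filter_first_UiD_v2 is hypothetical, and it relies on df1 as defined in the question.

def filter_first_UiD_v2(g):
    # Quantity: sum of the first record of each UiD in this (Date, Stock) group.
    # leg1: sum over every row of the group.
    return pd.Series({'Quantity': g.groupby('UiD')['Quantity'].first().sum(),
                      'leg1': g['leg1'].sum()})

df1.groupby(['Date','Stock']).apply(filter_first_UiD_v2)

With the sample df1 this should give the same numbers as the df2 built with pd.concat above (Quantity 90/10, leg1 20/8), while writing groupby(['Date','Stock']) only once.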

Answered by Rahul

The way I dealt with the last scenario, where the groupby fails if ['Date','Stock'] contains 'UiD' as one of the keys or if ['Date','Stock'] is replaced by just ['UiD'], is as follows:


>>> df2 = pd.concat([dfg1['leg1'].sum(), dfg1['Quantity'].first() if 'UiD' in ['Date','Stock'] else dfg1.apply(lambda x:x.groupby('UiD').first()).groupby(['Date','Stock']).apply(lambda x:np.sum(x['Quantity']))],axis=1)

But a more elegant solution is still an open question.

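One possible direction for that open question, sketched here as an assumption rather than a confirmed answer: drop_duplicates(subset='UiD') also keeps the first row of each UiD, and it degenerates gracefully when 'UiD' is itself one of the grouping keys, so the same function can be reused whether the keys are ['Date','Stock'], ['Date','Stock','UiD'], or just ['UiD']. The keys variable and summarize name below are illustrative, and df1 is the frame defined in the question.

keys = ['Date', 'Stock']   # works the same if 'UiD' is one of the keys, or the only key

def summarize(g):
    # leg1: summed over all rows of the group.
    # Quantity: summed over the first row of each UiD in the group.
    return pd.Series({'leg1': g['leg1'].sum(),
                      'Quantity': g.drop_duplicates(subset='UiD')['Quantity'].sum()})

df2 = df1.groupby(keys).apply(summarize)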