pandas 如何在熊猫中按对象分组应用滚动功能

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/28642511/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-13 22:58:09  来源:igfitidea点击:

How to apply rolling functions in a group by object in pandas

pythonpandasgroup-bydataframeapply

提问by user6396

I'm having difficulty to solve a look-back or roll-over problem in dataframe or perhaps in groupby.

我很难解决数据帧或 groupby 中的回溯或翻转问题。

The following is a simple example of the dataframe I have:

以下是我拥有的数据框的一个简单示例:

              fruit    amount    
   20140101   apple     3
   20140102   apple     5
   20140102   orange    10
   20140104   banana    2
   20140104   apple     10
   20140104   orange    4
   20140105   orange    6
   20140105   grape     1
   …
   20141231   apple     3
   20141231   grape     2

I need to calculate the average value of 'amount' of each fruit in the previous 3 days for everyday, and create the following data frame:

我需要计算每天前 3 天每个水果的“数量”的平均值,并创建以下数据框:

              fruit     average_in_last 3 days
   20140104   apple      4
   20140104   orange     10
   ...

For example on 20140104, the previous 3 days are 20140101, 20140102 and 20140103 (note the date in the data frame is not continuous and 20140103 does not exist), the average amount of apple is (3+5)/2 = 4 and orange is 10/1=10, the rest is 0.

比如20140104,前3天分别是20140101、20140102和20140103(注意数据框中的日期不连续,20140103不存在),苹果的平均数量是(3+5)/2 = 4,橙子是 10/1=10,其余为 0。

The sample data frame is very simple but the actual data frame is much more complicated and larger. Hope someone can shed some light on this, thank you in advance!

示例数据框非常简单,但实际的数据框要复杂得多且更大。希望有人能对此有所了解,在此先感谢您!

回答by dbokers

Assuming we have a data frame like that in the beginning,

假设我们一开始有一个这样的数据框,

>>> df
             fruit  amount
2017-06-01   apple       1
2017-06-03   apple      16
2017-06-04   apple      12
2017-06-05   apple       8
2017-06-06   apple      14
2017-06-08   apple       1
2017-06-09   apple       4
2017-06-02  orange      13
2017-06-03  orange       9
2017-06-04  orange       9
2017-06-05  orange       2
2017-06-06  orange      11
2017-06-07  orange       6
2017-06-08  orange       3
2017-06-09  orange       3
2017-06-10  orange      13
2017-06-02   grape      14
2017-06-03   grape      16
2017-06-07   grape       4
2017-06-09   grape      15
2017-06-10   grape       5

>>> dates = [i.date() for i in pd.date_range('2017-06-01', '2017-06-10')]

>>> temp = (df.groupby('fruit')['amount']
    .apply(lambda x: x.reindex(dates)  # fill in the missing dates for each group)
                      .fillna(0)   # fill each missing group with 0
                      .rolling(3)
                      .sum()) # do a rolling sum
    .reset_index()
    .rename(columns={'amount': 'sum_of_3_days', 
                     'level_1': 'date'}))  # rename date index to date col


>>> temp.head()
   fruit        date  amount
0  apple  2017-06-01     NaN
1  apple  2017-06-02     NaN
2  apple  2017-06-03    17.0
3  apple  2017-06-04    28.0
4  apple  2017-06-05    36.0

# converts the date index into date column 
>>> df = df.reset_index().rename(columns={'index': 'date'})  
>>> df.merge(temp, on=['fruit', 'date'])
>>> df
          date   fruit  amount  sum_of_3_days
0   2017-06-01   apple       1                NaN
1   2017-06-03   apple      16               17.0
2   2017-06-04   apple      12               28.0
3   2017-06-05   apple       8               36.0
4   2017-06-06   apple      14               34.0
5   2017-06-08   apple       1               15.0
6   2017-06-09   apple       4                5.0
7   2017-06-02  orange      13                NaN
8   2017-06-03  orange       9               22.0
9   2017-06-04  orange       9               31.0
10  2017-06-05  orange       2               20.0
11  2017-06-06  orange      11               22.0
12  2017-06-07  orange       6               19.0
13  2017-06-08  orange       3               20.0
14  2017-06-09  orange       3               12.0
15  2017-06-10  orange      13               19.0
16  2017-06-02   grape      14                NaN
17  2017-06-03   grape      16               30.0
18  2017-06-07   grape       4                4.0
19  2017-06-09   grape      15               19.0
20  2017-06-10   grape       5               20.0

回答by Gustavo Linari Rodrigues

I also wanted to use rolling with groupby, this is why I landed on this page, but I believe that I have a workaround that is better than the previous suggestions.

我还想在 groupby 中使用滚动,这就是我登陆此页面的原因,但我相信我有一个比以前的建议更好的解决方法。

You could do the following:

您可以执行以下操作:

pivoted_df = pd.pivot_table(df, index='date', columns='fruits', values='amount')
average_fruits = pivoted_df.rolling(window=3).mean().stack().reset_index()

the .stack()is not necessary, but will transform your pivot table back to a regular df

.stack()是没有必要的,但将改变你的数据透视表回到常规DF

回答by Roman Pekar

you can do it like this:

你可以这样做:

>>> df
>>>
           fruit  amount
20140101   apple       3
20140102   apple       5
20140102  orange      10
20140104  banana       2
20140104   apple      10
20140104  orange       4
20140105  orange       6
20140105   grape       1

>>> g= df.set_index('fruit', append=True).groupby(level=1)
>>> res = g['amount'].apply(pd.rolling_mean, 3, 1).reset_index('fruit')
>>> res

           fruit          0
20140101   apple   3.000000
20140102   apple   4.000000
20140102  orange  10.000000
20140104  banana   2.000000
20140104   apple   6.000000
20140104  orange   7.000000
20140105  orange   6.666667
20140105   grape   1.000000

update

更新

Well, as @cphlewis mentioned in comments, my code will not give the results you want. I've checked different approaches and the one I found so far is something like this (not sure about performance, though):

好吧,正如@cphlewis 在评论中提到的,我的代码不会给出你想要的结果。我检查了不同的方法,到目前为止我发现的方法是这样的(但不确定性能):

>>> df.index = [pd.to_datetime(str(x), format='%Y%m%d') for x in df.index]
>>> df.reset_index(inplace=True)
>>> def avg_3_days(x):
        return df[(df['index'] >= x['index'] - pd.DateOffset(3)) & (df['index'] < x['index']) & (df['fruit'] == x['fruit'])].amount.mean()

>>> df['res'] = df.apply(avg_3_days, axis=1)
>>> df

       index   fruit  amount  res
0 2014-01-01   apple       3  NaN
1 2014-01-02   apple       5    3
2 2014-01-02  orange      10  NaN
3 2014-01-04  banana       2  NaN
4 2014-01-04   apple      10    4
5 2014-01-04  orange       4   10
6 2014-01-05  orange       6    7
7 2014-01-05   grape       1  NaN