Note: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not the translator). Original StackOverflow URL: http://stackoverflow.com/questions/42255458/

How to group a pandas dataframe by a defined time interval?

Tags: python, pandas, datetime, group-by

Asked by EduardoRL

I have a DataFrame like this. I would like to group it into 60-minute intervals, with the grouping starting at 06:30.

                           data
index
2017-02-14 06:29:57    11198648
2017-02-14 06:30:01    11198650
2017-02-14 06:37:22    11198706
2017-02-14 23:11:13    11207728
2017-02-14 23:21:43    11207774
2017-02-14 23:22:36    11207776
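
For anyone who wants to reproduce the frame above, a minimal sketch could look like the following. The variable names `idx` and `df` are assumptions for illustration; only the timestamps and values come from the question.

```python
import pandas as pd

# Timestamps and values copied from the table in the question.
idx = pd.to_datetime([
    "2017-02-14 06:29:57",
    "2017-02-14 06:30:01",
    "2017-02-14 06:37:22",
    "2017-02-14 23:11:13",
    "2017-02-14 23:21:43",
    "2017-02-14 23:22:36",
])
df = pd.DataFrame(
    {"data": [11198648, 11198650, 11198706, 11207728, 11207774, 11207776]},
    index=idx,
)
df.index.name = "index"  # matches the index header shown in the question
print(df)
```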

I am using:

df.groupby(pd.TimeGrouper(freq='60Min'))

I get this grouping:

                      data
index       
2017-02-14 06:00:00     x1
2017-02-14 07:00:00     x2
2017-02-14 08:00:00     x3
2017-02-14 09:00:00     x4
2017-02-14 10:00:00     x5

but I am looking for this result:

                      data
index       
2017-02-14 06:30:00     x1
2017-02-14 07:30:00     x2
2017-02-14 08:30:00     x3
2017-02-14 09:30:00     x4
2017-02-14 10:30:00     x5
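
The hour-aligned labels in the first table come from the default bin alignment, which can be reproduced with a short sketch. This assumes a recent pandas release, where `pd.TimeGrouper` is no longer available and `pd.Grouper` is used instead; the frame is rebuilt from the sample data so the snippet runs on its own.

```python
import pandas as pd

idx = pd.to_datetime([
    "2017-02-14 06:29:57", "2017-02-14 06:30:01", "2017-02-14 06:37:22",
    "2017-02-14 23:11:13", "2017-02-14 23:21:43", "2017-02-14 23:22:36",
])
df = pd.DataFrame(
    {"data": [11198648, 11198650, 11198706, 11207728, 11207774, 11207776]},
    index=idx,
)

# With no offset, 60-minute bins are aligned to the top of the hour,
# so the first label is 06:00 rather than 06:30.
hourly = df.groupby(pd.Grouper(freq="60min")).first()
print(hourly.index[0])
```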

How can I tell the function to start grouping at 6:30 at one-hour intervals?

If it cannot be done with .groupby(pd.TimeGrouper(freq='60Min')), what is the best way to do it?

Regards, and thanks very much in advance.

Answered by Nickil Maveli

Use base=30 in conjunction with the label='right' parameter in pd.Grouper.

Specifying label='right' labels each bin with its right (upper) edge, so the first group is labelled 6:30 and not 5:30. Also, base is set to 0 by default, which aligns the bins to the top of the hour; setting it to 30 shifts the bin edges forward by 30 minutes.

Suppose you want to take the first element of every sub-group; then:

df.groupby(pd.Grouper(freq='60Min', base=30, label='right')).first()
# same thing using resample - df.resample('60Min', base=30, label='right').first()

yields:

                           data
index                          
2017-02-14 06:30:00  11198648.0
2017-02-14 07:30:00  11198650.0
2017-02-14 08:30:00         NaN
2017-02-14 09:30:00         NaN
2017-02-14 10:30:00         NaN
2017-02-14 11:30:00         NaN
2017-02-14 12:30:00         NaN
2017-02-14 13:30:00         NaN
2017-02-14 14:30:00         NaN
2017-02-14 15:30:00         NaN
2017-02-14 16:30:00         NaN
2017-02-14 17:30:00         NaN
2017-02-14 18:30:00         NaN
2017-02-14 19:30:00         NaN
2017-02-14 20:30:00         NaN
2017-02-14 21:30:00         NaN
2017-02-14 22:30:00         NaN
2017-02-14 23:30:00  11207728.0
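
In more recent pandas releases the `base` argument was deprecated (pandas 1.1) and later removed (pandas 2.0) in favour of `offset` and `origin`. The following is a minimal sketch of what the equivalent call might look like, assuming pandas 1.1 or newer; it also compares `label='right'` with `label='left'` to show that the label only changes how each bin is named, not where its edges fall. The frame is rebuilt from the question's sample data.

```python
import pandas as pd

idx = pd.to_datetime([
    "2017-02-14 06:29:57", "2017-02-14 06:30:01", "2017-02-14 06:37:22",
    "2017-02-14 23:11:13", "2017-02-14 23:21:43", "2017-02-14 23:22:36",
])
df = pd.DataFrame(
    {"data": [11198648, 11198650, 11198706, 11207728, 11207774, 11207776]},
    index=idx,
)

# offset="30min" shifts the hourly bin edges to xx:30 (the role base=30 played).
grouper = pd.Grouper(freq="60min", offset="30min", label="right")
right = df.groupby(grouper).first()

# Same bins, but each one is named after its left edge instead.
left = df.resample("60min", offset="30min", label="left").first()

print(right.index[0])  # bin [05:30, 06:30) labelled by its right edge
print(left.index[0])   # the same bin labelled by its left edge
```

With `label='left'`, the bin that starts at 06:30 is also labelled 06:30, which may be closer to what the question asks for; which labelling is preferable depends on how the result will be read.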

Answered by Erfan

DataFrame.resample is a dedicated method for resampling time series, so with it we don't need DataFrame.groupby and pd.Grouper:

df.resample('60min', base=30, label='right').first()

Output

                           data
index                          
2017-02-14 06:30:00  11198648.0
2017-02-14 07:30:00  11198650.0
2017-02-14 08:30:00         NaN
2017-02-14 09:30:00         NaN
2017-02-14 10:30:00         NaN
2017-02-14 11:30:00         NaN
2017-02-14 12:30:00         NaN
2017-02-14 13:30:00         NaN
2017-02-14 14:30:00         NaN
2017-02-14 15:30:00         NaN
2017-02-14 16:30:00         NaN
2017-02-14 17:30:00         NaN
2017-02-14 18:30:00         NaN
2017-02-14 19:30:00         NaN
2017-02-14 20:30:00         NaN
2017-02-14 21:30:00         NaN
2017-02-14 22:30:00         NaN
2017-02-14 23:30:00  11207728.0


Note: when your dataframe has multiple columns, you can select the column you want to aggregate; otherwise the aggregation is applied to every column:

df.resample('60min', base=30, label='right')['data'].first()
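
As a rough illustration of the column selection, the sketch below adds a second column named `other`, which is hypothetical and not part of the question's data. It uses `offset='30min'` in place of `base=30`, on the assumption that pandas 1.1 or newer is installed.

```python
import pandas as pd

idx = pd.to_datetime([
    "2017-02-14 06:29:57", "2017-02-14 06:30:01", "2017-02-14 06:37:22",
    "2017-02-14 23:11:13", "2017-02-14 23:21:43", "2017-02-14 23:22:36",
])
df = pd.DataFrame(
    {
        "data": [11198648, 11198650, 11198706, 11207728, 11207774, 11207776],
        "other": [1, 2, 3, 4, 5, 6],  # hypothetical extra column
    },
    index=idx,
)

resampler = df.resample("60min", offset="30min", label="right")

# Selecting one column returns a Series for that column only.
data_only = resampler["data"].first()

# Without a selection, first() is applied to every column.
all_columns = resampler.first()

print(data_only.head(2))
print(all_columns.head(2))
```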