Python Pandas: resample time series with groupby

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/32012012/

Time: 2020-08-19 10:52:21  Source: igfitidea

Pandas: resample timeseries with groupby

Tags: python, pandas, group-by, time-series

Asked by AshB

Given the below pandas DataFrame:

In [115]: import pandas as pd
          times = pd.to_datetime(pd.Series(['2014-08-25 21:00:00','2014-08-25 21:04:00',
                                            '2014-08-25 22:07:00','2014-08-25 22:09:00']))
          locations = ['HK', 'LDN', 'LDN', 'LDN']
          event = ['foo', 'bar', 'baz', 'qux']
          df = pd.DataFrame({'Location': locations,
                             'Event': event}, index=times)
          df
Out[115]:
                               Event Location
          2014-08-25 21:00:00  foo   HK
          2014-08-25 21:04:00  bar   LDN
          2014-08-25 22:07:00  baz   LDN
          2014-08-25 22:09:00  qux   LDN

I would like to resample the data to aggregate it hourly by count while grouping by location, producing a data frame that looks like this:

Out[115]:
                               HK    LDN
          2014-08-25 21:00:00  1     1
          2014-08-25 22:00:00  0     2

I've tried various combinations of resample() and groupby() but with no luck. How would I go about this?

Accepted answer by unutbu

In my original post, I suggested using pd.TimeGrouper. Nowadays, use pd.Grouper instead of pd.TimeGrouper. The syntax is largely the same, but TimeGrouper is now deprecated in favor of pd.Grouper.

Moreover, while pd.TimeGrouper could only group by DatetimeIndex, pd.Grouper can group by datetime columns, which you can specify through the key parameter.
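
As a rough illustration of the key parameter mentioned above, here is a minimal sketch. It assumes the timestamps live in an ordinary column, hypothetically named "Time", rather than in the index; the name df_col is also made up for this example. It uses the lowercase "h" hourly alias, which newer pandas releases prefer over the "H" used elsewhere in this thread.

```python
import pandas as pd

# Same sample data as in the question, but with the timestamps stored in a
# regular column (hypothetically named "Time") instead of the index.
df_col = pd.DataFrame({
    "Time": pd.to_datetime(["2014-08-25 21:00:00", "2014-08-25 21:04:00",
                            "2014-08-25 22:07:00", "2014-08-25 22:09:00"]),
    "Location": ["HK", "LDN", "LDN", "LDN"],
    "Event": ["foo", "bar", "baz", "qux"],
})

# key= names the datetime column to bin; freq="h" requests hourly bins.
counts = df_col.groupby([pd.Grouper(key="Time", freq="h"), "Location"])["Event"].count()
print(counts)
```

The result is indexed by (hour, Location) pairs, just like the index-based version shown in the rest of this answer.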

You could use a pd.Grouper to group the DatetimeIndex'ed DataFrame by hour:

grouper = df.groupby([pd.Grouper(freq='1H'), 'Location'])

Use count to count the number of events in each group:

grouper['Event'].count()
#                      Location
# 2014-08-25 21:00:00  HK          1
#                      LDN         1
# 2014-08-25 22:00:00  LDN         2
# Name: Event, dtype: int64

Use unstack to move the Location index level to a column level:

grouper['Event'].count().unstack()
# Out[49]: 
# Location             HK  LDN
# 2014-08-25 21:00:00   1    1
# 2014-08-25 22:00:00 NaN    2

and then use fillna to change the NaNs into zeros.

Putting it all together,

grouper = df.groupby([pd.Grouper(freq='1H'), 'Location'])
result = grouper['Event'].count().unstack('Location').fillna(0)

yields

Location             HK  LDN
2014-08-25 21:00:00   1    1
2014-08-25 22:00:00   0    2
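
One detail worth checking on your own data: unstack introduces NaN for the missing (22:00, HK) pair, which converts the counts to floats, and fillna(0) leaves them as floats. The sketch below is a self-contained version of the pipeline above with an explicit cast back to integers added; the astype(int) step and the lowercase "h" alias are my additions, not part of the original answer.

```python
import pandas as pd

times = pd.to_datetime(["2014-08-25 21:00:00", "2014-08-25 21:04:00",
                        "2014-08-25 22:07:00", "2014-08-25 22:09:00"])
df = pd.DataFrame({"Location": ["HK", "LDN", "LDN", "LDN"],
                   "Event": ["foo", "bar", "baz", "qux"]}, index=times)

# Count events per (hour, Location) pair.
counts = df.groupby([pd.Grouper(freq="h"), "Location"])["Event"].count()

# unstack() yields NaN where a pair never occurs, so the column becomes float;
# fillna(0) keeps the float dtype, hence the explicit cast back to int.
result = counts.unstack("Location").fillna(0).astype(int)
print(result)
```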

Answer by Little Bobby Tables

Multiple Column Group By

unutbu is spot on with his answer, but I wanted to add what you could do if you had a third column, say Cost, and wanted to aggregate it like above. It was through combining unutbu's answer and this one that I found out how to do this, and I thought I would share it for future users.

Create a DataFrame with a Cost column:

In[1]:
import pandas as pd
import numpy as np
times = pd.to_datetime([
    "2014-08-25 21:00:00", "2014-08-25 21:04:00",
    "2014-08-25 22:07:00", "2014-08-25 22:09:00"
])
df = pd.DataFrame({
    "Location": ["HK", "LDN", "LDN", "LDN"],
    "Event":    ["foo", "bar", "baz", "qux"],
    "Cost":     [20, 24, 34, 52]
}, index = times)
df

Out[1]:
                     Location  Event  Cost
2014-08-25 21:00:00        HK    foo    20
2014-08-25 21:04:00       LDN    bar    24
2014-08-25 22:07:00       LDN    baz    34
2014-08-25 22:09:00       LDN    qux    52

Now we group, using the agg function to specify each column's aggregation method, e.g. count, mean, sum, etc.

In[2]:
# String aggregation names are the more idiomatic spelling than NumPy callables
grp = df.groupby([pd.Grouper(freq = "1H"), "Location"]) \
      .agg({"Event": "size", "Cost": "mean"})
grp

Out[2]:
                               Event  Cost
                     Location
2014-08-25 21:00:00  HK            1    20
                     LDN           1    24
2014-08-25 22:00:00  LDN           2    43

Then the final step: unstack, fill NaN with zeros, and display as int because it looks nicer.

In[3]: 
grp.unstack().fillna(0).astype(int)

Out[3]:
                    Event     Cost
Location               HK LDN   HK LDN
2014-08-25 21:00:00     1   1   20  24
2014-08-25 22:00:00     0   2    0  43
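
As a possible variation on the agg step in this answer, pandas also supports "named aggregation", where each output column is declared as a (source column, aggregation) pair. The sketch below is only an illustration; the output names n_events and mean_cost are made up for the example, and the lowercase "h" alias is the spelling newer pandas releases prefer.

```python
import pandas as pd

times = pd.to_datetime(["2014-08-25 21:00:00", "2014-08-25 21:04:00",
                        "2014-08-25 22:07:00", "2014-08-25 22:09:00"])
df = pd.DataFrame({"Location": ["HK", "LDN", "LDN", "LDN"],
                   "Event": ["foo", "bar", "baz", "qux"],
                   "Cost": [20, 24, 34, 52]}, index=times)

# Named aggregation: output_column=(input_column, aggregation_name).
# n_events and mean_cost are hypothetical names chosen for this sketch.
summary = df.groupby([pd.Grouper(freq="h"), "Location"]).agg(
    n_events=("Event", "size"),
    mean_cost=("Cost", "mean"),
)
print(summary)
```

This gives flat, descriptive column names, which can be easier to work with after unstacking than the original column labels.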

Answer by Ted Petrou

Pandas 0.21 answer: TimeGrouper is getting deprecated

There are two options for doing this. They actually can give different results based on your data. The first option groups by Location and within Location groups by hour. The second option groups by Location and hour at the same time.

Option 1: Use groupby + resample

grouped = df.groupby('Location').resample('H')['Event'].count()

Option 2: Group both the location and DatetimeIndex together with groupby(pd.Grouper)

grouped = df.groupby(['Location', pd.Grouper(freq='H')])['Event'].count()

They both will result in the following:

Location                     
HK        2014-08-25 21:00:00    1
LDN       2014-08-25 21:00:00    1
          2014-08-25 22:00:00    2
Name: Event, dtype: int64

And then reshape:

grouped.unstack('Location', fill_value=0)

Will output

Location             HK  LDN
2014-08-25 21:00:00   1    1
2014-08-25 22:00:00   0    2
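
To illustrate the remark that the two options can give different results, here is a small sketch on a hypothetical variation of the sample data (df_gap, not from the original post) in which LDN has no event during the 22:00 hour. The column is selected before resampling, which should be equivalent to the form shown above for counting purposes.

```python
import pandas as pd

# Hypothetical variation of the sample data: LDN skips the 22:00 hour.
times = pd.to_datetime(["2014-08-25 21:00:00", "2014-08-25 21:04:00",
                        "2014-08-25 23:09:00"])
df_gap = pd.DataFrame({"Location": ["HK", "LDN", "LDN"],
                       "Event": ["foo", "bar", "qux"]}, index=times)

# Option 1 style: each location is resampled over its own time span, so the
# empty 22:00 bin for LDN appears with a count of 0.
by_resample = df_gap.groupby("Location")["Event"].resample("h").count()

# Option 2 style: only (Location, hour) pairs present in the data are returned.
by_grouper = df_gap.groupby(["Location", pd.Grouper(freq="h")])["Event"].count()

print(by_resample)
print(by_grouper)
```

On this data the resample version has one more row than the Grouper version: the empty LDN bin. After unstack('Location', fill_value=0) the two may look alike again, so the difference matters mostly when you use the long-form result directly.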

Answer by Alexandru Papiu

This can be done without using resample or Grouper, as follows:

df.groupby([df.index.floor("1H"), "Location"]).count()
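
For completeness, here is a self-contained sketch of how the floor-based grouping could be carried through to the table requested in the question. Selecting the Event column and the unstack step are my additions (the one-liner above counts every remaining column and stops at the grouped result), and the lowercase "h" alias is the spelling newer pandas releases prefer.

```python
import pandas as pd

times = pd.to_datetime(["2014-08-25 21:00:00", "2014-08-25 21:04:00",
                        "2014-08-25 22:07:00", "2014-08-25 22:09:00"])
df = pd.DataFrame({"Location": ["HK", "LDN", "LDN", "LDN"],
                   "Event": ["foo", "bar", "baz", "qux"]}, index=times)

# Round each timestamp down to the start of its hour, group by that value
# together with Location, then pivot Location into columns.
hourly = (df.groupby([df.index.floor("h"), "Location"])["Event"]
            .count()
            .unstack("Location", fill_value=0))
print(hourly)
```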