Python Pandas: resampling a time series with groupby
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/32012012/
Pandas: resample timeseries with groupby
Asked by AshB
Given the below pandas DataFrame:
In [115]: times = pd.to_datetime(pd.Series(['2014-08-25 21:00:00', '2014-08-25 21:04:00',
                                            '2014-08-25 22:07:00', '2014-08-25 22:09:00']))
          locations = ['HK', 'LDN', 'LDN', 'LDN']
          event = ['foo', 'bar', 'baz', 'qux']
          df = pd.DataFrame({'Location': locations,
                             'Event': event}, index=times)
          df
Out[115]:
Event Location
2014-08-25 21:00:00 foo HK
2014-08-25 21:04:00 bar LDN
2014-08-25 22:07:00 baz LDN
2014-08-25 22:09:00 qux LDN
I would like to resample the data, aggregating hourly counts while grouping by location, to produce a data frame that looks like this:
Out[115]:
HK LDN
2014-08-25 21:00:00 1 1
2014-08-25 22:00:00 0 2
I've tried various combinations of resample() and groupby() but with no luck. How would I go about this?
Accepted answer by unutbu
In my original post, I suggested using pd.TimeGrouper.
Nowadays, use pd.Grouper instead of pd.TimeGrouper. The syntax is largely the same, but TimeGrouper is now deprecated in favor of pd.Grouper.
Moreover, while pd.TimeGrouper could only group by DatetimeIndex, pd.Grouper can group by datetime columns, which you can specify through the key parameter.
You could use a pd.Grouper to group the DatetimeIndex'ed DataFrame by hour:
grouper = df.groupby([pd.Grouper(freq='1H'), 'Location'])
Use count to count the number of events in each group:
grouper['Event'].count()
# Location
# 2014-08-25 21:00:00 HK 1
# LDN 1
# 2014-08-25 22:00:00 LDN 2
# Name: Event, dtype: int64
Use unstack to move the Location index level to a column level:
grouper['Event'].count().unstack()
# Out[49]:
# Location HK LDN
# 2014-08-25 21:00:00 1 1
# 2014-08-25 22:00:00 NaN 2
and then use fillna to change the NaNs into zeros.
Putting it all together,
grouper = df.groupby([pd.Grouper(freq='1H'), 'Location'])
result = grouper['Event'].count().unstack('Location').fillna(0)
yields
Location HK LDN
2014-08-25 21:00:00 1 1
2014-08-25 22:00:00 0 2
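As a small variation on the steps above, unstack itself takes a fill_value parameter, which lets you skip the separate fillna step and keep the counts as integers. A minimal sketch, assuming a reasonably recent pandas:

```python
import pandas as pd

times = pd.to_datetime(pd.Series(['2014-08-25 21:00:00', '2014-08-25 21:04:00',
                                  '2014-08-25 22:07:00', '2014-08-25 22:09:00']))
df = pd.DataFrame({'Location': ['HK', 'LDN', 'LDN', 'LDN'],
                   'Event': ['foo', 'bar', 'baz', 'qux']}, index=times)

# unstack(fill_value=0) fills the empty (hour, location) cells directly,
# so the counts stay integers instead of being upcast to float by NaN.
grouper = df.groupby([pd.Grouper(freq='1H'), 'Location'])
result = grouper['Event'].count().unstack('Location', fill_value=0)
```

This yields the same HK/LDN table as above, with a 0 instead of NaN for the empty HK hour.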
Answered by Little Bobby Tables
Multiple Column Group By
unutbu is spot on with his answer, but I wanted to add what you could do if you had a third column, say Cost, and wanted to aggregate it like above. It was through combining unutbu's answer and this one that I found out how to do this, and I thought I would share for future users.
Create a DataFrame with a Cost column:
In[1]:
import pandas as pd
import numpy as np

times = pd.to_datetime([
    "2014-08-25 21:00:00", "2014-08-25 21:04:00",
    "2014-08-25 22:07:00", "2014-08-25 22:09:00"
])
df = pd.DataFrame({
    "Location": ["HK", "LDN", "LDN", "LDN"],
    "Event": ["foo", "bar", "baz", "qux"],
    "Cost": [20, 24, 34, 52]
}, index=times)
df
Out[1]:
Location Event Cost
2014-08-25 21:00:00 HK foo 20
2014-08-25 21:04:00 LDN bar 24
2014-08-25 22:07:00 LDN baz 34
2014-08-25 22:09:00 LDN qux 52
Now we group, using the agg function to specify each column's aggregation method, e.g. count, mean, sum, etc.
In[2]:
grp = df.groupby([pd.Grouper(freq="1H"), "Location"]) \
        .agg({"Event": np.size, "Cost": np.mean})
grp
Out[2]:
Event Cost
Location
2014-08-25 21:00:00 HK 1 20
LDN 1 24
2014-08-25 22:00:00 LDN 2 43
Then do the final unstack, filling NaN with zeros and displaying as int, because it's nicer.
In[3]:
grp.unstack().fillna(0).astype(int)
Out[3]:
Event Cost
Location HK LDN HK LDN
2014-08-25 21:00:00 1 1 20 24
2014-08-25 22:00:00 0 2 0 43
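On recent pandas versions, passing NumPy functions like np.size and np.mean to agg triggers deprecation warnings; the string names "size" and "mean" are the supported spelling. A sketch of the same multi-column aggregation, assuming a reasonably recent pandas:

```python
import pandas as pd

times = pd.to_datetime(['2014-08-25 21:00:00', '2014-08-25 21:04:00',
                        '2014-08-25 22:07:00', '2014-08-25 22:09:00'])
df = pd.DataFrame({'Location': ['HK', 'LDN', 'LDN', 'LDN'],
                   'Event': ['foo', 'bar', 'baz', 'qux'],
                   'Cost': [20, 24, 34, 52]}, index=times)

# String aggregation names ('size', 'mean') replace np.size / np.mean,
# avoiding the NumPy-function deprecation path in newer pandas.
grp = (df.groupby([pd.Grouper(freq='1H'), 'Location'])
         .agg({'Event': 'size', 'Cost': 'mean'}))
result = grp.unstack().fillna(0).astype(int)
```

The result matches the table above: Event counts and mean Cost, spread across HK/LDN columns.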
Answered by Ted Petrou
Pandas 0.21 answer: TimeGrouper is getting deprecated
There are two options for doing this. They actually can give different results based on your data. The first option groups by Location and within Location groups by hour. The second option groups by Location and hour at the same time.
Option 1: Use groupby + resample
grouped = df.groupby('Location').resample('H')['Event'].count()
Option 2: Group both the location and DatetimeIndex together with groupby(pd.Grouper)
grouped = df.groupby(['Location', pd.Grouper(freq='H')])['Event'].count()
They both will result in the following:
Location
HK 2014-08-25 21:00:00 1
LDN 2014-08-25 21:00:00 1
2014-08-25 22:00:00 2
Name: Event, dtype: int64
And then reshape:
grouped.unstack('Location', fill_value=0)
Will output
Location HK LDN
2014-08-25 21:00:00 1 1
2014-08-25 22:00:00 0 2
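For completeness, the same wide table can also be built in one call with pivot_table, which accepts an array-like index such as floored timestamps. A sketch, assuming a reasonably recent pandas:

```python
import pandas as pd

times = pd.to_datetime(['2014-08-25 21:00:00', '2014-08-25 21:04:00',
                        '2014-08-25 22:07:00', '2014-08-25 22:09:00'])
df = pd.DataFrame({'Location': ['HK', 'LDN', 'LDN', 'LDN'],
                   'Event': ['foo', 'bar', 'baz', 'qux']}, index=times)

# pivot_table accepts an array-like index, so the floored timestamps
# serve as rows while Location becomes the columns; fill_value=0
# replaces the would-be NaN for the empty HK hour.
result = df.pivot_table(index=df.index.floor('1H'), columns='Location',
                        values='Event', aggfunc='count', fill_value=0)
```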
Answered by Alexandru Papiu
This can be done without using resample or Grouper, as follows:
df.groupby([df.index.floor("1H"), "Location"]).count()
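Carried through to the same wide table as the other answers, the floor-based approach might look like this (a sketch, assuming a reasonably recent pandas):

```python
import pandas as pd

times = pd.to_datetime(['2014-08-25 21:00:00', '2014-08-25 21:04:00',
                        '2014-08-25 22:07:00', '2014-08-25 22:09:00'])
df = pd.DataFrame({'Location': ['HK', 'LDN', 'LDN', 'LDN'],
                   'Event': ['foo', 'bar', 'baz', 'qux']}, index=times)

# Flooring each timestamp to the hour turns the index into a plain
# grouping key, so no resample or Grouper object is needed.
result = (df.groupby([df.index.floor('1H'), 'Location'])['Event']
            .count()
            .unstack('Location', fill_value=0))
```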