Original question: http://stackoverflow.com/questions/20805299/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverflow
TimeSeries with a groupby in Pandas
Asked by mikebmassey
I would like to look at TimeSeries data for every client over various time periods in Pandas.
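import random
import numpy as np
import pandas as pd

# 100 orders spread over 10 clients and 100 distinct days in 2013
clients = np.random.randint(1, 11, size=100)
dates = pd.date_range('20130101', periods=365)
OrderDates = random.sample(list(dates), 100)  # random.sample needs a sequence
Values = np.random.randint(10, 250, size=100)
df = pd.DataFrame({'Client': clients, 'OrderDate': OrderDates, 'Value': Values})
# df.sort was removed from pandas; sort_values replaces it, and ascending takes booleans
df = df.sort_values(['OrderDate', 'Client'], ascending=[True, True])
df.head()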
What I am trying to accomplish is to get the count and the sum of the 'Value' column, grouped by 'Client' for various time periods (Monthly, Quarterly, Yearly - I will likely build 3 different dataframes for this data, then make the dataframes 'wide').
For Quarterly, I would expect something like this:
Client  OrderDate   NumberofEntries  SumofValues
1       2013-03-31  7                28
1       2013-06-30  2                7
1       2013-09-30  6                20
1       2013-12-31  1                3
2       2013-03-31  1                4
2       2013-06-30  2                8
2       2013-09-30  3                17
2       2013-12-31  4                24
I could append to that data frame by getting the quarter for each entry (or month, or year), then use the Pandas groupby function, but that seems like extra work when I should be using TimeSeries.
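For reference, the "extra work" route described above might look roughly like the sketch below. It is only an illustration: the five-row frame is a made-up stand-in for the random data, the `Quarter` column name is an assumption, and the output column names are borrowed from the expected table.

```python
import pandas as pd

# Small deterministic stand-in for the random data in the question.
df = pd.DataFrame({
    'Client': [1, 1, 1, 2, 2],
    'OrderDate': pd.to_datetime(['2013-01-15', '2013-02-20', '2013-07-04',
                                 '2013-03-31', '2013-11-11']),
    'Value': [10, 20, 30, 40, 50],
})

# Label each order with the calendar quarter it falls in
# ('Quarter' is a hypothetical column name).
df['Quarter'] = df['OrderDate'].dt.to_period('Q')

# An ordinary groupby on client and quarter, then count and sum of Value.
summary = (
    df.groupby(['Client', 'Quarter'])['Value']
      .agg(['count', 'sum'])
      .rename(columns={'count': 'NumberofEntries', 'sum': 'SumofValues'})
      .reset_index()
)
print(summary)
```

Swapping `'Q'` for `'M'` or `'Y'` in `to_period` should give the monthly and yearly variants, at the cost of carrying a helper column around.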
I've read the documentation and reviewed a TimeSeries demonstration by Wes, but I don't see a way to do a groupby for the Client and then perform the TimeSeries operations over the time periods I am trying to build. (Alternatively, I could run a for loop and build the dataframe that way, but again, that seems like more work than there should be.)
Is there a way to combine a groupby process with TimeSeries?
Answered by Andy Hayden
A slight alternative is to set_index before doing the groupby:
In [11]: df.set_index('OrderDate', inplace=True)
In [12]: g = df.groupby('Client')
In [13]: g['Value'].resample('Q', how=[np.sum, len])
Out[13]:
                   sum  len
Client OrderDate
1      2013-03-31  239    1
       2013-06-30   83    1
       2013-09-30  249    2
       2013-12-31  506    3
2      2013-03-31  581    4
       2013-06-30  569    4
       2013-09-30  316    4
       2013-12-31  465    5
...
Note: you don't need to do the sort before doing this.
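Newer pandas releases no longer accept the `how=` argument to `resample`, so the session above may not run as written. A minimal sketch of the same idea in the current method-chaining style follows; the five-row frame is a made-up stand-in for the question's data, and the frequency is passed as an offset object only because the spelling of the string alias (`'Q'` versus `'QE'`) differs between pandas versions.

```python
import pandas as pd

# Small deterministic stand-in for the random data in the question.
df = pd.DataFrame({
    'Client': [1, 1, 1, 2, 2],
    'OrderDate': pd.to_datetime(['2013-01-15', '2013-02-20', '2013-07-04',
                                 '2013-03-31', '2013-11-11']),
    'Value': [10, 20, 30, 40, 50],
})

# Same steps as the answer: index by date, group by client, resample by quarter.
resampled = (
    df.set_index('OrderDate')
      .groupby('Client')['Value']
      .resample(pd.offsets.QuarterEnd())
)

# Sum and count per (Client, quarter end); the two Series share one MultiIndex.
quarterly = pd.DataFrame({'sum': resampled.sum(), 'count': resampled.count()})
print(quarterly)
```

Quarters that fall between a client's first and last order but contain no orders still appear in the result, with a count of 0.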
Answered by joris
Something like this? I'm first doing a groupby, and then applying a resample on each group.
In [11]: grouped = df.groupby('Client')
In [12]: result = grouped.apply(lambda x: x.set_index('OrderDate').resample('Q', how=[np.sum, len]))
In [13]: result['Value']
Out[13]:
                    sum  len
Client OrderDate
1      2013-03-31   227    4
       2013-06-30   344    2
       2013-09-30   234    1
2      2013-03-31   299    2
       2013-06-30   538    4
       2013-09-30   236    2
       2013-12-31  1124    7
3      2013-03-31   496    4
       2013-06-30   NaN    0
       2013-09-30   167    2
       2013-12-31   218    1
Update: with the suggestion of @AndyHayden in his answer, this becomes much cleaner code:
df = df.set_index('OrderDate')
grouped = df.groupby('Client')
grouped['Value'].resample('Q', how=[np.sum, len])
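Since the question asks for monthly, quarterly and yearly summaries, here is one possible way to produce all three without setting the index first. This is not taken from either answer: it swaps in `pd.Grouper`, a different technique, and `summarize` is a hypothetical helper name. The five-row frame is again a made-up stand-in for the question's data.

```python
import pandas as pd

# Small deterministic stand-in for the random data in the question.
df = pd.DataFrame({
    'Client': [1, 1, 1, 2, 2],
    'OrderDate': pd.to_datetime(['2013-01-15', '2013-02-20', '2013-07-04',
                                 '2013-03-31', '2013-11-11']),
    'Value': [10, 20, 30, 40, 50],
})


def summarize(frame, offset):
    """Hypothetical helper: sum and count of Value per client and period."""
    # pd.Grouper bins the OrderDate column by the given offset, so the
    # dates can stay an ordinary column instead of becoming the index.
    grouper = pd.Grouper(key='OrderDate', freq=offset)
    return frame.groupby(['Client', grouper])['Value'].agg(['sum', 'count'])


monthly = summarize(df, pd.offsets.MonthEnd())
quarterly = summarize(df, pd.offsets.QuarterEnd())
yearly = summarize(df, pd.offsets.YearEnd())

# One way to make a summary "wide": one row per client, one column per period.
wide = quarterly['sum'].unstack('OrderDate')
print(quarterly)
print(wide)
```

Unlike the resample version, this form only lists the periods in which a client actually has orders, so the wide frame shows NaN where a client has no orders in a period.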

