How to do a 'groupby' by multilevel index in Pandas
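Disclaimer: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use it, you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/12190716/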
Asked by bigbug
I have a dataframe 'RPT' indexed by (STK_ID, RPT_Date) that contains the accumulated sales of stocks for each quarter:
                      sales
STK_ID RPT_Date
000876 20060331   798627000
       20060630  1656110000
       20060930  2719700000
       20061231  3573660000
       20070331   878415000
       20070630  2024660000
       20070930  3352630000
       20071231  4791770000
600141 20060331   270912000
       20060630   658981000
       20060930  1010270000
       20061231  1591500000
       20070331   319602000
       20070630   790670000
       20070930  1250530000
       20071231  1711240000
I want to calculate the sales of each single quarter by grouping on STK_ID and RPT_Yr, with something like: RPT.groupby('STK_ID','RPT_Yr')['sales'].transform(lambda x: x-x.shift(1)). How can I do that?
Suppose I can get the year with lambda x : datetime.strptime(x, '%Y%m%d').year
Accepted answer by Wouter Overmeire
I am assuming here that RPT_Date is a string. Is there any reason not to use Datetime?
It is possible to group by functions, but only on an index that is not a MultiIndex. You can work around this by resetting the index and setting 'RPT_Date' as the index, so that the function can extract the year from it (note: pandas toggles between object and int as the dtype for 'RPT_Date').
In [134]: from datetime import datetime
In [135]: year = lambda x : datetime.strptime(str(x), '%Y%m%d').year
In [136]: grouped = RPT.reset_index().set_index('RPT_Date').groupby(['STK_ID', year])
In [137]: for key, df in grouped:
   .....:     print(key)
   .....:     print(df)
   .....:
(876, 2006)
          STK_ID       sales
RPT_Date
20060331     876   798627000
20060630     876  1656110000
20060930     876  2719700000
20061231     876  3573660000
(876, 2007)
          STK_ID       sales
RPT_Date
20070331     876   878415000
20070630     876  2024660000
20070930     876  3352630000
20071231     876  4791770000
(600141, 2006)
          STK_ID       sales
RPT_Date
20060331  600141   270912000
20060630  600141   658981000
20060930  600141  1010270000
20061231  600141  1591500000
(600141, 2007)
          STK_ID       sales
RPT_Date
20070331  600141   319602000
20070630  600141   790670000
20070930  600141  1250530000
20071231  600141  1711240000
Another option is to use a temporary column:
In [153]: RPT_tmp = RPT.reset_index()
In [154]: RPT_tmp['year'] = RPT_tmp['RPT_Date'].apply(year)
In [155]: grouped = RPT_tmp.groupby(['STK_ID', 'year'])
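To finish the calculation the question asks for, the temporary-column grouping can be combined with a per-group difference. The following is a minimal sketch and not part of the original answer: it rebuilds the sample frame from the question (assuming STK_ID and RPT_Date are stored as strings), and the names single_q and sales_q are made up for illustration.

```python
from datetime import datetime

import pandas as pd

# Rebuild the sample frame from the question (string index values are an assumption).
dates = ["20060331", "20060630", "20060930", "20061231",
         "20070331", "20070630", "20070930", "20071231"]
index = pd.MultiIndex.from_product([["000876", "600141"], dates],
                                   names=["STK_ID", "RPT_Date"])
RPT = pd.DataFrame({"sales": [
    798627000, 1656110000, 2719700000, 3573660000,
    878415000, 2024660000, 3352630000, 4791770000,
    270912000, 658981000, 1010270000, 1591500000,
    319602000, 790670000, 1250530000, 1711240000]}, index=index)

year = lambda x: datetime.strptime(str(x), "%Y%m%d").year

# Temporary 'year' column, as in the answer above.
RPT_tmp = RPT.reset_index()
RPT_tmp["year"] = RPT_tmp["RPT_Date"].apply(year)


def single_q(s):
    # Hypothetical helper: subtract the previous cumulative value within the
    # group; the first quarter has no predecessor, so 0 is subtracted from it.
    return s - s.shift(1, fill_value=0)


RPT_tmp["sales_q"] = (
    RPT_tmp.groupby(["STK_ID", "year"])["sales"].transform(single_q)
)
print(RPT_tmp[["STK_ID", "RPT_Date", "sales_q"]])
```

Using shift with fill_value=0 keeps the first quarter of each year at its reported value instead of turning it into NaN, which is what the lambda in the question would produce.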
EDIT: Reorganising your frame makes this much easier.
In [48]: RPT
Out[48]:
                                  sales
STK_ID RPT_Year RPT_Quarter
876    2006     0             798627000
                1            1656110000
                2            2719700000
                3            3573660000
       2007     0             878415000
                1            2024660000
                2            3352630000
                3            4791770000
600141 2006     0             270912000
                1             658981000
                2            1010270000
                3            1591500000
       2007     0             319602000
                1             790670000
                2            1250530000
                3            1711240000
In [49]: RPT.groupby(level=['STK_ID', 'RPT_Year'])['sales'].apply(sale_per_q)
Out[49]:
STK_ID RPT_Year RPT_Quarter
876    2006     0             798627000
                1             857483000
                2            1063590000
                3             853960000
       2007     0             878415000
                1            1146245000
                2            1327970000
                3            1439140000
600141 2006     0             270912000
                1             388069000
                2             351289000
                3             581230000
       2007     0             319602000
                1             471068000
                2             459860000
                3             460710000
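The answer does not show how sale_per_q is defined. The sketch below uses one plausible definition, which is an assumption rather than the author's code: each quarter minus the previous cumulative value within the same stock and year. It also uses transform instead of apply, because the index that apply returns for same-shaped results has differed between pandas versions.

```python
import pandas as pd

# Reorganised frame from the EDIT: one index level each for stock, year, quarter.
index = pd.MultiIndex.from_product(
    [[876, 600141], [2006, 2007], [0, 1, 2, 3]],
    names=["STK_ID", "RPT_Year", "RPT_Quarter"],
)
RPT = pd.DataFrame({"sales": [
    798627000, 1656110000, 2719700000, 3573660000,
    878415000, 2024660000, 3352630000, 4791770000,
    270912000, 658981000, 1010270000, 1591500000,
    319602000, 790670000, 1250530000, 1711240000]}, index=index)


def sale_per_q(s):
    # Assumed definition: cumulative sales minus the previous quarter's
    # cumulative sales; quarter 0 keeps its value because 0 is subtracted.
    return s - s.shift(1, fill_value=0)


# Group on two of the three index levels by name; no reset_index is needed.
result = RPT.groupby(level=["STK_ID", "RPT_Year"])["sales"].transform(sale_per_q)
print(result)
```

With this definition the values agree with the Out[49] listing above.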
Answer by Jonathan
Try:
RPT['sales'].groupby([RPT['STK_ID'],RPT['RPT_Yr']]).sum()
^^ You need to pass the grouping keys as a list. This worked for me.
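In the frame from the question, STK_ID and RPT_Date are index levels rather than columns, so RPT['STK_ID'] would raise a KeyError there. A hedged variant of the same idea, passing a list of keys, is to take the keys from the index with get_level_values. This sketch assumes the index values are 'YYYYMMDD' strings; the names stk, yr and quarterly are illustrative only.

```python
import pandas as pd

# Sample frame from the question (string index values are an assumption).
dates = ["20060331", "20060630", "20060930", "20061231",
         "20070331", "20070630", "20070930", "20071231"]
index = pd.MultiIndex.from_product([["000876", "600141"], dates],
                                   names=["STK_ID", "RPT_Date"])
RPT = pd.DataFrame({"sales": [
    798627000, 1656110000, 2719700000, 3573660000,
    878415000, 2024660000, 3352630000, 4791770000,
    270912000, 658981000, 1010270000, 1591500000,
    319602000, 790670000, 1250530000, 1711240000]}, index=index)

# Grouping keys derived from the index levels, passed together as a list.
stk = RPT.index.get_level_values("STK_ID")
# The first four characters of a 'YYYYMMDD' string are the year.
yr = RPT.index.get_level_values("RPT_Date").str[:4].rename("RPT_Yr")

# diff() leaves NaN on the first row of each group; fill it with the
# reported (cumulative) value of that first quarter.
quarterly = (
    RPT["sales"].groupby([stk, yr]).diff().fillna(RPT["sales"]).astype("int64")
)
print(quarterly)
```

This keeps the original MultiIndex on the result, so no reset_index or temporary column is involved.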

