Ranks within groupby in pandas
Original question: http://stackoverflow.com/questions/23976176/
Note: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): StackOverflow
Asked by user3576212
I have typical "panel data" (in the econometric sense, not the pandas Panel object). The dataframe has a Date column and an ID column, plus other columns that contain certain values. For each Date, I need to rank the IDs cross-sectionally on V1 into 10 groups (deciles) and create a new column called rank_col (taking values 1 to 10) to identify the rank. I then want to pool all the rank 1, rank 2, ..., rank 10 rows across time to get statistics such as the mean and standard deviation.
This can be accomplished easily in SAS with the following code, which also illustrates my purpose:
proc sort data=df;
by Date;
proc rank data=df out=df_ranked groups=10;
var V1;
ranks rank_col;
by Date;
run;
df_ranked is identical to df except that it has one more column, called rank_col, which contains the rank group that each row belongs to.
Sorry, I don't have sample data to show the structure; a really long dataset would be needed to illustrate it. But the SAS code shows exactly what I am after.
Thanks for your help!
Accepted answer by user3576212
A way I just figured out:
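import pandas as pd

def grouping(data):
    # Bin V1 into 10 quantile groups within this date; labels=False gives 0..9
    dec = pd.qcut(data['V1'], 10, labels=False)
    data['ranks'] = dec
    return data

# group_keys=False keeps the original row index in the result
df_ranked = df.groupby('Date', group_keys=False).apply(grouping)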
This assumes dec preserves the right position for each row.
Please post if you have a better way, or point out any mistakes in this method.
Thanks!
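The assumption that dec keeps each row's position can be checked directly. The sketch below uses a made-up Series (the shuffled index is hypothetical, chosen only to make misalignment visible) and confirms that pd.qcut returns a Series carrying the same index as its input, so the assignment aligns row by row.

```python
import pandas as pd

# Hypothetical sample with a deliberately shuffled, non-default index.
s = pd.Series([10, 8, 6, 11, 13, 4, 2], index=[3, 0, 6, 1, 5, 2, 4], name="V1")

# For Series input, qcut returns a Series with the same index.
dec = pd.qcut(s, 10, labels=False)

# The labels stay attached to the original index labels.
same_index = dec.index.equals(s.index)

# Walking the rows from smallest to largest V1, the bin labels never decrease.
monotonic = dec.loc[s.sort_values().index].is_monotonic_increasing

print(same_index, monotonic)
```

Because the result is index-aligned, assigning it back to a column of the same group does not depend on row order.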
Edit: You can return just a single new ranks column if you do something like the following:
>>> df
Date id V1
0 2013-01-01 1 10
1 2013-01-01 2 8
2 2013-01-01 3 6
3 2013-01-01 4 11
4 2013-01-01 5 13
5 2013-01-01 6 4
6 2013-01-01 7 2
7 2013-02-01 1 1
8 2013-02-01 2 3
9 2013-02-01 3 9
10 2013-02-01 4 11
11 2013-02-01 5 7
12 2013-02-01 6 4
13 2013-02-01 7 6
14 2013-02-01 8 14
>>> foo = lambda x: pd.Series(pd.qcut(x,10,labels=False),index=x.index)
>>> df['ranks'] = df.groupby('Date')['V1'].apply(foo)
>>> df
Date id V1 ranks
0 2013-01-01 1 10 6
1 2013-01-01 2 8 4
2 2013-01-01 3 6 3
3 2013-01-01 4 11 8
4 2013-01-01 5 13 9
5 2013-01-01 6 4 1
6 2013-01-01 7 2 0
7 2013-02-01 1 1 0
8 2013-02-01 2 3 1
9 2013-02-01 3 9 7
10 2013-02-01 4 11 8
11 2013-02-01 5 7 5
12 2013-02-01 6 4 2
13 2013-02-01 7 6 4
14 2013-02-01 8 14 9
Answer by Zero
It could be simpler, without needing a separate foo:
In [782]: df.groupby('Date')['V1'].transform(lambda x: pd.qcut(x, 10, labels=False))
Out[782]:
0 6
1 4
2 3
3 8
4 9
5 1
6 0
7 0
8 1
9 7
10 8
11 5
12 2
13 4
14 9
Name: V1, dtype: int64
Assign to a column:
In [783]: df['ranks'] = df.groupby('Date')['V1'].transform(pd.qcut, 10, labels=False)
In [784]: df
Out[784]:
Date id V1 ranks
0 2013-01-01 1 10 6
1 2013-01-01 2 8 4
2 2013-01-01 3 6 3
3 2013-01-01 4 11 8
4 2013-01-01 5 13 9
5 2013-01-01 6 4 1
6 2013-01-01 7 2 0
7 2013-02-01 1 1 0
8 2013-02-01 2 3 1
9 2013-02-01 3 9 7
10 2013-02-01 4 11 8
11 2013-02-01 5 7 5
12 2013-02-01 6 4 2
13 2013-02-01 7 6 4
14 2013-02-01 8 14 9
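One caveat, offered as a sketch rather than as part of the original answers: pd.qcut can raise a ValueError when a group has so many tied values that the quantile edges are not unique. Assuming arbitrary tie-breaking is acceptable, ranking with method="first" before binning avoids this. Adding 1 gives the 1 to 10 labels the question asks for. The data and the helper name decile are hypothetical.

```python
import pandas as pd

# Hypothetical data with heavy ties inside each date.
df = pd.DataFrame({
    "Date": ["2013-01-01"] * 10 + ["2013-02-01"] * 10,
    "V1": [1, 1, 1, 1, 1, 2, 3, 4, 5, 6] + [7, 7, 7, 7, 8, 8, 9, 9, 9, 9],
})

def decile(x):
    # rank(method="first") breaks ties by order of appearance, so the
    # values passed to qcut are unique and the bin edges cannot collide.
    ranked = x.rank(method="first")
    # Shift the 0..9 labels to 1..10.
    return pd.qcut(ranked, 10, labels=False) + 1

df["rank_col"] = df.groupby("Date")["V1"].transform(decile)
print(df)
```

The trade-off is that tied values get split across adjacent groups, which may not match how SAS PROC RANK treats ties.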
Details
In [786]: df
Out[786]:
Date id V1
0 2013-01-01 1 10
1 2013-01-01 2 8
2 2013-01-01 3 6
3 2013-01-01 4 11
4 2013-01-01 5 13
5 2013-01-01 6 4
6 2013-01-01 7 2
7 2013-02-01 1 1
8 2013-02-01 2 3
9 2013-02-01 3 9
10 2013-02-01 4 11
11 2013-02-01 5 7
12 2013-02-01 6 4
13 2013-02-01 7 6
14 2013-02-01 8 14
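The question also asks for statistics pooled across time for each rank group, which the answers do not show. A minimal sketch, assuming the sample frame above and the ranks column built with transform, is a second groupby on the rank label. The name stats is illustrative.

```python
import pandas as pd

# The sample frame from the answers above.
df = pd.DataFrame({
    "Date": ["2013-01-01"] * 7 + ["2013-02-01"] * 8,
    "id": list(range(1, 8)) + list(range(1, 9)),
    "V1": [10, 8, 6, 11, 13, 4, 2, 1, 3, 9, 11, 7, 4, 6, 14],
})

# Decile label (0..9) within each Date.
df["ranks"] = df.groupby("Date")["V1"].transform(
    lambda x: pd.qcut(x, 10, labels=False)
)

# Pool every row that shares a decile label, regardless of Date.
stats = df.groupby("ranks")["V1"].agg(["mean", "std", "count"])
print(stats)
```

With so few rows per date, some deciles hold a single observation, so their std is NaN; on a long panel each group would pool one or more rows per date.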

