Ranks within groupby in pandas

Disclaimer: This page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/23976176/

Time: 2020-09-13 22:06:39  Source: igfitidea

Tags: python, pandas, group-by

Asked by user3576212

I have typical "panel data" (in the econometric sense, not the pandas Panel object). The dataframe has a Date column, an ID column, and other columns that contain certain values. For each Date, I need to rank the IDs cross-sectionally on V1 into 10 groups (deciles) and create a new column called rank_col (taking values 1 to 10) that identifies the rank. I then want to pool all the rank 1, rank 2, ..., rank 10 rows across time to get statistics such as the mean and standard deviation.

This can be done easily in SAS with the following code, which also illustrates my purpose:

proc sort data=df;
    by Date;
proc rank data=df out=df_ranked groups=10;
    var V1;
    ranks rank_col;
    by Date;
run;

df_ranked is identical to df except that it has one more column, rank_col, which contains the rank group that each row belongs to.

Sorry, I don't have sample data to show the structure; it would take a really long dataset to illustrate. But the SAS code shows exactly what I am after.

Thanks for your help!

Accepted answer by user3576212

Here is a way I just figured out:

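import pandas as pd

def grouping(data):
    # Bin V1 into deciles within this Date; labels=False returns bin numbers 0-9
    data = data.copy()
    data['ranks'] = pd.qcut(data['V1'], 10, labels=False)
    return data

# group_keys=False keeps the original row index instead of adding Date as an index level
df_ranked = df.groupby('Date', group_keys=False).apply(grouping)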

This assumes that dec preserves the right position for each row.
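That assumption can be checked directly: when given a Series, pd.qcut returns a Series that carries the same index, so assigning it back lines up row by row. A minimal sketch on a made-up Series with a shuffled index (the values and index labels are illustrative, not from the question):

```python
import pandas as pd

# Hypothetical data: four values with a deliberately shuffled index
s = pd.Series([10, 8, 6, 11], index=[3, 0, 2, 1])

# Split into two quantile bins; labels=False returns integer bin numbers
dec = pd.qcut(s, 2, labels=False)

# The result keeps the input's index, so each bin stays attached to its row
same_index = dec.index.equals(s.index)
print(same_index)
print(dec.tolist())
```

Because pandas aligns on the index when assigning a Series to a column, the row order inside each group does not matter.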

Please post if you have a better way, or point out any mistakes in this method.

Thanks!

Edit: You can return just a single new ranks column if you do something like the following:

>>> df
         Date  id  V1
0  2013-01-01   1  10
1  2013-01-01   2   8
2  2013-01-01   3   6
3  2013-01-01   4  11
4  2013-01-01   5  13
5  2013-01-01   6   4
6  2013-01-01   7   2
7  2013-02-01   1   1
8  2013-02-01   2   3
9  2013-02-01   3   9
10 2013-02-01   4  11
11 2013-02-01   5   7
12 2013-02-01   6   4
13 2013-02-01   7   6
14 2013-02-01   8  14

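>>> # group_keys=False keeps the original row index; recent pandas versions
>>> # otherwise prepend Date to it and the column assignment no longer aligns
>>> foo = lambda x: pd.Series(pd.qcut(x, 10, labels=False), index=x.index)
>>> df['ranks'] = df.groupby('Date', group_keys=False)['V1'].apply(foo)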
>>> df

         Date  id  V1  ranks
0  2013-01-01   1  10      6
1  2013-01-01   2   8      4
2  2013-01-01   3   6      3
3  2013-01-01   4  11      8
4  2013-01-01   5  13      9
5  2013-01-01   6   4      1
6  2013-01-01   7   2      0
7  2013-02-01   1   1      0
8  2013-02-01   2   3      1
9  2013-02-01   3   9      7
10 2013-02-01   4  11      8
11 2013-02-01   5   7      5
12 2013-02-01   6   4      2
13 2013-02-01   7   6      4
14 2013-02-01   8  14      9
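The question also asks for statistics pooled across dates for each rank group, which the snippet above does not cover. One possible way to do that, sketched on the same sample frame, is a second groupby on the new column. The name pooled is made up for illustration, and the ranks are computed with transform here purely for brevity.

```python
import pandas as pd

# Sample frame matching the one shown above
df = pd.DataFrame({
    "Date": ["2013-01-01"] * 7 + ["2013-02-01"] * 8,
    "id": list(range(1, 8)) + list(range(1, 9)),
    "V1": [10, 8, 6, 11, 13, 4, 2, 1, 3, 9, 11, 7, 4, 6, 14],
})

# Decile number (0-9) of V1 within each Date
df["ranks"] = df.groupby("Date")["V1"].transform(
    lambda x: pd.qcut(x, 10, labels=False)
)

# Pool rows from all dates by decile and summarise V1
pooled = df.groupby("ranks")["V1"].agg(["mean", "std", "count"])
print(pooled)
```

Note that labels=False numbers the groups 0 to 9; add 1 to the column if values 1 to 10 are wanted. A group that holds a single row gets NaN for std.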

Answer by Zero

It can be simpler, with no need for a separate foo:

In [782]: df.groupby('Date')['V1'].transform(lambda x: pd.qcut(x, 10, labels=False))
Out[782]:
0     6
1     4
2     3
3     8
4     9
5     1
6     0
7     0
8     1
9     7
10    8
11    5
12    2
13    4
14    9
Name: V1, dtype: int64

Assign the result to a column:

In [783]: df['ranks'] = df.groupby('Date')['V1'].transform(pd.qcut, 10, labels=False)

In [784]: df
Out[784]:
          Date  id  V1  ranks
0   2013-01-01   1  10      6
1   2013-01-01   2   8      4
2   2013-01-01   3   6      3
3   2013-01-01   4  11      8
4   2013-01-01   5  13      9
5   2013-01-01   6   4      1
6   2013-01-01   7   2      0
7   2013-02-01   1   1      0
8   2013-02-01   2   3      1
9   2013-02-01   3   9      7
10  2013-02-01   4  11      8
11  2013-02-01   5   7      5
12  2013-02-01   6   4      2
13  2013-02-01   7   6      4
14  2013-02-01   8  14      9
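One caveat with pd.qcut: when a group contains many tied values, several computed bin edges can coincide, and pd.qcut then raises a ValueError. A common workaround, sketched here on made-up data, is to break the ties by ranking with method="first" before binning. This puts tied values into different bins in order of appearance, which may or may not be acceptable for your analysis.

```python
import pandas as pd

# Hypothetical group with heavy ties
s = pd.Series([1, 1, 1, 1, 2, 3])

# Plain qcut fails here because several bin edges collapse to the same value
try:
    pd.qcut(s, 3, labels=False)
    raised = False
except ValueError:
    raised = True

# Ranking with method="first" gives unique ranks 1..n, so the edges no longer collapse
bins = pd.qcut(s.rank(method="first"), 3, labels=False)
print(raised, bins.tolist())
```

Within a groupby, the same idea would read df.groupby('Date')['V1'].transform(lambda x: pd.qcut(x.rank(method='first'), 10, labels=False)).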


Details

In [786]: df
Out[786]:
          Date  id  V1
0   2013-01-01   1  10
1   2013-01-01   2   8
2   2013-01-01   3   6
3   2013-01-01   4  11
4   2013-01-01   5  13
5   2013-01-01   6   4
6   2013-01-01   7   2
7   2013-02-01   1   1
8   2013-02-01   2   3
9   2013-02-01   3   9
10  2013-02-01   4  11
11  2013-02-01   5   7
12  2013-02-01   6   4
13  2013-02-01   7   6
14  2013-02-01   8  14
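For readers who need to match the SAS output closely: pd.qcut bins by quantile edges, whereas PROC RANK with GROUPS=k is documented as computing FLOOR(rank*k/(n+1)), where n is the number of non-missing values and ties receive the mean rank by default. The two can therefore assign different group numbers. A minimal sketch of that formula follows; the helper name sas_like_groups is made up for illustration, and the result has not been compared against SAS itself.

```python
import numpy as np
import pandas as pd

def sas_like_groups(x, k=10):
    """Hypothetical helper: floor(rank * k / (n + 1)) with average ranks for ties."""
    rank = x.rank(method="average")  # missing values keep a NaN rank
    n = x.notna().sum()              # number of non-missing values in the group
    return np.floor(rank * k / (n + 1))

df = pd.DataFrame({
    "Date": ["2013-01-01"] * 7 + ["2013-02-01"] * 8,
    "V1": [10, 8, 6, 11, 13, 4, 2, 1, 3, 9, 11, 7, 4, 6, 14],
})

# Apply the formula within each Date; transform keeps the original row order
df["rank_col"] = df.groupby("Date")["V1"].transform(sas_like_groups)
print(df)
```

With only 7 or 8 rows per date, this formula does not use every group number from 0 to 9, so the column differs from the qcut result above even though the ordering within each date is the same.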