Ranking within a groupby in Pandas

Question

Asked by user3576212

I have typical "panel data" (in the econometric sense, not the pandas Panel object). The dataframe has a Date column, an ID column, and other columns that contain values. For each Date, I need to rank the IDs cross-sectionally on V1 into 10 groups (deciles) and create a new column called rank_col (taking values 1 to 10) that identifies the group. I then want to pool rank 1, rank 2, ..., rank 10 across time to compute statistics such as the mean and standard deviation.

This is easy to do in SAS with the following code, which also illustrates what I am after:

proc sort data=df;
    by Date;
proc rank data=df out=df_ranked groups=10;
    var V1;
    ranks rank_col;
    by Date;
run;

df_ranked is identical to df except that it has one more column, rank_col, which contains the rank group that each row belongs to.
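For readers who do not use SAS, here is a rough pandas sketch of what the step above computes. It assumes the grouping rule given in the SAS documentation for GROUPS=k, namely floor(rank * k / (n + 1)) with tied values sharing their mean rank, which yields groups 0 to k-1. The helper name sas_like_groups and the toy data are made up for illustration.

```python
import numpy as np
import pandas as pd

def sas_like_groups(values: pd.Series, k: int = 10) -> pd.Series:
    """Approximate PROC RANK GROUPS=k: floor(rank * k / (n + 1)), giving 0..k-1."""
    n = values.notna().sum()               # number of non-missing observations
    ranks = values.rank(method="average")  # ties share their mean rank
    return np.floor(ranks * k / (n + 1))

# Hypothetical toy panel: two dates, four IDs each.
df = pd.DataFrame({
    "Date": ["2013-01-01"] * 4 + ["2013-02-01"] * 4,
    "ID": [1, 2, 3, 4] * 2,
    "V1": [10.0, 8.0, 6.0, 11.0, 1.0, 3.0, 9.0, 7.0],
})

# Rank within each Date; transform keeps the original row order and index.
df["rank_col"] = df.groupby("Date")["V1"].transform(sas_like_groups)
print(df)
```

This rule does not generally produce the same groups as pd.qcut, which cuts at interpolated quantiles of the values, so results can differ between the two approaches.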

Sorry, I don't have sample data to show the structure; it would take a really long dataset to illustrate. But the SAS code shows exactly what I am after.

Thanks for your help!

Answer 1

Accepted answer by user3576212

A way I just figured out:

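import pandas as pd

def grouping(data):
    # Assign each row of this date's cross-section to a decile (0-9) based on V1.
    data = data.copy()
    data['ranks'] = pd.qcut(data['V1'], 10, labels=False)
    return data

# group_keys=False keeps the original row index instead of prepending Date.
df_ranked = df.groupby('Date', group_keys=False).apply(grouping)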

This assumes dec preserves the right position for each row.
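That assumption can be checked directly. The sketch below uses made-up data with a deliberately shuffled index and confirms that the result of pd.qcut carries the same index as its input, so assigning it back to the group lines up row by row.

```python
import pandas as pd

# Hypothetical toy data with a deliberately shuffled, non-sequential index.
s = pd.Series([10, 8, 6, 11, 13, 4, 2], index=[42, 7, 19, 3, 88, 15, 61])
dec = pd.qcut(s, 10, labels=False)

# qcut returns a Series with the same index, in the same order, as its input.
same_index = dec.index.equals(s.index)
print(same_index)

# The smallest value lands in the first bin and the largest in the last one.
print(dec.loc[s.idxmin()], dec.loc[s.idxmax()])
```

Because assignment to a DataFrame column aligns on the index, the labels end up on the right rows even when the index is not sorted.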

Please post if you have a better way, or point out any mistakes in this method.

Thanks!

Edit: You can return just a single new ranks column if you do something like the following:

>>> df


         Date  id  V1
0  2013-01-01   1  10
1  2013-01-01   2   8
2  2013-01-01   3   6
3  2013-01-01   4  11
4  2013-01-01   5  13
5  2013-01-01   6   4
6  2013-01-01   7   2
7  2013-02-01   1   1
8  2013-02-01   2   3
9  2013-02-01   3   9
10 2013-02-01   4  11
11 2013-02-01   5   7
12 2013-02-01   6   4
13 2013-02-01   7   6
14 2013-02-01   8  14

>>> foo = lambda x: pd.Series(pd.qcut(x, 10, labels=False), index=x.index)
>>> # group_keys=False keeps the original index so the result aligns with df
>>> df['ranks'] = df.groupby('Date', group_keys=False)['V1'].apply(foo)
>>> df

         Date  id  V1  ranks
0  2013-01-01   1  10      6
1  2013-01-01   2   8      4
2  2013-01-01   3   6      3
3  2013-01-01   4  11      8
4  2013-01-01   5  13      9
5  2013-01-01   6   4      1
6  2013-01-01   7   2      0
7  2013-02-01   1   1      0
8  2013-02-01   2   3      1
9  2013-02-01   3   9      7
10 2013-02-01   4  11      8
11 2013-02-01   5   7      5
12 2013-02-01   6   4      2
13 2013-02-01   7   6      4
14 2013-02-01   8  14      9

Answer 2

Answer by Zero

It can be simpler, without needing a separate foo:

In [782]: df.groupby('Date')['V1'].transform(lambda x: pd.qcut(x, 10, labels=False))
Out[782]:
0     6
1     4
2     3
3     8
4     9
5     1
6     0
7     0
8     1
9     7
10    8
11    5
12    2
13    4
14    9
Name: V1, dtype: int64

Assign the result to a column:

In [783]: df['ranks'] = df.groupby('Date')['V1'].transform(pd.qcut, 10, labels=False)

In [784]: df
Out[784]:
          Date  id  V1  ranks
0   2013-01-01   1  10      6
1   2013-01-01   2   8      4
2   2013-01-01   3   6      3
3   2013-01-01   4  11      8
4   2013-01-01   5  13      9
5   2013-01-01   6   4      1
6   2013-01-01   7   2      0
7   2013-02-01   1   1      0
8   2013-02-01   2   3      1
9   2013-02-01   3   9      7
10  2013-02-01   4  11      8
11  2013-02-01   5   7      5
12  2013-02-01   6   4      2
13  2013-02-01   7   6      4
14  2013-02-01   8  14      9
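One caveat with the calls above: pd.qcut raises a ValueError when tied values make some bin edges coincide, and the labels run from 0 to 9 rather than the 1 to 10 asked for in the question. A possible workaround, shown here as a sketch rather than as part of either answer, is to rank first with method="first" so that ties are broken by order of appearance, and then add 1. The helper name decile_1_to_10 and the tied toy data are made up for illustration.

```python
import pandas as pd

# Hypothetical data with heavy ties inside each date.
df = pd.DataFrame({
    "Date": ["2013-01-01"] * 5 + ["2013-02-01"] * 5,
    "V1": [5, 5, 5, 5, 9, 1, 2, 2, 2, 2],
})

# Plain qcut fails here because several quantile edges are identical.
try:
    pd.qcut(df.loc[df["Date"] == "2013-01-01", "V1"], 10, labels=False)
    raised = False
except ValueError:
    raised = True

def decile_1_to_10(x: pd.Series) -> pd.Series:
    # rank(method="first") yields distinct ranks 1..n, so the quantile
    # edges computed by qcut are always unique. Adding 1 shifts 0-9 to 1-10.
    return pd.qcut(x.rank(method="first"), 10, labels=False) + 1

df["rank_col"] = df.groupby("Date")["V1"].transform(decile_1_to_10)
print(df)
```

Breaking ties by position is arbitrary, so tied rows can land in different deciles; whether that is acceptable depends on the analysis.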

Details (the input df):

In [786]: df
Out[786]:
          Date  id  V1
0   2013-01-01   1  10
1   2013-01-01   2   8
2   2013-01-01   3   6
3   2013-01-01   4  11
4   2013-01-01   5  13
5   2013-01-01   6   4
6   2013-01-01   7   2
7   2013-02-01   1   1
8   2013-02-01   2   3
9   2013-02-01   3   9
10  2013-02-01   4  11
11  2013-02-01   5   7
12  2013-02-01   6   4
13  2013-02-01   7   6
14  2013-02-01   8  14
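The question also asks for statistics pooled across time within each rank group, which neither answer shows. A minimal sketch of that last step, rebuilding the sample frame from this answer and assuming the ranks column is computed as above:

```python
import pandas as pd

# The sample data used in the answers above.
df = pd.DataFrame({
    "Date": ["2013-01-01"] * 7 + ["2013-02-01"] * 8,
    "id": [1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8],
    "V1": [10, 8, 6, 11, 13, 4, 2, 1, 3, 9, 11, 7, 4, 6, 14],
})

# Decile (0-9) of V1 within each date.
df["ranks"] = df.groupby("Date")["V1"].transform(
    lambda x: pd.qcut(x, 10, labels=False)
)

# Pool every date's rows by decile, then summarise V1 within each decile.
pooled = df.groupby("ranks")["V1"].agg(["mean", "std", "count"])
print(pooled)
```

Grouping by ranks alone pools all dates together; grouping by both Date and ranks would instead give per-date statistics.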
