percentile rank in pandas in groups
Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/22339758/
Asked by itjcms18
I can't quite figure out how to write a function to compute a grouped percentile. I have all teams from 1985-2012 in a data frame; the first 10 rows are shown below, currently sorted by year. I want to compute a percentile for LgRnk grouped by Year. For instance, an LgRnk of 23 (worst team) in 1985 would be the 100th percentile and an LgRnk of 1 (best team) in 1985 would be the 1st percentile; an LgRnk of 30 (worst team) in 2010 would be the 100th percentile, and so on. It needs to be grouped by year because the number of LgRnk values differs from year to year.
    Team                WLPer   Year LgRnk   W  L
19  Sacramento Kings    0.378   1985    18  31  51
0   Atlanta Hawks       0.415   1985    17  34  48
17  Phoenix Suns        0.439   1985    16  36  46
4   Cleveland Cavaliers 0.439   1985    15  36  46
13  Milwaukee Bucks     0.720   1985    3   59  23
3   Chicago Bulls       0.463   1985    14  38  44
16  Philadelphia 76ers  0.707   1985    4   58  24
22  Washington Wizards  0.488   1985    13  40  42
20  San Antonio Spurs   0.500   1985    12  41  41
21  Utah Jazz           0.500   1985    11  41  41
I've tried creating a function using scipy.stats.percentileofscore, but I can't quite get it to work.
Answered by Andy Hayden
You can do an apply on the LgRnk column:
# just for me to normalize this, so my numbers will go from 0 to 1 in this example
In [11]: df['LgRnk'] = df.groupby('Year').LgRnk.rank()
In [12]: g = df.groupby('Year')
In [13]: g.LgRnk.apply(lambda x: x / len(x))
Out[13]:
19    1.0
0     0.9
17    0.8
4     0.7
13    0.1
3     0.6
16    0.2
22    0.5
20    0.4
21    0.3
Name: 1985, dtype: float64
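For anyone who wants to reproduce the session above, here is a minimal self-contained sketch. It rebuilds only the ten 1985 rows shown in the question (not the asker's full 1985-2012 frame), and the variable names `by_year` and `manual_pct` are made up for illustration. It ranks within each year and divides by the group size, using `transform("count")` in place of the `apply` with a lambda:

```python
import pandas as pd

# The ten 1985 rows shown in the question, with their original index labels.
df = pd.DataFrame(
    {
        "Team": [
            "Sacramento Kings", "Atlanta Hawks", "Phoenix Suns",
            "Cleveland Cavaliers", "Milwaukee Bucks", "Chicago Bulls",
            "Philadelphia 76ers", "Washington Wizards", "San Antonio Spurs",
            "Utah Jazz",
        ],
        "WLPer": [0.378, 0.415, 0.439, 0.439, 0.720, 0.463, 0.707, 0.488, 0.500, 0.500],
        "Year": [1985] * 10,
        "LgRnk": [18, 17, 16, 15, 3, 14, 4, 13, 12, 11],
    },
    index=[19, 0, 17, 4, 13, 3, 16, 22, 20, 21],
)

by_year = df.groupby("Year")["LgRnk"]

# Rank within each year first, then divide by the number of rows in that year.
manual_pct = by_year.rank() / by_year.transform("count")
print(manual_pct)
```

Ranking before dividing matters here: the raw LgRnk values in this sample run up to 18 while the group has only 10 rows.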
The Series groupby rank (which just applies Series.rank) takes a pct argument to do just this:
In [21]: g.LgRnk.rank(pct=True)
Out[21]:
19    1.0
0     0.9
17    0.8
4     0.7
13    0.1
3     0.6
16    0.2
22    0.5
20    0.4
21    0.3
Name: 1985, dtype: float64
and directly on the WLPer column (although this is slightly different due to ties):
In [22]: g.WLPer.rank(pct=True, ascending=False)
Out[22]:
19    1.00
0     0.90
17    0.75
4     0.75
13    0.10
3     0.60
16    0.20
22    0.50
20    0.35
21    0.35
Name: 1985, dtype: float64
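The 0.75 and 0.35 values above come from tied WLPer values sharing the average of their ranks, which is the default behaviour of rank. As a rough sketch (the frame `wl` and the result names are made up for illustration, and only the ten sample rows are used), the `method` argument of rank can be used to break ties differently:

```python
import pandas as pd

# WLPer values from the ten sample rows; 0.439 and 0.500 each appear twice.
wl = pd.DataFrame(
    {
        "Year": [1985] * 10,
        "WLPer": [0.378, 0.415, 0.439, 0.439, 0.720, 0.463, 0.707, 0.488, 0.500, 0.500],
    }
)
g = wl.groupby("Year")["WLPer"]

# Default: tied rows share the average of the ranks they span.
avg_pct = g.rank(pct=True, ascending=False)
# "min": tied rows all take the lowest rank of the tied span.
min_pct = g.rank(pct=True, ascending=False, method="min")
# "max": tied rows all take the highest rank of the tied span.
max_pct = g.rank(pct=True, ascending=False, method="max")

print(pd.DataFrame({"WLPer": wl["WLPer"], "average": avg_pct, "min": min_pct, "max": max_pct}))
```

Which tie rule is appropriate depends on how the league itself breaks ties, which the question does not say.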
Note: I've changed the numbers on the first line, so you'll get different scores on your complete frame.
Answered by user636224
You need to calculate rank within the group before normalizing within the group. The other answers will result in percentiles over 100%. I suggest:
df['percentile'] = df.groupby('Year')['LgRnk'].rank(pct=True)
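To illustrate the point about percentiles over 100%, here is a small sketch on made-up toy data (two seasons of different sizes, with gaps in the league ranks; none of these rows come from the question). Dividing the raw LgRnk by the group size can exceed 1, whereas ranking within the group first keeps every value between 0 and 1, which is then scaled to 0-100 to match the question's wording:

```python
import pandas as pd

# Made-up toy data: a partial 1985 season (3 rows) and a partial 2010 season (4 rows).
toy = pd.DataFrame(
    {
        "Year": [1985, 1985, 1985, 2010, 2010, 2010, 2010],
        "LgRnk": [4, 15, 23, 1, 2, 16, 30],
    }
)
g = toy.groupby("Year")["LgRnk"]

# Raw league rank divided by group size: not bounded by 1 when ranks have gaps.
raw_ratio = toy["LgRnk"] / g.transform("count")

# Rank within each year, as a fraction of the group size, scaled to 0-100.
toy["percentile"] = g.rank(pct=True) * 100

print(raw_ratio)
print(toy)
```

Note that with this approach the best team in a year gets 100/n rather than exactly 1, where n is the number of teams in that year.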

