How do I get the percentile for a row in a pandas dataframe?
Source: http://stackoverflow.com/questions/50804120/
This page is a translated copy of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it under the same license, but you must attribute it to the original authors on Stack Overflow and cite the original URL.
Asked by bbennett36
Example DataFrame Values -
0 78
1 38
2 42
3 48
4 31
5 89
6 94
7 102
8 122
9 122
stats.percentileofscore(temp['INCOME'].values, 38, kind='mean')
15.0
stats.percentileofscore(temp['INCOME'].values, 38, kind='strict')
10.0
stats.percentileofscore(temp['INCOME'].values, 38, kind='weak')
20.0
stats.percentileofscore(temp['INCOME'].values, 38, kind='rank')
20.0
temp['INCOME'].rank(pct=True)
1 0.20 (Only showing the 38 value index)
temp['INCOME'].quantile(0.11)
37.93
temp['INCOME'].quantile(0.12)
38.31999999999999
Based on the results above, none of these methods is consistent with Series.quantile().
I need to get the percentile of one column for each row in a dataframe (255M rows), but I can't find any function or method that returns results consistent with the 'linear interpolation' method used by Series.quantile() and np.percentile().
I've tried the following methods/functions:
.rank(pct=True)
This method only returns the rank of each value as a fraction of the number of rows. It does not use the percentile method I'm looking for and is inconsistent with Series.quantile().
scipy.stats.percentileofscore
This method comes closer to what I'm looking for, but for some reason it is still not 100% consistent with the 'linear interpolation' method. There is a related question about this problem, but it has no real answer.
I've looked through every SO answer related to this question, but none of them uses the interpolation method I need. Please do not mark this as a duplicate unless you can verify that the other answer uses the same method.
At this point, my last option is either to find the bin cut-offs for all 100 percentiles and apply them, or to calculate the linear interpolation myself, but both seem very inefficient and would take forever to apply to 255M records.
Any other suggestions to do this?
Thanks!
Answer by igrinis
TL;DR
Use
sz = temp['INCOME'].size-1
temp['PCNT_LIN'] = temp['INCOME'].rank(method='max').apply(lambda x: 100.0*(x-1)/sz)
INCOME PCNT_LIN
0 78 44.444444
1 38 11.111111
2 42 22.222222
3 48 33.333333
4 31 0.000000
5 89 55.555556
6 94 66.666667
7 102 77.777778
8 122 100.000000
9 122 100.000000
Answer
It is actually very simple once you understand the mechanics. When you are looking for the percentile of a score, you already have the scores in each row. The only step left is understanding that you need the percentage of values that are less than or equal to the selected value. This is exactly what kind='weak' in scipy.stats.percentileofscore() and method='max' (with pct=True) in Series.rank() compute. To invert it, run Series.quantile() with interpolation='lower'.
So the behavior of scipy.stats.percentileofscore(), Series.rank() and Series.quantile() is consistent, as shown below:
In[]:
import pandas as pd
import scipy.stats
temp = pd.DataFrame([78, 38, 42, 48, 31, 89, 94, 102, 122, 122], columns=['INCOME'])
temp['PCNT_RANK']=temp['INCOME'].rank(method='max', pct=True)
temp['POF'] = temp['INCOME'].apply(lambda x: scipy.stats.percentileofscore(temp['INCOME'], x, kind='weak'))
temp['QUANTILE_VALUE'] = temp['PCNT_RANK'].apply(lambda x: temp['INCOME'].quantile(x, 'lower'))
temp['RANK']=temp['INCOME'].rank(method='max')
sz = temp['RANK'].size - 1
temp['PCNT_LIN'] = temp['RANK'].apply(lambda x: (x-1)/sz)
temp['CHK'] = temp['PCNT_LIN'].apply(lambda x: temp['INCOME'].quantile(x))
temp
Out[]:
INCOME PCNT_RANK POF QUANTILE_VALUE RANK PCNT_LIN CHK
0 78 0.5 50.0 78 5.0 0.444444 78.0
1 38 0.2 20.0 38 2.0 0.111111 38.0
2 42 0.3 30.0 42 3.0 0.222222 42.0
3 48 0.4 40.0 48 4.0 0.333333 48.0
4 31 0.1 10.0 31 1.0 0.000000 31.0
5 89 0.6 60.0 89 6.0 0.555556 89.0
6 94 0.7 70.0 94 7.0 0.666667 94.0
7 102 0.8 80.0 102 8.0 0.777778 102.0
8 122 1.0 100.0 122 10.0 1.000000 122.0
9 122 1.0 100.0 122 10.0 1.000000 122.0
Now the column PCNT_RANK holds the fraction of values that are less than or equal to the value in the column INCOME. If you want the "interpolated" fraction instead, it is in the column PCNT_LIN. And because the calculation uses Series.rank(), it is pretty fast and will crunch your 255M numbers in seconds.
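As a rough sketch of the same idea without a per-row `apply`, the rank can be rescaled with plain Series arithmetic, which should matter at 255M rows. The variable names `income`, `pcnt_lin` and `recovered` are illustrative only and do not come from the answer.

```python
import numpy as np
import pandas as pd

income = pd.Series([78, 38, 42, 48, 31, 89, 94, 102, 122, 122], name="INCOME")

# Vectorized form of (rank - 1) / (n - 1); no Python call per row.
pcnt_lin = (income.rank(method="max") - 1) / (len(income) - 1)

# Feeding the fractions back into quantile() (linear interpolation by default)
# should recover the original values, up to floating-point error.
recovered = income.quantile(pcnt_lin.to_numpy()).to_numpy()
print(np.allclose(recovered, income.to_numpy()))
```

Multiply `pcnt_lin` by 100 if percentages are wanted, as in the TL;DR snippet.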
Here I will explain how quantile() with linear interpolation arrives at its value:
temp['INCOME'].quantile(0.11)
37.93
Our data temp['INCOME'] has only ten values. According to the formula from the Wiki page you linked, the rank of the 11th percentile is
rank = 11*(10-1)/100 + 1 = 1.99
The integer (truncated) part of the rank is 1, which corresponds to the value 31, and the value with rank 2 (i.e. the next bin) is 38. The fraction is the fractional part of the rank, here 0.99. This leads to the result:
31 + (38-31)*(0.99) = 37.93
For the values themselves, the fraction part has to be zero, so it is very easy to do the inverse calculation to get the percentile:
p = (rank - 1)*100/(10 - 1)
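The two formulas above can be checked by hand with a small sketch. The helper names `linear_quantile` and `percentile_of_rank` are made up for illustration; they simply spell out the rank arithmetic described here and compare the result with np.percentile.

```python
import math
import numpy as np

values = np.sort(np.array([78, 38, 42, 48, 31, 89, 94, 102, 122, 122], dtype=float))
n = len(values)

def linear_quantile(p):
    """Value at percentile p (0-100), using the 1-based rank p*(n-1)/100 + 1."""
    rank = p * (n - 1) / 100 + 1
    lo = math.floor(rank)      # integer part of the rank
    frac = rank - lo           # fractional part of the rank
    hi = min(lo + 1, n)        # next rank, clamped at the last value
    return values[lo - 1] + (values[hi - 1] - values[lo - 1]) * frac

def percentile_of_rank(rank):
    """Inverse for a value that sits exactly on a rank (fraction is zero)."""
    return (rank - 1) * 100 / (n - 1)

print(linear_quantile(11))      # 31 + (38 - 31) * 0.99
print(percentile_of_rank(2))    # percentile of the value 38
```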
I hope this makes it clearer.
Answer by Paul Panzer
This seems to work:
A = np.sort(temp['INCOME'].values)
np.interp(sample, A, np.linspace(0, 1, len(A)))
For example:
>>> temp.INCOME.quantile(np.interp([37.5, 38, 122, 121], A, np.linspace(0, 1, len(A))))
0.103175 37.5
0.111111 38.0
1.000000 122.0
0.883333 121.0
Name: INCOME, dtype: float64
Please note that this strategy only makes sense if you want to query a large enough number of values. Otherwise the sorting is too expensive.
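A self-contained version of the snippet above might look like the following sketch. The query values in `sample` are hypothetical, and the round trip through quantile() is only there to check that the interpolated fractions agree with pandas' default linear interpolation.

```python
import numpy as np
import pandas as pd

temp = pd.DataFrame({"INCOME": [78, 38, 42, 48, 31, 89, 94, 102, 122, 122]})

# Sort once; each sorted value is paired with an evenly spaced quantile in [0, 1].
A = np.sort(temp["INCOME"].to_numpy())
grid = np.linspace(0, 1, len(A))

sample = np.array([37.5, 38, 121])    # hypothetical query values
q = np.interp(sample, A, grid)        # interpolated quantile of each query

# Round trip: quantile() with linear interpolation should return the queries.
back = temp["INCOME"].quantile(q).to_numpy()
print(q)
print(np.allclose(back, sample))
```

The sort is done once up front, so the cost is amortized when `sample` is large, for example the whole INCOME column.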
Answer by Randel Rodrigues
Let's consider the dataframe below:
To get the percentage of each value in a column of a pandas DataFrame (as a fraction of the total), we use the following code:
survey['Nationality'].value_counts(normalize=True)
Output:
USA 0.333333
China 0.250000
India 0.250000
Bangadesh 0.166667
Name: Nationality, dtype: float64
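As a minimal sketch of this call on made-up data (the `survey` frame here is hypothetical and much smaller than the one in the answer), value_counts with normalize=True returns each value's share of the rows:

```python
import pandas as pd

# Hypothetical survey data, not the dataframe from the answer.
survey = pd.DataFrame({
    "Nationality": ["USA", "USA", "China", "India"],
})

# Share of rows for each nationality; the shares sum to 1.
shares = survey["Nationality"].value_counts(normalize=True)
print(shares)
```

Note that these are relative frequencies of categories, which is a different quantity from the percentile rank discussed in the other answers.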
To get the percentage breakdown of one column with respect to another categorical column:
pd.crosstab(survey.Sex,survey.Handedness,normalize = 'index')
The output would look something like the following:
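A small sketch with made-up data can show the shape of the result. The `survey` frame below is hypothetical; with normalize='index', each row of the table is divided by its row total, so every row sums to 1.

```python
import pandas as pd

# Hypothetical survey data, not the dataframe from the answer.
survey = pd.DataFrame({
    "Sex": ["Male", "Male", "Female", "Female", "Female"],
    "Handedness": ["Right", "Left", "Right", "Right", "Left"],
})

# Row-wise proportions: within each Sex, the share of each Handedness.
table = pd.crosstab(survey["Sex"], survey["Handedness"], normalize="index")
print(table)
```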