How do I get the percentile for a row in a pandas DataFrame?

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/50804120/

Time: 2020-09-14 05:41:12  Source: igfitidea

How do I get the percentile for a row in a pandas dataframe?

Tags: python, pandas, numpy, scipy, percentile

Asked by bbennett36

Example DataFrame Values -  

0     78
1     38
2     42
3     48
4     31
5     89
6     94
7    102
8    122
9    122  

stats.percentileofscore(temp['INCOME'].values, 38, kind='mean')
15.0

stats.percentileofscore(temp['INCOME'].values, 38, kind='strict')
10.0

stats.percentileofscore(temp['INCOME'].values, 38, kind='weak')
20.0

stats.percentileofscore(temp['INCOME'].values, 38, kind='rank')
20.0

temp['INCOME'].rank(pct=True)
1    0.20 (Only showing the 38 value index)

temp['INCOME'].quantile(0.11)
37.93

temp['INCOME'].quantile(0.12)
38.31999999999999

Based on the results above, you can see that none of these methods is consistent
with the Series.quantile() method.

I need to get the percentile of one column for each row in a dataframe (255M rows), but I can't find any function or method that uses the 'linear interpolation' method used by Series.quantile() and np.percentile().

I've tried the following methods/functions -

.rank(pct=True)

This method only returns the values ranked in order; it does not use the percentile method I'm looking for, so it is inconsistent with Series.quantile().

scipy.stats.percentileofscore  

This method is closer to what I'm looking for, but for some reason it is still not 100% consistent with the 'linear interpolation' method. There is a related question about this problem with no real answer.

I've looked through every SO answer related to this question, but none of them uses the interpolation method I need, so please do not mark this as a duplicate unless you can verify they use the same method.

At this point my last option is to find the bin cut-offs for all 100 percentiles and apply them that way, or to calculate the linear interpolation myself, but this seems very inefficient and will take forever to apply to 255M records.

Any other suggestions to do this?

Thanks!

Answer by igrinis

TL;DR

Use

sz = temp['INCOME'].size-1
temp['PCNT_LIN'] = temp['INCOME'].rank(method='max').apply(lambda x: 100.0*(x-1)/sz)

   INCOME    PCNT_LIN
0      78   44.444444
1      38   11.111111
2      42   22.222222
3      48   33.333333
4      31    0.000000
5      89   55.555556
6      94   66.666667
7     102   77.777778
8     122  100.000000
9     122  100.000000

Answer

It is actually very simple once you understand the mechanics. When you are looking for the percentile of a score, you already have the scores in each row. The only step left is understanding that you need the percentage of numbers that are less than or equal to the selected value. This is exactly what the parameters kind='weak' of scipy.stats.percentileofscore() and method='max' (with pct=True) of Series.rank() do. To invert it, run Series.quantile() with interpolation='lower'.

So the behavior of scipy.stats.percentileofscore(), Series.rank() and Series.quantile() is consistent; see below:

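In[]:
import pandas as pd
import scipy.stats

temp = pd.DataFrame([78, 38, 42, 48, 31, 89, 94, 102, 122, 122], columns=['INCOME'])
# Fraction of values <= each value (ties get the highest rank in their group)
temp['PCNT_RANK'] = temp['INCOME'].rank(method='max', pct=True)
# Same quantity from SciPy, expressed in percent
temp['POF'] = temp['INCOME'].apply(lambda x: scipy.stats.percentileofscore(temp['INCOME'], x, kind='weak'))
# Inverse of PCNT_RANK: quantile with 'lower' interpolation returns the original value
temp['QUANTILE_VALUE'] = temp['PCNT_RANK'].apply(lambda x: temp['INCOME'].quantile(x, interpolation='lower'))
temp['RANK'] = temp['INCOME'].rank(method='max')
sz = temp['RANK'].size - 1
# Position on the 0..1 scale used by linear interpolation
temp['PCNT_LIN'] = temp['RANK'].apply(lambda x: (x - 1) / sz)
# Check: the default (linear) quantile maps PCNT_LIN back to INCOME
temp['CHK'] = temp['PCNT_LIN'].apply(lambda x: temp['INCOME'].quantile(x))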

temp

Out[]:
   INCOME  PCNT_RANK    POF  QUANTILE_VALUE  RANK  PCNT_LIN    CHK
0      78        0.5   50.0              78   5.0  0.444444   78.0
1      38        0.2   20.0              38   2.0  0.111111   38.0
2      42        0.3   30.0              42   3.0  0.222222   42.0
3      48        0.4   40.0              48   4.0  0.333333   48.0
4      31        0.1   10.0              31   1.0  0.000000   31.0
5      89        0.6   60.0              89   6.0  0.555556   89.0
6      94        0.7   70.0              94   7.0  0.666667   94.0
7     102        0.8   80.0             102   8.0  0.777778  102.0
8     122        1.0  100.0             122  10.0  1.000000  122.0
9     122        1.0  100.0             122  10.0  1.000000  122.0

Now the column PCNT_RANK gives you the ratio of values that are smaller than or equal to the one in the column INCOME. But if you want the "interpolated" ratio, it is in the column PCNT_LIN. And because you use Series.rank() for the calculation, it is pretty fast and will crunch your 255M numbers in seconds.
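As a rough sketch of the same idea without the per-row lambda, the PCNT_LIN column can be written as plain Series arithmetic, and the result can be checked by feeding it back into quantile(). The variable names `pcnt_lin` and `round_trip` are made up for this illustration; the data is the ten-value example from the question.

```python
import numpy as np
import pandas as pd

temp = pd.DataFrame({"INCOME": [78, 38, 42, 48, 31, 89, 94, 102, 122, 122]})
n = len(temp)

# Vectorized form of the PCNT_LIN column: Series arithmetic instead of apply().
pcnt_lin = (temp["INCOME"].rank(method="max") - 1) / (n - 1)

# Round trip: quantile() uses linear interpolation by default, so passing the
# fractions back in should reproduce the original values up to rounding.
round_trip = temp["INCOME"].quantile(pcnt_lin.to_numpy()).to_numpy()
print(np.allclose(round_trip, temp["INCOME"].to_numpy()))
```

Avoiding apply() should matter most on very large columns, since the arithmetic then runs on whole arrays rather than one Python call per row.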



Here I will explain how you get the value from using quantile() with linear interpolation:

temp['INCOME'].quantile(0.11)
37.93

Our data temp['INCOME'] has only ten values. According to the formula from the Wiki page you linked, the rank of the 11th percentile is

rank = 11*(10-1)/100 + 1 = 1.99

The truncated part of the rank is 1, which corresponds to the value 31, and the value with rank 2 (i.e. the next bin) is 38. fraction is the fractional part of the rank (0.99 here). This leads to the result:

 31 + (38-31)*(0.99) = 37.93

For the values themselves, the fraction part has to be zero, so it is very easy to do the inverse calculation to get the percentile:

p = (rank - 1)*100/(10 - 1)
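The arithmetic above can be sketched in plain Python. The helper names `quantile_linear` and `percentile_of_rank` are invented for this illustration and are not part of pandas or NumPy; the code only mirrors the rank formula and its inverse as written in this answer.

```python
import math

def quantile_linear(sorted_values, p):
    """Value at percentile p (0-100), using rank = p*(n-1)/100 + 1."""
    n = len(sorted_values)
    rank = p * (n - 1) / 100 + 1      # 1-based rank, possibly fractional
    lower = math.floor(rank)          # truncated part of the rank
    fraction = rank - lower           # fractional part of the rank
    if lower >= n:                    # p == 100: there is no next value
        return float(sorted_values[-1])
    lo, hi = sorted_values[lower - 1], sorted_values[lower]
    return lo + (hi - lo) * fraction

def percentile_of_rank(rank, n):
    """Inverse for a value sitting exactly at a 1-based rank (fraction == 0)."""
    return (rank - 1) * 100 / (n - 1)

values = sorted([78, 38, 42, 48, 31, 89, 94, 102, 122, 122])
print(quantile_linear(values, 11))         # approximately 37.93
print(percentile_of_rank(2, len(values)))  # approximately 11.11, the percentile of 38
```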

I hope I made it more clear.

Answer by Paul Panzer

This seems to work:

A = np.sort(temp['INCOME'].values)
np.interp(sample, A, np.linspace(0, 1, len(A)))

For example:

>>> temp.INCOME.quantile(np.interp([37.5, 38, 122, 121], A, np.linspace(0, 1, len(A))))
0.103175     37.5
0.111111     38.0
1.000000    122.0
0.883333    121.0
Name: INCOME, dtype: float64

Please note that this strategy only makes sense if you want to query a large enough number of values. Otherwise the sorting is too expensive.
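A minimal self-contained sketch of this approach, wrapped in a function so the sort happens once per column. The name `percentile_of_values` is hypothetical, and the query values are taken from the example above. Note that np.interp expects increasing x-coordinates, so how tied values (the two 122s here) are handled is left to NumPy; the sketch therefore queries only untied values.

```python
import numpy as np
import pandas as pd

def percentile_of_values(column, queries):
    """Fractions in [0, 1] that column.quantile() maps back to the queries."""
    sorted_vals = np.sort(column.to_numpy())
    # The i-th sorted value sits at fraction i / (n - 1) under linear interpolation.
    fractions = np.linspace(0, 1, len(sorted_vals))
    return np.interp(queries, sorted_vals, fractions)

temp = pd.DataFrame({"INCOME": [78, 38, 42, 48, 31, 89, 94, 102, 122, 122]})
pct = percentile_of_values(temp["INCOME"], [37.5, 38, 121])
print(pct)
# Feeding the fractions back into quantile() should return the queried values.
print(temp["INCOME"].quantile(pct).to_numpy())
```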

Answer by Randel Rodrigues

Let's consider the below dataframe:

[Image: DataFrame]

In order to get the percentile of a column in pandas Dataframe we use the following code:

survey['Nationality'].value_counts(normalize=True)

Output:

USA          0.333333
China        0.250000
India        0.250000
Bangadesh    0.166667
Name: Nationality, dtype: float64

In order to get the percentile of a column in a pandas DataFrame with respect to another categorical column:

pd.crosstab(survey.Sex,survey.Handedness,normalize = 'index')

The output would look something like this:

[Image: Output]
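Since the original `survey` data is only shown as an image, here is a small runnable sketch with a made-up four-row stand-in frame; the column names follow the answer, while the rows are invented for illustration. Both calls return proportions per category (each set summing to 1) rather than rank-based percentiles.

```python
import pandas as pd

# Hypothetical stand-in for the `survey` frame; these rows are invented.
survey = pd.DataFrame({
    "Nationality": ["USA", "USA", "China", "India"],
    "Sex": ["Female", "Male", "Female", "Male"],
    "Handedness": ["Right", "Left", "Right", "Right"],
})

# Share of rows per nationality.
shares = survey["Nationality"].value_counts(normalize=True)
print(shares)

# Row-normalized cross-tabulation: each Sex row sums to 1 across Handedness.
by_sex = pd.crosstab(survey["Sex"], survey["Handedness"], normalize="index")
print(by_sex)
```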