Note: this page is adapted from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me), citing the original URL and author information. Original question on StackOverflow: http://stackoverflow.com/questions/43145715/

Time: 2020-09-14 03:19:24  Source: igfitidea

How to calculate a percentile ranking of a column of data relative to another column using python

python pandas quantile percentile

Asked by Doodles

I have two columns of data representing the same quantity; one column is from my training data, the other is from my validation data.

I know how to calculate the percentile rankings of the training data efficiently using:

pandas.DataFrame(training_data).rank(pct=True).values

My question is, how can I efficiently get a similar set of percentile rankings of the validation data column relative to the training data column? That is, for each value in the validation data column, how can I find what its percentile ranking would be relative to all the values in the training data column?

I've tried doing this:

import numpy as np
import scipy.stats

def percentrank(input_data, comparison_data):
    rescaled_data = np.zeros(input_data.size)
    for idx, datum in enumerate(input_data):
        # Percentile (0-100) of datum within comparison_data
        rescaled_data[idx] = scipy.stats.percentileofscore(comparison_data, datum)
    return rescaled_data / 100

But I'm not sure if this is even correct, and on top of that it's incredibly slow because it is doing a lot of redundant calculations for each value in the for loop.

Any help would be greatly appreciated!

Answered by B. Shieh

Here's a solution. Sort the training data. Then use searchsorted on the validation data.

import pandas as pd
import numpy as np

# Generate Dummy Data
df_train = pd.DataFrame({'Values': 1000*np.random.rand(15712)})

# Sort Data
df_train = df_train.sort_values('Values')

# Calculating Rank and Rank_Pct for demo purposes,
# but note that they are not needed for the solution.
# The ranking of the validation data below does not depend on them.
df_train['Rank'] = df_train['Values'].rank()
df_train['Rank_Pct'] = df_train['Values'].rank(pct=True)

# Demonstrate how Rank Percentile is calculated
# This gives the same value as .rank(pct=True)
pct_increment = 1./len(df_train)
df_train['Rank_Pct_Manual'] = df_train.Rank*pct_increment

df_train.head()

       Values  Rank  Rank_Pct  Rank_Pct_Manual
2724  0.006174   1.0  0.000064         0.000064
3582  0.016264   2.0  0.000127         0.000127
5534  0.095691   3.0  0.000191         0.000191
944   0.141442   4.0  0.000255         0.000255
7566  0.161766   5.0  0.000318         0.000318

Now use searchsorted to get the Rank_Pct of the validation data:

# Generate Dummy Validation Data
df_validation = pd.DataFrame({'Values': 1000*np.random.rand(1000)})

# Note searchsorted returns array index. 
# In sorted list rank is the same as the array index +1
df_validation['Rank_Pct'] = (1 + df_train.Values.searchsorted(df_validation.Values))*pct_increment

Here are the first few lines of the final df_validation DataFrame:

print(df_validation.head())
      Values  Rank_Pct
0  307.378334  0.304290
1  744.247034  0.744208
2  669.223821  0.670825
3  149.797030  0.145621
4  317.742713  0.314218
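
The sort-then-search steps above can be wrapped in a small helper that works on plain NumPy arrays. This is only a minimal sketch, not part of the original answer: the function name `percentile_rank_against` and its parameter names are made up for illustration, and it keeps the answer's convention of (left insertion index + 1) / len(training data).

```python
import numpy as np


def percentile_rank_against(reference, values):
    """Percentile rank of each entry of `values` relative to `reference`.

    Hypothetical helper (not from the original answer). It follows the
    answer's convention: (left insertion index + 1) / len(reference).
    """
    # Sort the reference (training) data once.
    reference_sorted = np.sort(np.asarray(reference, dtype=float))
    # For each query value, count the reference values strictly below it.
    idx = np.searchsorted(reference_sorted,
                          np.asarray(values, dtype=float),
                          side='left')
    return (idx + 1) / reference_sorted.size


# Tiny made-up example: four training values, three validation values.
train = np.array([10.0, 20.0, 30.0, 40.0])
print(percentile_rank_against(train, [5.0, 20.0, 35.0]))
```

The reference data is sorted once and each lookup is a binary search, so the work per validation value no longer involves rescanning the whole training column as the loop in the question does.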

Answered by user3098048

A small improvement to the nice solution above is to average the positions found by searching from the left and searching from the right:

df_validation['Rank_Pct'] = (0.5 + 0.5*df_train.Values.searchsorted(df_validation.Values, side='left') + 0.5*df_train.Values.searchsorted(df_validation.Values, side='right'))*pct_increment

This change is important in cases where a value occurs multiple times. Consider searching for 2 in [1,2,2,2,4]: searching from the left gives 1, while searching from the right gives 4. The formula above then yields (0.5 + 0.5*1 + 0.5*4)/5 = 0.6, which is the same percentile ranking that the pandas .rank(pct=True) routine assigns to the tied 2s.
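
The tie case can be checked directly. The following is a small sketch (the variable names are illustrative, not from the answer) that applies the averaged formula to every value of [1,2,2,2,4] and compares the result with pandas .rank(pct=True) on the same data.

```python
import numpy as np
import pandas as pd

# Training data with a repeated value; already sorted.
train = np.array([1, 2, 2, 2, 4])
pct_increment = 1.0 / len(train)

# Insertion positions when searching from the left and from the right.
left = np.searchsorted(train, train, side='left')
right = np.searchsorted(train, train, side='right')

# Averaged formula from the answer above.
averaged = (0.5 + 0.5 * left + 0.5 * right) * pct_increment

# Reference: pandas percentile ranks (ties get the average rank by default).
pandas_pct = pd.Series(train).rank(pct=True).to_numpy()

print(averaged)
print(pandas_pct)
print(np.allclose(averaged, pandas_pct))
```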
