pandas: calculate the percentile of a value in a column
Note: this page is adapted from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/44824927/
Calculate percentile of value in column
Asked by Bluefire
I have a dataframe with a column that has numerical values. This column is not well-approximated by a normal distribution. Given another numerical value, not in this column, how can I calculate its percentile in the column? That is, if the value is greater than 80% of the values in the column but less than the other 20%, it would be in the 20th percentile.
Accepted answer by Binyamin Even
Sort the column and check whether the value falls above the cutoff for the first 20% (or whatever percentile you choose). For example:
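def in_percentile(my_series, val, perc=0.2):
    # Sort the column values in ascending order.
    my_list = sorted(my_series.values.tolist())
    n = len(my_list)
    # Clamp the index so that perc=1.0 does not run past the end of the list.
    idx = min(int(n * perc), n - 1)
    # True if val is greater than the value at the perc cutoff.
    return val > my_list[idx]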
Or, if you want the actual percentile, simply use searchsorted on the sorted values (searchsorted assumes its input is already sorted):
my_series.sort_values().values.searchsorted(val) / len(my_series) * 100
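As a rough sketch of the searchsorted approach, the helper below sorts the column first and then counts how many entries fall before the value. The function name percentile_of and the sample data are hypothetical, made up purely for illustration.

```python
import numpy as np
import pandas as pd


def percentile_of(series, val, side="left"):
    """Percentage of entries in `series` that sort before `val` (hypothetical helper).

    side="left" counts values strictly less than `val`;
    side="right" counts values less than or equal to `val`.
    """
    sorted_vals = np.sort(series.to_numpy())
    # Number of entries that would sit before `val` in sorted order.
    count = np.searchsorted(sorted_vals, val, side=side)
    return 100.0 * count / len(sorted_vals)


# Made-up, deliberately unsorted sample column.
s = pd.Series([5, 1, 4, 2, 3])
print(percentile_of(s, 3.5))              # value not present in the column
print(percentile_of(s, 4, side="left"))   # share strictly below 4
print(percentile_of(s, 4, side="right"))  # share at or below 4
```

The side argument only matters when the value ties with entries already in the column; for a value that does not appear in the column, both settings give the same result.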
Answer by wingr
To find the percentile of a value relative to an array (or in your case a dataframe column), use the scipy function stats.percentileofscore().
For example, if we have a value x (the other numerical value not in the dataframe), and a reference array, arr (the column from the dataframe), we can find the percentile of x by:
from scipy import stats
percentile = stats.percentileofscore(arr, x)
Note that there is a third parameter to the stats.percentileofscore() function that has a significant impact on the resulting value of the percentile, namely kind. You can choose from rank, weak, strict, and mean. See the docs for more information.
For an example of the difference:
>>> df
   a
0  1
1  2
2  3
3  4
4  5
>>> stats.percentileofscore(df['a'], 4, kind='rank')
80.0
>>> stats.percentileofscore(df['a'], 4, kind='weak')
80.0
>>> stats.percentileofscore(df['a'], 4, kind='strict')
60.0
>>> stats.percentileofscore(df['a'], 4, kind='mean')
70.0
As a final note, if you have a value that is greater than 80% of the other values in the column, it would be in the 80th percentile, not the 20th percentile (see the example above for how the kind parameter affects this final score somewhat). See this Wikipedia article for more information.
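To make the difference between the kind options more concrete, here is a minimal pandas-only sketch of what strict, weak, and mean count, using the same five values as the example above. It is an illustration of the definitions, not scipy's actual implementation, and the variable names are chosen just for this sketch.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])  # same values as column 'a' in the example above
x = 4

# strict: share of values strictly below x.
strict = 100.0 * (s < x).sum() / len(s)
# weak: share of values at or below x.
weak = 100.0 * (s <= x).sum() / len(s)
# mean: average of the strict and weak scores.
mean = (strict + weak) / 2

print(strict, weak, mean)
```

When x does not appear in the column at all, strict and weak count the same entries, so every kind gives the same percentile.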
Answer by Greg Poppe
Since you're looking for values over/under a specific threshold, you could consider using the pandas qcut function. If you want the values under 20% and over 80%, divide your data into five equal-sized partitions. Each partition represents a 20% "chunk" of the data (five 20% partitions make 100%). So, given a DataFrame with one column 'a', which represents the column you have data for:
df['newcol'] = pd.qcut(df['a'], 5, labels=False)
This adds a new column to your DataFrame in which each row has a value in (0, 1, 2, 3, 4), where 0 represents your lowest 20% and 4 represents your highest 20%, i.e. the values above the 80th percentile.
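Since the question is about a value that is not in the column, one possible way to reuse the qcut partitions is to ask qcut for its bin edges (retbins=True) and then place the outside value into those same bins with pd.cut. This is only a sketch under assumptions: the sample data (the integers 1 to 100) and the names edges, x, and x_bin are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample column: the integers 1..100.
df = pd.DataFrame({"a": np.arange(1, 101)})

# retbins=True also returns the bin edges that qcut computed.
df["newcol"], edges = pd.qcut(df["a"], 5, labels=False, retbins=True)

# Place a value that is not in the column into the same bins.
# Values outside [edges[0], edges[-1]] come back as NaN.
x = 85.5
x_bin = pd.cut(np.array([x]), bins=edges, labels=False, include_lowest=True)[0]
print(x_bin)
```

This only tells you which 20% chunk the value falls into, not its exact percentile.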
Answer by Amit Gupta
Probably very late, but still:
df['column_name'].describe()
will give you the regular 25th, 50th and 75th percentiles along with some additional summary statistics. If you specifically want other percentiles, use:
df['column_name'].describe(percentiles=[0.1, 0.2, 0.3, 0.5])
This will give you the 10th, 20th, 30th and 50th percentiles. You can pass as many values as you want.
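Note that describe works in the opposite direction from the original question: it reports the column value at each requested percentile rather than the percentile of a given value. A small sketch, with made-up sample data, showing describe next to Series.quantile, which returns just the cutoffs without the other summary statistics:

```python
import numpy as np
import pandas as pd

# Made-up sample column: the integers 1..100.
s = pd.Series(np.arange(1, 101))

# Full summary, including the requested percentiles.
summary = s.describe(percentiles=[0.1, 0.2, 0.3, 0.5])
print(summary)

# Only the cutoff values at the requested percentiles.
cutoffs = s.quantile([0.1, 0.2, 0.3, 0.5])
print(cutoffs)
```

Comparing a new value against these cutoffs tells you which percentile band it falls into.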