Note: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/39567712/

Time: 2020-09-14 02:02:13  Source: igfitidea

Python Pandas - how is 25 percentile calculated by describe function

python pandas percentile

Asked by Gublooo

For a given dataset in a data frame, when I apply the describe function, I get the basic stats, which include min, max, 25%, 50%, etc.

For example:

import pandas as pd
data_1 = pd.DataFrame({'One': [4, 6, 8, 10]}, columns=['One'])
data_1.describe()

The output is:

             One
count   4.000000
mean    7.000000
std     2.581989
min     4.000000
25%     5.500000
50%     7.000000
75%     8.500000
max    10.000000

My question is: What is the mathematical formula to calculate the 25%?

1) Based on what I know, it is:

formula = percentile * n (n is number of values)

In this case:

25/100 * 4 = 1

So the first position holds the number 4, but according to the describe function the result is 5.5.

2) Another example says that if you get a whole number, you take the average of 4 and 6, which would be 5. That still does not match the 5.5 given by describe.

3) Another tutorial says - you take the difference between the 2 numbers - multiply by 25% and add to the lower number:

25/100 * (6-4) = 1/4*2 = 0.5

Adding that to the lower number: 4 + 0.5 = 4.5

Still not getting 5.5.

Can someone please clarify?

Accepted answer by Nikolas Rieble

In the pandas documentation there is information about the computation of quantiles, where a reference to numpy.percentile is made:

Return value at the given quantile, a la numpy.percentile.

Then, checking the numpy.percentile explanation, we can see that the interpolation method is set to linear by default:

linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j

For your specific case, the 25th percentile results from:

res_25 = 4 + (6-4)*(3/4) =  5.5

For the 75th percentile we then get:

res_75 = 8 + (10-8)*(1/4) = 8.5
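
The two results above can be reproduced by hand. The following is a minimal sketch, assuming the fractional zero-based index is located at q * (n - 1), which is how the default linear method places it; the helper name linear_percentile is hypothetical and not part of pandas or NumPy.

```python
import math

import numpy as np
import pandas as pd


def linear_percentile(values, q):
    """Hypothetical helper: percentile with linear interpolation, q in [0, 1]."""
    data = sorted(values)
    pos = q * (len(data) - 1)              # fractional zero-based index
    lower = math.floor(pos)                # index of i, the value below pos
    upper = min(lower + 1, len(data) - 1)  # index of j, the value above pos
    fraction = pos - lower                 # fractional part of the index
    return data[lower] + (data[upper] - data[lower]) * fraction


values = [4, 6, 8, 10]
res_25 = linear_percentile(values, 0.25)  # pos = 0.75, between 4 and 6
res_75 = linear_percentile(values, 0.75)  # pos = 2.25, between 8 and 10
print(res_25, res_75)

# Compare against the library results for the same data.
print(np.percentile(values, 25), pd.Series(values).quantile(0.25))
```

For 25% the index is 0.25 * 3 = 0.75, so i = 4, j = 6 and the fraction is 0.75, which is where the 3/4 in the formula comes from.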

If you set the interpolation method to "midpoint", then you will get 5, the result from your second approach.
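
To see the effect of the interpolation setting, here is a small sketch that assumes the data from the question and uses the interpolation keyword of Series.quantile; the variable names are only illustrative.

```python
import pandas as pd

data_1 = pd.DataFrame({'One': [4, 6, 8, 10]})

# Default: linear interpolation between the two neighbouring values.
linear_25 = data_1['One'].quantile(0.25)

# Midpoint: the average of the two neighbouring values (4 and 6).
midpoint_25 = data_1['One'].quantile(0.25, interpolation='midpoint')

# Lower: the neighbouring value below the fractional index.
lower_25 = data_1['One'].quantile(0.25, interpolation='lower')

print(linear_25, midpoint_25, lower_25)
```

The default matches the 5.5 reported by describe, while 'midpoint' and 'lower' give the 5 and 4 that the question arrived at.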

Answer by orli Zhu

I think it's easier to understand by seeing this calculation as min+(max-min)*percentile. For this dataset it gives the same result as the function described in NumPy, although only because the values are evenly spaced:

linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j

res_25 = 4+(10-4)*percentile = 4+(10-4)*25% = 5.5
res_75 = 4+(10-4)*percentile = 4+(10-4)*75% = 8.5
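
To check how far this shortcut goes, the sketch below compares it with pandas on the data from the question and on an unevenly spaced list. The helper min_max_shortcut and the second list are made up for illustration only.

```python
import pandas as pd


def min_max_shortcut(values, q):
    """Hypothetical helper: min + (max - min) * q, with q in [0, 1]."""
    return min(values) + (max(values) - min(values)) * q


even = [4, 6, 8, 10]     # data from the question, evenly spaced
uneven = [1, 2, 3, 100]  # illustrative data with one large gap

even_shortcut = min_max_shortcut(even, 0.25)
even_pandas = pd.Series(even).quantile(0.25)

uneven_shortcut = min_max_shortcut(uneven, 0.25)
uneven_pandas = pd.Series(uneven).quantile(0.25)

print(even_shortcut, even_pandas)      # the two agree on evenly spaced data
print(uneven_shortcut, uneven_pandas)  # the two differ once spacing is uneven
```

On the uneven list the shortcut is pulled toward the maximum, while the linear rule only interpolates between the two values next to the fractional index.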