Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/45926230/

Date: 2020-08-19 17:21:26  Source: igfitidea

How to calculate 1st and 3rd quartiles?

Tags: python, python-2.7, pandas, numpy

Asked by Dinosaurius

I have DataFrame:

    time_diff   avg_trips
0   0.450000    1.0
1   0.483333    1.0
2   0.500000    1.0
3   0.516667    1.0
4   0.533333    2.0

I want to get the 1st quartile, 3rd quartile and median of the column time_diff. To obtain the median, I use np.median(df["time_diff"].values).

How can I calculate quartiles?

Answered by MSeifert

You can use np.percentile to calculate quartiles (including the median):

>>> np.percentile(df.time_diff, 25)  # Q1
0.48333300000000001

>>> np.percentile(df.time_diff, 50)  # median
0.5

>>> np.percentile(df.time_diff, 75)  # Q3
0.51666699999999999

Or all at once:

>>> np.percentile(df.time_diff, [25, 50, 75])
array([ 0.483333,  0.5     ,  0.516667])

Answered by YOBEN_S

By using pandas:

df.time_diff.quantile([0.25,0.5,0.75])


Out[793]: 
0.25    0.483333
0.50    0.500000
0.75    0.516667
Name: time_diff, dtype: float64

Answered by piRSquared

Coincidentally, this information is captured by the describe method:

df.time_diff.describe()

count    5.000000
mean     0.496667
std      0.032059
min      0.450000
25%      0.483333
50%      0.500000
75%      0.516667
max      0.533333
Name: time_diff, dtype: float64

Answered by Cyrus

np.percentile DOES NOT calculate the values of Q1, median, and Q3. Consider the sorted list below:

samples = [1, 1, 8, 12, 13, 13, 14, 16, 19, 22, 27, 28, 31]

Running np.percentile(samples, [25, 50, 75]) returns the actual values from the list:

Out[1]: array([12., 14., 22.])

However, the quartiles are Q1=10.0, Median=14, Q3=24.5 (you can also use this link to find the quartiles and median online). One can use the code below to calculate the quartiles and median of a sorted list (because of the sorting, this approach requires O(n log n) computations, where n is the number of items). Moreover, the quartiles and median can be found in O(n) computations using the median of medians selection algorithm (order statistics).

samples = sorted([28, 12, 8, 27, 16, 31, 14, 13, 19, 1, 1, 22, 13])

def find_median(sorted_list):
    """Return the median of a sorted list and the index (or two indices) it was taken from."""
    indices = []
    list_size = len(sorted_list)

    if list_size % 2 == 0:
        # even length: average the two middle items
        indices.append(list_size // 2 - 1)  # -1 because index starts from 0
        indices.append(list_size // 2)
        median = (sorted_list[indices[0]] + sorted_list[indices[1]]) / 2
    else:
        # odd length: take the single middle item
        indices.append(list_size // 2)
        median = sorted_list[indices[0]]

    return median, indices

median, median_indices = find_median(samples)
# lower and upper halves; the middle item of an odd-length list belongs to neither half,
# while an even-length list is split into two equal halves
Q1, Q1_indices = find_median(samples[:median_indices[-1]])
Q3, Q3_indices = find_median(samples[median_indices[0] + 1:])

quartiles = [Q1, median, Q3]

print("(Q1, median, Q3): {}".format(quartiles))
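
Both sets of numbers for this list follow a common quartile convention; they simply differ in how the cut points are placed. As a rough illustration (not part of the original answer, and assuming Python 3.8 or later), the standard library's statistics.quantiles can produce both results for this sample through its method argument:

```python
import statistics

samples = sorted([28, 12, 8, 27, 16, 31, 14, 13, 19, 1, 1, 22, 13])

# "inclusive" treats the data as containing the population minimum and maximum;
# for this list it gives the same values as np.percentile's default.
inclusive = statistics.quantiles(samples, n=4, method="inclusive")

# "exclusive" (the default) treats the data as a sample from a larger population;
# for this list it gives the same values as the median-of-halves code above.
exclusive = statistics.quantiles(samples, n=4, method="exclusive")

print(inclusive)  # [12.0, 14.0, 22.0]
print(exclusive)  # [10.0, 14.0, 24.5]
```

Which convention is "correct" depends on the textbook or tool you need to match, so it is worth checking the results against a small hand-computed example first.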

Answered by Stian Ulriksen

Using np.percentile.

q75, q25 = np.percentile(DataFrame, [75,25])
iqr = q75 - q25

Taken from the answer to "How do you find the IQR in Numpy?"

Answered by Shikhar Parashar

Building upon, or rather correcting a bit, what Cyrus said...

np.percentile DOES VERY MUCH calculate the values of Q1, median, and Q3. Consider the sorted list below:

s1=[18,45,66,70,76,83,88,90,90,95,95,98]

Running np.percentile(s1, [25, 50, 75]) returns linearly interpolated values rather than values taken from the list:

[69.   85.5  91.25]

However, the quartiles are Q1=68.0, Median=85.5, Q3=92.5, which is the correct thing to say.

What we are missing here is the interpolation parameter of np.percentile and related functions. By default the value of this argument is linear. This optional parameter specifies the interpolation method to use when the desired quantile lies between two data points i < j:
linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.
lower: i.
higher: j.
nearest: i or j, whichever is nearest.
midpoint: (i + j) / 2.

Thus running np.percentile(s1, [25, 50, 75], interpolation='midpoint') returns the actual results for the list:

[68.  85.5 92.5]
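
To make the five interpolation rules listed above concrete, here is a minimal pure-Python sketch of them. The function name percentile is hypothetical, this is not NumPy's implementation, and the tie-breaking for nearest is a simplification. Note also that in NumPy 1.22 and later the keyword is called method, with interpolation kept as a deprecated alias.

```python
import math

def percentile(sorted_data, q, interpolation="linear"):
    """Return the q-th percentile (0..100) of an already sorted list.

    Illustrative sketch of the five rules; not NumPy's actual code.
    """
    pos = (len(sorted_data) - 1) * q / 100  # fractional index into the data
    lo, hi = math.floor(pos), math.ceil(pos)
    i, j = sorted_data[lo], sorted_data[hi]  # the two surrounding data points
    fraction = pos - lo
    if interpolation == "linear":
        return i + (j - i) * fraction
    if interpolation == "lower":
        return i
    if interpolation == "higher":
        return j
    if interpolation == "nearest":
        # simplification: an exact tie (fraction == 0.5) goes to the upper point
        return i if fraction < 0.5 else j
    if interpolation == "midpoint":
        return (i + j) / 2
    raise ValueError("unknown interpolation: {}".format(interpolation))

s1 = [18, 45, 66, 70, 76, 83, 88, 90, 90, 95, 95, 98]

for rule in ("linear", "lower", "higher", "midpoint"):
    print(rule, [percentile(s1, q, rule) for q in (25, 50, 75)])
# linear [69.0, 85.5, 91.25]
# lower [66, 83, 90]
# higher [70, 88, 95]
# midpoint [68.0, 85.5, 92.5]
```

The 25th percentile falls at index 2.75, between 66 and 70, which is why linear gives 69 while midpoint gives 68.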

Answered by Ian Jones

In my efforts to learn object-oriented programming alongside statistics, I made this; maybe you'll find it useful:

samplesCourse = [9, 10, 10, 11, 13, 15, 16, 19, 19, 21, 23, 28, 30, 33, 34, 36, 44, 45, 47, 60]

class sampleSet:
    def __init__(self, sampleList):
        self.sampleList = sampleList
        self.interList = sorted(sampleList) # interList is a sorted copy, so the original sampleList is left untouched

    def find_median(self):
        self.median = 0

        if len(self.sampleList) % 2 == 0:
            # find median for even-numbered sample list length
            self.medL = self.interList[int(len(self.interList)/2)-1]
            self.medU = self.interList[int(len(self.interList)/2)]
            self.median = (self.medL + self.medU)/2

        else:
            # find median for odd-numbered sample list length
            self.median = self.interList[int((len(self.interList)-1)/2)]
        return self.median

    def find_1stQuartile(self, median):
        self.Q1 = 0

        # break out lower 50 percentile; for an odd-length list the median itself is excluded
        half = int(len(self.interList)/2)
        self.lower50List = self.interList[:half]

        # find 1st quartile (median of lower 50 percentiles)
        if len(self.lower50List) % 2 == 0:
            self.Q1L = self.lower50List[int(len(self.lower50List)/2)-1]
            self.Q1U = self.lower50List[int(len(self.lower50List)/2)]
            self.Q1 = (self.Q1L + self.Q1U)/2

        else:
            self.Q1 = self.lower50List[int((len(self.lower50List)-1)/2)]

        return self.Q1

    def find_3rdQuartile(self, median):
        self.Q3 = 0

        # break out upper 50 percentile; for an odd-length list the median itself is excluded
        half = int(len(self.interList)/2)
        self.upper50List = self.interList[len(self.interList)-half:]

        # find 3rd quartile (median of upper 50 percentiles)
        if len(self.upper50List) % 2 == 0:
            self.Q3L = self.upper50List[int(len(self.upper50List)/2)-1]
            self.Q3U = self.upper50List[int(len(self.upper50List)/2)]
            self.Q3 = (self.Q3L + self.Q3U)/2

        else:
            self.Q3 = self.upper50List[int((len(self.upper50List)-1)/2)]

        return self.Q3

    def find_InterQuartileRange(self, Q1, Q3):
        self.IQR = self.Q3 - self.Q1
        return self.IQR

    def find_UpperFence(self, Q3, IQR):
        self.fence = self.Q3 + 1.5 * self.IQR
        return self.fence

samples = sampleSet(samplesCourse)
median = samples.find_median()
firstQ = samples.find_1stQuartile(median)
thirdQ = samples.find_3rdQuartile(median)
iqr = samples.find_InterQuartileRange(firstQ, thirdQ)
fence = samples.find_UpperFence(thirdQ, iqr)

print("Median is: ", median)
print("1st quartile is: ", firstQ)
print("3rd quartile is: ", thirdQ)
print("IQR is: ", iqr)
print("Upper fence is: ", fence)
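
For comparison, the same median-of-halves idea can be sketched more compactly with the standard library's statistics.median. The function name quartile_summary and the lower fence are additions for illustration only and do not appear in the class above:

```python
import statistics

def quartile_summary(values):
    """Quartiles via the median-of-halves method, plus IQR and 1.5 * IQR fences."""
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two values")
    half = len(data) // 2
    lower = data[:half]               # values before the median position
    upper = data[len(data) - half:]   # values after it; an odd-length list's middle value is in neither
    q1 = statistics.median(lower)
    q3 = statistics.median(upper)
    iqr = q3 - q1
    return {
        "q1": q1,
        "median": statistics.median(data),
        "q3": q3,
        "iqr": iqr,
        "lower_fence": q1 - 1.5 * iqr,  # assumption: mirrors the upper fence
        "upper_fence": q3 + 1.5 * iqr,
    }

samplesCourse = [9, 10, 10, 11, 13, 15, 16, 19, 19, 21, 23, 28, 30, 33, 34, 36, 44, 45, 47, 60]
summary = quartile_summary(samplesCourse)
print(summary)
# {'q1': 14.0, 'median': 22.0, 'q3': 35.0, 'iqr': 21.0, 'lower_fence': -17.5, 'upper_fence': 66.5}
```

Values outside the two fences are commonly flagged as potential outliers; in this sample none are, since the maximum is 60.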

Answered by Yustina Ivanova

You can use

df.describe()

which would show the information
