How to calculate the first and third quartiles in Python?
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/45926230/
How to calculate 1st and 3rd quartiles?
Asked by Dinosaurius
I have a DataFrame:
   time_diff  avg_trips
0   0.450000        1.0
1   0.483333        1.0
2   0.500000        1.0
3   0.516667        1.0
4   0.533333        2.0
I want to get the 1st quartile, the 3rd quartile, and the median of the column time_diff. To obtain the median, I use np.median(df["time_diff"].values).
How can I calculate the quartiles?
Answered by MSeifert
You can use np.percentile to calculate quartiles (including the median):
>>> np.percentile(df.time_diff, 25) # Q1
0.48333300000000001
>>> np.percentile(df.time_diff, 50) # median
0.5
>>> np.percentile(df.time_diff, 75) # Q3
0.51666699999999999
Or all at once:
>>> np.percentile(df.time_diff, [25, 50, 75])
array([ 0.483333, 0.5 , 0.516667])
Answered by YOBEN_S
By using pandas:
df.time_diff.quantile([0.25,0.5,0.75])
Out[793]:
0.25 0.483333
0.50 0.500000
0.75 0.516667
Name: time_diff, dtype: float64
Answered by piRSquared
Coincidentally, this information is captured by the describe method:
df.time_diff.describe()
count 5.000000
mean 0.496667
std 0.032059
min 0.450000
25% 0.483333
50% 0.500000
75% 0.516667
max 0.533333
Name: time_diff, dtype: float64
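If only the three quartile rows are needed, they can be picked out of the describe() result by label. A minimal sketch, assuming pandas is installed and rebuilding the question's time_diff column by hand (the names summary and quartiles are illustrative, not from the original answer):

```python
import pandas as pd

# Rebuild the time_diff column from the question.
df = pd.DataFrame({"time_diff": [0.450000, 0.483333, 0.500000, 0.516667, 0.533333]})

# describe() on a numeric column returns a Series indexed by labels
# such as "25%", "50%" and "75%".
summary = df["time_diff"].describe()
quartiles = summary.loc[["25%", "50%", "75%"]]

q1, median, q3 = quartiles.tolist()
print(q1, median, q3)
```

This reads the same numbers that describe() prints, so it inherits pandas' default linear interpolation.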
Answered by Cyrus
np.percentile DOES NOT calculate the values of Q1, median, and Q3. Consider the sorted list below:
samples = [1, 1, 8, 12, 13, 13, 14, 16, 19, 22, 27, 28, 31]
Running np.percentile(samples, [25, 50, 75]) returns the actual values from the list:
Out[1]: array([12., 14., 22.])
However, the quartiles are Q1=10.0, Median=14, Q3=24.5 (you can also use this link to find the quartiles and median online).
One can use the code below to calculate the quartiles and median of a sorted list (because of the sorting, this approach requires O(n log n) computations, where n is the number of items).
Moreover, finding the quartiles and median can be done in O(n) computations using the median-of-medians selection algorithm (order statistics).
samples = sorted([28, 12, 8, 27, 16, 31, 14, 13, 19, 1, 1, 22, 13])

def find_median(sorted_list):
    """Return the median and the index or indices it was read from."""
    indices = []
    list_size = len(sorted_list)
    if list_size % 2 == 0:
        indices.append(list_size // 2 - 1)  # -1 because index starts from 0
        indices.append(list_size // 2)
        median = (sorted_list[indices[0]] + sorted_list[indices[1]]) / 2
    else:
        indices.append(list_size // 2)
        median = sorted_list[indices[0]]
    return median, indices

median, median_indices = find_median(samples)
# The lower half ends before the last median index and the upper half starts
# after the first one, so an even-length list is split exactly in two and an
# odd-length list drops only the median itself.
Q1, Q1_indices = find_median(samples[:median_indices[-1]])
Q3, Q3_indices = find_median(samples[median_indices[0] + 1:])
quartiles = [Q1, median, Q3]
print("(Q1, median, Q3): {}".format(quartiles))
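For comparison, Python's standard library (3.8 and later) ships statistics.quantiles. A minimal sketch: with its default method="exclusive" it gives the same three values as the code above for this particular sample, though I have not checked that the two approaches agree for every list.

```python
import statistics

samples = [28, 12, 8, 27, 16, 31, 14, 13, 19, 1, 1, 22, 13]

# n=4 asks for the three cut points that split the data into four groups.
# quantiles() sorts the data itself; method defaults to "exclusive".
q1, median, q3 = statistics.quantiles(samples, n=4)
print(q1, median, q3)
```

Passing method="inclusive" instead treats the data as the whole population and generally gives different cut points.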
Answered by Stian Ulriksen
Using np.percentile.
q75, q25 = np.percentile(df["time_diff"], [75, 25])
iqr = q75 - q25
Answer from How do you find the IQR in Numpy?
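As a small worked example on the question's data, here is a sketch assuming NumPy is installed and that the percentiles are wanted for the single time_diff column (passing a whole DataFrame would pool the values of every column). The list time_diff is just that column copied by hand.

```python
import numpy as np

# The time_diff column from the question.
time_diff = [0.450000, 0.483333, 0.500000, 0.516667, 0.533333]

# Ask for both percentiles in one call; the results come back in the
# same order as the requested percentiles.
q75, q25 = np.percentile(time_diff, [75, 25])
iqr = q75 - q25
print(q25, q75, iqr)
```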
Answered by Shikhar Parashar
Building upon, or rather correcting a bit, what Cyrus said...
np.percentile DOES VERY MUCH calculate the values of Q1, median, and Q3. Consider the sorted list below:
s1=[18,45,66,70,76,83,88,90,90,95,95,98]
Running np.percentile(s1, [25, 50, 75]) with the default settings returns linearly interpolated values:
[69. 85.5 91.25]
However, the quartiles are Q1=68.0, Median=85.5, Q3=92.5, which is the correct thing to say.
What we are missing here is the interpolation parameter of np.percentile and related functions. By default, the value of this argument is linear. This optional parameter specifies the interpolation method to use when the desired quantile lies between two data points i < j:
linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.
lower: i.
higher: j.
nearest: i or j, whichever is nearest.
midpoint: (i + j) / 2.
Thus running np.percentile(s1, [25, 50, 75], interpolation='midpoint') returns the actual results for the list:
[68. 85.5 92.5]
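To make the rules listed above concrete, here is a pure-Python sketch of them. It is not NumPy's implementation: the function name percentile and its signature are my own, and "nearest" is left out because its tie-breaking is a detail I did not want to guess at. Note also that NumPy 1.22 renamed this keyword to method and deprecated interpolation, so newer code would pass method='midpoint'.

```python
import math

def percentile(data, q, interpolation="linear"):
    """Percentile q (0-100) of non-empty data, following the rules listed above."""
    values = sorted(data)
    position = (q / 100) * (len(values) - 1)  # fractional index into the sorted data
    lo, hi = math.floor(position), math.ceil(position)
    i, j = values[lo], values[hi]             # the two neighbouring data points
    fraction = position - lo
    if interpolation == "linear":
        return i + (j - i) * fraction
    if interpolation == "lower":
        return i
    if interpolation == "higher":
        return j
    if interpolation == "midpoint":
        return (i + j) / 2
    raise ValueError("unsupported interpolation: " + interpolation)

s1 = [18, 45, 66, 70, 76, 83, 88, 90, 90, 95, 95, 98]
linear = [percentile(s1, q) for q in (25, 50, 75)]
midpoint = [percentile(s1, q, "midpoint") for q in (25, 50, 75)]
print(linear)    # [69.0, 85.5, 91.25]
print(midpoint)  # [68.0, 85.5, 92.5]
```

The 25th percentile falls at index 2.75, between 66 and 70, which is why linear gives 69 and midpoint gives 68.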
Answered by Ian Jones
In my efforts to learn object-oriented programming alongside statistics, I made this; maybe you'll find it useful:
samplesCourse = [9, 10, 10, 11, 13, 15, 16, 19, 19, 21, 23, 28, 30, 33, 34, 36, 44, 45, 47, 60]

class sampleSet:
    def __init__(self, sampleList):
        self.sampleList = sampleList
        # interList is a sorted copy, so the original sampleList is left untouched
        self.interList = sorted(sampleList)

    def find_median(self):
        self.median = 0
        if len(self.interList) % 2 == 0:
            # find median for even-numbered sample list length
            self.medL = self.interList[len(self.interList) // 2 - 1]
            self.medU = self.interList[len(self.interList) // 2]
            self.median = (self.medL + self.medU) / 2
        else:
            # find median for odd-numbered sample list length
            self.median = self.interList[(len(self.interList) - 1) // 2]
        return self.median

    def find_1stQuartile(self, median):
        self.Q1 = 0
        # break out lower 50 percentile; for odd lengths the slice stops just
        # before the median, so nothing has to be popped from interList
        self.lower50List = self.interList[:len(self.interList) // 2]
        # find 1st quartile (median of lower 50 percentiles)
        if len(self.lower50List) % 2 == 0:
            self.Q1L = self.lower50List[len(self.lower50List) // 2 - 1]
            self.Q1U = self.lower50List[len(self.lower50List) // 2]
            self.Q1 = (self.Q1L + self.Q1U) / 2
        else:
            self.Q1 = self.lower50List[(len(self.lower50List) - 1) // 2]
        return self.Q1

    def find_3rdQuartile(self, median):
        self.Q3 = 0
        # break out upper 50 percentile; for odd lengths the slice starts just
        # after the median
        self.upper50List = self.interList[(len(self.interList) + 1) // 2:]
        # find 3rd quartile (median of upper 50 percentiles)
        if len(self.upper50List) % 2 == 0:
            self.Q3L = self.upper50List[len(self.upper50List) // 2 - 1]
            self.Q3U = self.upper50List[len(self.upper50List) // 2]
            self.Q3 = (self.Q3L + self.Q3U) / 2
        else:
            self.Q3 = self.upper50List[(len(self.upper50List) - 1) // 2]
        return self.Q3

    def find_InterQuartileRange(self, Q1, Q3):
        self.IQR = Q3 - Q1
        return self.IQR

    def find_UpperFence(self, Q3, IQR):
        self.fence = Q3 + 1.5 * IQR
        return self.fence

samples = sampleSet(samplesCourse)
median = samples.find_median()
firstQ = samples.find_1stQuartile(median)
thirdQ = samples.find_3rdQuartile(median)
iqr = samples.find_InterQuartileRange(firstQ, thirdQ)
fence = samples.find_UpperFence(thirdQ, iqr)
print("Median is: ", median)
print("1st quartile is: ", firstQ)
print("3rd quartile is: ", thirdQ)
print("IQR is: ", iqr)
print("Upper fence is: ", fence)