Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/17725927/
Boxplots in matplotlib: Markers and outliers
Asked by Amelio Vazquez-Reina
I have some questions about boxplots in matplotlib:
Question A. What do the markers that I highlighted below with Q1, Q2, and Q3 represent? I believe Q1 is the maximum and Q3 are outliers, but what is Q2?
Question B. How does matplotlib identify outliers? (i.e. how does it know that they are not the true max and min values?)
Accepted answer by Joooeey
Here's a graphic that illustrates the components of the box from a stats.stackexchange answer. Note that k=1.5 if you don't supply the whis keyword in Pandas.
The boxplot function in Pandas is a wrapper for matplotlib.pyplot.boxplot. The matplotlib docs explain the components of the boxes in detail:
Question A:
The box extends from the lower to upper quartile values of the data, with a line at the median.
That is, a quarter of the input data values lie below the box, a quarter lie in each of the two parts of the box (on either side of the median line), and the remaining quarter lie above the box.
Question B:
whis : float, sequence, or string (default = 1.5)
As a float, determines the reach of the whiskers beyond the first and third quartiles. In other words, where IQR is the interquartile range (Q3-Q1), the upper whisker will extend to the last datum less than Q3 + whis*IQR. Similarly, the lower whisker will extend to the first datum greater than Q1 - whis*IQR. Beyond the whiskers, data are considered outliers and are plotted as individual points.
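The rule quoted above can be sketched in plain Python. This is only an illustration, not matplotlib's own code: the function name whiskers_and_outliers and the sample list are made up for this example, and the quartiles come from statistics.quantiles with method="inclusive" (linear interpolation between data points), which is an assumption about how the quartiles are estimated.

```python
from statistics import quantiles


def whiskers_and_outliers(data, whis=1.5):
    """Return (low whisker, high whisker, outliers) using the whis*IQR rule.

    Simplified sketch: assumes at least one data point lies inside the fences.
    """
    values = sorted(data)
    # Quartiles by linear interpolation between data points (an assumption).
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lo_fence = q1 - whis * iqr
    hi_fence = q3 + whis * iqr
    # Whiskers end at the most extreme data points still inside the fences.
    inside = [v for v in values if lo_fence <= v <= hi_fence]
    # Everything beyond the fences is drawn as an individual outlier point.
    outliers = [v for v in values if v < lo_fence or v > hi_fence]
    return inside[0], inside[-1], outliers


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # hypothetical data
low, high, outliers = whiskers_and_outliers(sample)
print(low, high, outliers)  # 1 9 [30]
```

Here Q1 = 3.25 and Q3 = 7.75, so the fences sit at -3.5 and 14.5. The whiskers stop at the real data points 1 and 9, not at the fences themselves, and 30 is reported as an outlier.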
Matplotlib (and Pandas) also give you a lot of options to change this default definition of the whiskers:
Set this to an unreasonably high value to force the whiskers to show the min and max values. Alternatively, set this to an ascending sequence of percentile (e.g., [5, 95]) to set the whiskers at specific percentiles of the data. Finally, whis can be the string 'range' to force the whiskers to the min and max of the data.
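A small sketch of these options, assuming NumPy and matplotlib are installed. It uses matplotlib.cbook.boxplot_stats, the helper that computes the statistics a boxplot draws, so no figure is needed. The random data is made up for the example. As far as I know, newer matplotlib releases no longer accept the string 'range' and expect the percentile pair (0, 100) instead, so the sketch uses that form.

```python
import numpy as np
from matplotlib import cbook

rng = np.random.default_rng(0)
data = rng.normal(size=1000)  # hypothetical sample

default = cbook.boxplot_stats(data)[0]              # whis=1.5 (IQR rule)
pct = cbook.boxplot_stats(data, whis=(5, 95))[0]    # 5th/95th percentiles
full = cbook.boxplot_stats(data, whis=(0, 100))[0]  # min and max of the data

for name, stats in [("1.5*IQR", default), ("5-95%", pct), ("min-max", full)]:
    # whislo/whishi are the whisker ends, fliers are the outlier points
    print(name, stats["whislo"], stats["whishi"], len(stats["fliers"]))
```

The same whis values can be passed straight to plt.boxplot(data, whis=...). With percentile whiskers, roughly 10% of the points end up as fliers, while the min-max setting leaves none.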
Answer by seth
The box spans the first to the third quartile, with the red line marking the median (2nd quartile). The documentation gives the default whiskers at 1.5 IQR:
boxplot(x, notch=False, sym='+', vert=True, whis=1.5,
        positions=None, widths=None, patch_artist=False,
        bootstrap=None, usermedians=None, conf_intervals=None)
and
whis : [ default 1.5 ]
Defines the length of the whiskers as a function of the inner quartile range. They extend to the most extreme data point within ( whis*(75%-25%) ) data range.
If you're confused about different box plot representations, try reading the description on Wikipedia.
Answer by Dirk
In addition to seth's answer (since the documentation is not very precise regarding this): Q1 (the whiskers) are placed at the maximum value below 75% + 1.5 IQR (and at the minimum value above 25% - 1.5 IQR).
This is the code that computes the whisker positions:
# Excerpt: d is the 1-D data array, q1 and q3 are its quartiles,
# whis is the whisker factor, and np is numpy.
# get high extreme
iq = q3 - q1
hi_val = q3 + whis * iq
wisk_hi = np.compress(d <= hi_val, d)
if len(wisk_hi) == 0 or np.max(wisk_hi) < q3:
    wisk_hi = q3
else:
    wisk_hi = max(wisk_hi)
# get low extreme
lo_val = q1 - whis * iq
wisk_lo = np.compress(d >= lo_val, d)
if len(wisk_lo) == 0 or np.min(wisk_lo) > q1:
    wisk_lo = q1
else:
    wisk_lo = min(wisk_lo)
Answer by Amelio Vazquez-Reina
A picture is worth a thousand words. Note that the outliers (the + markers in your plot) are simply points outside of the wide [(Q1-1.5 IQR), (Q3+1.5 IQR)] margin below.
However, the picture is only an example for a normally distributed data set. It is important to understand that matplotlib does not estimate a normal distribution first and then calculate the quartiles from the estimated distribution parameters as shown above.
Instead, the median and the quartiles are calculated directly from the data. Thus, your boxplot may look different depending on the distribution of your data and the size of the sample, e.g., asymmetric and with more or fewer outliers.
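To see how much the distribution matters, here is a rough sketch that counts the points beyond the 1.5 IQR fences for a symmetric sample and a skewed one. The helper count_outliers, the seed, and the sample sizes are all made up for this illustration, and the quartiles are taken directly from the data with statistics.quantiles (an assumption about the estimation method).

```python
import random
from statistics import quantiles


def count_outliers(data, whis=1.5):
    """Count the points below and above the whis*IQR fences."""
    # Quartiles come straight from the data, no distribution is fitted.
    q1, _median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low = sum(v < q1 - whis * iqr for v in data)
    high = sum(v > q3 + whis * iqr for v in data)
    return low, high


rng = random.Random(42)
normal = [rng.gauss(0, 1) for _ in range(10_000)]       # symmetric sample
skewed = [rng.expovariate(1.0) for _ in range(10_000)]  # right-skewed sample

print("normal (low, high):", count_outliers(normal))
print("skewed (low, high):", count_outliers(skewed))
```

The normal sample should show a similar, small number of outliers on both sides. The exponential sample has none below the box, because its lower fence is negative while the data are not, and many more above it.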
Answer by June Skeeter
Answer by Michael James Kali Galarnyk
The image below shows the different parts of a boxplot.
Quartile 1/Q1: 25th percentile.
Interquartile range (IQR): 25th percentile to the 75th percentile.
Median (Quartile 2/Q2): 50th percentile.
Quartile 3/Q3: 75th percentile.
I should note that the blue parts are the whiskers of the boxplot.
The image below compares the box plot of a normal distribution against the probability density function. It should help explain the "Minimum", "Maximum", and outliers.
"Minimum": (Q1-1.5 IQR)
"Maximum": (Q3+1.5 IQR)
As zelusp said, 99.3% of data is contained within ±2.698σ (standard deviations) of the mean for a normal distribution. The green circles (outliers) in the image below are the remaining 0.7% of the data. Here is a derivation of how those numbers came to be.
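Those two numbers can be checked with a few lines of standard-library Python. This is just a sketch for a standard normal distribution (mean 0, standard deviation 1), and the variable names are my own.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# Quartiles of the standard normal distribution
q1 = std_normal.inv_cdf(0.25)
q3 = std_normal.inv_cdf(0.75)
iqr = q3 - q1

# The "Minimum" and "Maximum" fences, in units of sigma
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Share of the distribution that lies between the two fences
coverage = std_normal.cdf(upper_fence) - std_normal.cdf(lower_fence)

print(f"fence: {upper_fence:.3f} sigma, coverage: {coverage:.4f}")
```

Q3 is about 0.6745σ and the IQR about 1.349σ, so the upper fence is 0.6745 + 1.5 × 1.349 ≈ 2.698σ. The area between the fences comes out at about 0.993, which leaves roughly 0.7% of the data as outliers.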