How to choose bins in matplotlib histogram
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): Stack Overflow
Original source: http://stackoverflow.com/questions/33458566/
Asked by H.H
Can someone explain to me what "bins" in a histogram are (the matplotlib hist function)? And assuming I need to plot the probability density function of some data, how do the bins I choose influence that, and how do I choose them? (I already read about them in the matplotlib.pyplot.hist and numpy.histogram documentation, but I did not get the idea.)
Answer by Oliver Angelil
Bins are the intervals that you divide all of your data into, so that it can be displayed as bars on a histogram. A simple method to work out how many bins are suitable is to take the square root of the total number of values in your distribution.
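As a rough illustration of the square-root rule, the sketch below computes the bin count for a hypothetical sample of 1000 normally distributed values (the sample, the seed, and rounding up rather than down are all assumptions made for this example, not part of the answer).

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is reproducible
x = rng.normal(size=1000)        # hypothetical sample of 1000 values

# Square-root rule: number of bins is about sqrt(number of values), rounded up here
n_bins = int(np.ceil(np.sqrt(x.size)))

# np.histogram returns the count per bin and the n_bins + 1 bin edges
counts, edges = np.histogram(x, bins=n_bins)
print(n_bins, len(edges))
```

For 1000 values this gives 32 bins, since the square root of 1000 is about 31.6.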
Answer by jakevdp
The bins parameter tells you the number of bins that your data will be divided into. You can specify it as an integer or as a list of bin edges.
For example, here we ask for 20 bins:
import numpy as np
import matplotlib.pyplot as plt
x = np.random.randn(1000)
plt.hist(x, bins=20)
And here we ask for bin edges at the locations [-4, -3, -2... 3, 4].
plt.hist(x, bins=range(-4, 5))
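Since the question mentions plotting a probability density, here is a minimal sketch of how the bin count interacts with normalization. It assumes numpy's density=True option (plt.hist accepts the same keyword in recent matplotlib versions) and a hypothetical random sample. Whatever the number of bins, the bar areas sum to 1; what changes is how coarse or noisy the estimate looks.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)        # hypothetical sample

areas = {}
for n_bins in (5, 20, 100):
    # density=True rescales counts so that sum(height * width) == 1
    density, edges = np.histogram(x, bins=n_bins, density=True)
    areas[n_bins] = np.sum(density * np.diff(edges))

print(areas)
```

Few bins smooth away detail in the shape of the distribution, while many bins leave each bar with only a handful of samples, so the heights fluctuate more.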
Your question about how to choose the "best" number of bins is an interesting one, and there's actually a fairly vast literature on the subject. There are some commonly-used rules-of-thumb that have been proposed (e.g. the Freedman-Diaconis Rule, Sturges' Rule, Scott's Rule, the Square-root rule, etc.) each of which has its own strengths and weaknesses.
If you want a nice Python implementation of a variety of these auto-tuning histogram rules, you might check out the histogram functionality in the latest version of the AstroPy package, described here. This works just like plt.hist, but lets you use syntax like, e.g., hist(x, bins='freedman') for choosing bins via the Freedman-Diaconis rule mentioned above.
My personal favorite is "Bayesian Blocks" (bins="blocks"), which solves for optimal binning with unequal bin widths. You can read a bit more on that here.
Edit, April 2017: with matplotlib version 2.0 or later and numpy version 1.11 or later, you can now specify automatically-determined bins directly in matplotlib, by specifying, e.g., bins='auto'. This uses the maximum of the Sturges and Freedman-Diaconis bin choice. You can read more about the options in the numpy.histogram docs.
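To see what bins='auto' does in practice, the following sketch compares the bin counts numpy picks for the 'sturges', 'fd' (Freedman-Diaconis) and 'auto' options. It uses np.histogram rather than plt.hist so that it runs without a plotting backend; the sample and the seed are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)        # hypothetical sample

n_bins = {}
for rule in ("sturges", "fd", "auto"):
    # A string selects one of numpy's automatic bin-width estimators
    counts, edges = np.histogram(x, bins=rule)
    n_bins[rule] = len(counts)

print(n_bins)
```

With floating-point data like this, the 'auto' count should equal the larger of the other two; the same strings can be passed to plt.hist(x, bins=...).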
Answer by idnavid
You're correct in expecting that the number of bins has significant impact on approximating the true underlying distribution. I haven't read the original paper myself, but according to Scott 1979, a good rule of thumb is to use:
number of bins = R(n^(1/3))/(3.49σ)
where
R is the range of data (in your case R = 3-(-3) = 6),
n is the number of samples,
σ is your standard deviation.
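The formula above can be evaluated directly. The sketch below does so for a hypothetical normal sample and, as a sanity check, compares the result with numpy's built-in 'scott' estimator; the sample, the seed, and rounding up to a whole number of bins are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)        # hypothetical sample

R = x.max() - x.min()            # range of the data
n = x.size                       # number of samples
sigma = x.std()                  # standard deviation

# Scott's rule as written above, rounded up to a whole number of bins
k = R * n ** (1 / 3) / (3.49 * sigma)
n_bins = int(np.ceil(k))

# numpy's 'scott' option uses the equivalent bin width 3.49 * sigma * n**(-1/3)
n_bins_numpy = len(np.histogram(x, bins="scott")[0])
print(n_bins, n_bins_numpy)
```

The two counts should agree, or differ by at most one bin because of rounding in the constant 3.49.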
Answer by gerrit
To complement jakevdp's answer, you can use numpy.histogram_bin_edges if you just want to calculate the optimal bin edges, without actually doing the histogram. histogram_bin_edges is a function specifically designed for the optimal calculation of bin edges. You can choose seven different algorithms for the optimisation.
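A minimal sketch of that workflow, assuming numpy 1.15 or later (where histogram_bin_edges is available) and a hypothetical random sample: compute the edges once with a chosen estimator, then reuse them wherever a list of bin edges is accepted.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)        # hypothetical sample

# Compute only the bin edges, here with the Freedman-Diaconis estimator
edges = np.histogram_bin_edges(x, bins="fd")

# The edges can then be reused, e.g. in np.histogram or plt.hist(x, bins=edges)
counts, _ = np.histogram(x, bins=edges)
print(len(edges) - 1, counts.sum())
```

Reusing the same edges is handy when several datasets should be binned identically so that their histograms are directly comparable.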