How to normalize a histogram in Python?
Note: this page is a side-by-side English/Chinese translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the original address and author information, and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/22241240/
Asked by user40
I'm trying to plot a normed histogram, but instead of getting 1 as the maximum value on the y axis, I'm getting different numbers.
For the array k=(1,4,3,1):
import numpy as np
def plotGraph():
    import matplotlib.pyplot as plt
    k = (1, 4, 3, 1)
    plt.hist(k, normed=1)  # normed=1 in older Matplotlib; density=True in newer versions
    plt.xticks(np.arange(10))  # 10 ticks on x axis
    plt.show()
plotGraph()
I get this histogram, which doesn't look normed.


For a different array, k=(3,3,3,3):
import numpy as np
def plotGraph():
    import matplotlib.pyplot as plt
    k = (3, 3, 3, 3)
    plt.hist(k, normed=1)  # normed=1 in older Matplotlib; density=True in newer versions
    plt.xticks(np.arange(10))  # 10 ticks on x axis
    plt.show()
plotGraph()
I get this histogram, whose maximum y-value is 10.


For different k I get a different maximum value of y, even though normed=1 or normed=True.
Why does the normalization (if it works) change based on the data, and how can I make the maximum value of y equal to 1?
UPDATE:
I am trying to implement Carsten König's answer from "plotting histograms whose bar heights sum to 1 in matplotlib" and am getting a very weird result:
import numpy as np
def plotGraph():
    import matplotlib.pyplot as plt
    k = (1, 4, 3, 1)
    weights = np.ones_like(k)/len(k)
    plt.hist(k, weights=weights)
    plt.xticks(np.arange(10))  # 10 ticks on x axis
    plt.show()
plotGraph()
Result:


What am I doing wrong?
Thanks
Accepted answer by CT Zhu
When you plot a normalized histogram, it is not the bar heights that should sum to one, but the area under the curve:
In [44]:
import numpy as np
import matplotlib.pyplot as plt
k = (3, 3, 3, 3)
x, bins, p = plt.hist(k, density=True)  # used to be normed=True in older versions
plt.xticks(np.arange(10))  # 10 ticks on x axis
plt.show()
In [45]:
print(bins)
[ 2.5 2.6 2.7 2.8 2.9 3. 3.1 3.2 3.3 3.4 3.5]
In this example the bin width is 0.1 and the only non-empty bin has height 10, so the area under the curve is 0.1*10 = 1.
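The area claim can be checked numerically. The following is a minimal sketch that uses np.histogram (which computes the same bins and heights without drawing anything); the variable names are only illustrative.

```python
import numpy as np

k = (3, 3, 3, 3)
# density=True rescales the counts so that sum(height * bin_width) is 1
heights, edges = np.histogram(k, bins=10, density=True)
widths = np.diff(edges)
area = float(np.sum(heights * widths))

print(heights.max())  # height of the single non-empty bin, roughly 10
print(area)           # total area under the histogram, roughly 1
```

So a maximum of 10 on the y axis is consistent with a correctly normalized density: the bar is tall because the bins are narrow.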
To make the heights sum to 1, add the following before plt.show():
for item in p:
    item.set_height(item.get_height()/sum(x))
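An alternative to rescaling the patches afterwards is the weights approach from the question's update: give every sample a weight of 1/N. The sketch below is an illustration rather than part of the original answer, and it uses np.histogram so the numbers can be inspected directly. The weights are built as floats on purpose: under Python 2, np.ones_like(k)/len(k) on integer data performs integer division and yields all-zero weights, which is a likely cause of the odd result reported in the update.

```python
import numpy as np

k = (1, 4, 3, 1)
# One weight of 1/N per sample; the float dtype avoids integer division on Python 2
weights = np.ones(len(k), dtype=float) / len(k)
fractions, edges = np.histogram(k, bins=10, weights=weights)

print(fractions.sum())  # bar heights now sum to 1
print(fractions.max())  # the value 1 occurs twice out of four samples
```

The same weights array can be passed to plt.hist(k, weights=weights) to draw the bars.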


Answer by zhangxaochen
One way is to compute the probabilities yourself and then plot them with plt.bar:
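In [91]: from collections import Counter
    ...: c = Counter(k)
    ...: print(c)
Counter({1: 2, 4: 1, 3: 1})
In [92]: # divide each count by the number of samples to get a probability
    ...: prob = {key: count / float(len(k)) for key, count in c.items()}
    ...: plt.bar(list(prob.keys()), list(prob.values()))
    ...: plt.show()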
Result:

Answer by kthouz
A normed histogram is defined so that the sum of the products of each column's width and height equals 1, not so that the tallest column equals 1. That is why your maximum is not one.
However, if you still want to force the maximum to be 1, you can use numpy and matplotlib.pyplot.bar in the following way:
import numpy as np
import matplotlib.pyplot as plt

sample = np.random.normal(0, 10, 100)
# generate bin boundaries and heights
bin_height, bin_boundary = np.histogram(sample, bins=10)
# define the width of each column
width = bin_boundary[1] - bin_boundary[0]
# scale each column by dividing by the maximum height
bin_height = bin_height / float(max(bin_height))
# plot; align='edge' puts each bar's left edge on its bin boundary
plt.bar(bin_boundary[:-1], bin_height, width=width, align='edge')
plt.show()
Answer by upceric
Answer by Tova Halász
How should the lines above:
weights = np.ones_like(myarray)/float(len(myarray))
plt.hist(myarray, weights=weights)
work when I have a stacked histogram like this?
n, bins, patches = plt.hist([from6to10, from10to14, from14to18, from18to22, from22to6],
                            label= ['06:00-10:00','10:00-14:00','14:00-18:00','18:00- 22:00','22:00-06:00'],
                            stacked=True,edgecolor='black', alpha=0.8, linewidth=0.5,
                            range=(np.nanmin(ref1arr), np.nanmax(ref1arr)), bins=10)
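One possible way to carry the weights idea over to a stacked histogram (a sketch that does not come from the original thread) is to pass plt.hist a list of weight arrays, one per dataset, with every entry equal to 1 divided by the total number of samples across all datasets. The groups and labels below are hypothetical stand-ins for the time-of-day arrays in the question.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-ins for from6to10, from10to14, ... in the question
rng = np.random.default_rng(0)
groups = [rng.normal(0, 1, size) for size in (50, 80, 30)]
labels = ['group A', 'group B', 'group C']

# One weight array per dataset; each sample counts as 1 / (total sample count)
total = float(sum(len(g) for g in groups))
weights = [np.ones(len(g)) / total for g in groups]

n, bins, patches = plt.hist(groups, weights=weights, label=labels,
                            stacked=True, bins=10)
plt.legend()

# With stacked=True the returned arrays are cumulative, so the last one
# holds the full height of each stacked bar
print(n[-1].sum())
```

Call plt.show() to display the figure. If a range argument is passed as in the question, samples outside that range are dropped, so the heights can then sum to less than 1.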

