Python 用scipy获得置信区间的正确方法
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/28242593/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Correct way to obtain confidence interval with scipy
提问by Gabriel
I have a 1-dimensional array of data:
我有一个一维数据数组:
a = np.array([1,2,3,4,4,4,5,5,5,5,4,4,4,6,7,8])
for which I want to obtain the 68% confidence interval (ie: the 1 sigma).
我想获得 68% 的置信区间(即:1 sigma)。
The first comment in this answerstates that this can be achieved using scipy.stats.norm.interval
from the scipy.stats.normfunction, via:
在第一个评论这个回答指出,这可以实现使用scipy.stats.norm.interval
从scipy.stats.norm功能,通过:
from scipy import stats
import numpy as np
mean, sigma = np.mean(a), np.std(a)
conf_int = stats.norm.interval(0.68, loc=mean,
scale=sigma)
But a comment in this poststates that the actualcorrect way of obtaining the confidence interval is:
但是这篇文章中的评论指出,获得置信区间的实际正确方法是:
conf_int = stats.norm.interval(0.68, loc=mean,
scale=sigma / np.sqrt(len(a)))
that is, sigma is divided by the square-root of the sample size: np.sqrt(len(a))
.
也就是说,sigma 除以样本大小的平方根:np.sqrt(len(a))
。
The question is: which version is the correct one?
问题是:哪个版本是正确的?
采纳答案by unutbu
The 68% confidence interval for a single drawfrom a normal distribution with mean mu and std deviation sigma is
从具有均值 mu 和标准偏差 sigma 的正态分布单次抽取的 68% 置信区间为
stats.norm.interval(0.68, loc=mu, scale=sigma)
The 68% confidence interval for the mean of N drawsfrom a normal distribution with mean mu and std deviation sigma is
N 均值的 68% 置信区间取自具有均值 mu 和标准差 sigma 的正态分布
stats.norm.interval(0.68, loc=mu, scale=sigma/sqrt(N))
Intuitively, these formulas make sense, since if you hold up a jar of jelly beans and ask a large number of people to guess the number of jelly beans, each individual may be off by a lot -- the same std deviation sigma
-- but the average of the guesses will do a remarkably fine job of estimating the actual number and this is reflected by the standard deviation of the mean shrinking by a factor of 1/sqrt(N)
.
直觉上,这些公式是有道理的,因为如果你举起一罐软糖并让很多人猜测软糖的数量,每个人可能会偏离很多——同样的标准偏差sigma
——但是猜测的平均值将在估计实际数字方面做得非常好,这反映在平均值的标准偏差缩小了1/sqrt(N)
.
If a single draw has variance sigma**2
, then by the Bienaymé formula, the sum of N
uncorrelateddraws has variance N*sigma**2
.
如果单次抽签有方差sigma**2
,那么根据Bienaymé 公式,N
不相关抽签的总和有方差N*sigma**2
。
The mean is equal to the sum divided by N. When you multiply a random variable (like the sum) by a constant, the variance is multiplied by the constant squared. That is
均值等于总和除以 N。当您将随机变量(如总和)乘以常数时,方差乘以常数平方。那是
Var(cX) = c**2 * Var(X)
So the variance of the mean equals
所以均值的方差等于
(variance of the sum)/N**2 = N * sigma**2 / N**2 = sigma**2 / N
and so the standard deviation of the mean (which is the square root of the variance) equals
因此平均值的标准偏差(即方差的平方根)等于
sigma/sqrt(N).
This is the origin of the sqrt(N)
in the denominator.
这就是sqrt(N)
分母中的由来。
Here is some example code, based on Tom's code, which demonstrates the claims made above:
下面是一些基于 Tom 代码的示例代码,它演示了上述声明:
import numpy as np
from scipy import stats
N = 10000
a = np.random.normal(0, 1, N)
mean, sigma = a.mean(), a.std(ddof=1)
conf_int_a = stats.norm.interval(0.68, loc=mean, scale=sigma)
print('{:0.2%} of the single draws are in conf_int_a'
.format(((a >= conf_int_a[0]) & (a < conf_int_a[1])).sum() / float(N)))
M = 1000
b = np.random.normal(0, 1, (N, M)).mean(axis=1)
conf_int_b = stats.norm.interval(0.68, loc=0, scale=1 / np.sqrt(M))
print('{:0.2%} of the means are in conf_int_b'
.format(((b >= conf_int_b[0]) & (b < conf_int_b[1])).sum() / float(N)))
prints
印刷
68.03% of the single draws are in conf_int_a
67.78% of the means are in conf_int_b
Beware that if you define conf_int_b
with the estimates for mean
and sigma
based on the sample a
, the mean may not fall in conf_int_b
with the desired
frequency.
请注意,如果你定义conf_int_b
与估计mean
和sigma
基于样本a
,平均可能不属于在conf_int_b
与所需的频率。
If you take a samplefrom a distribution and compute the sample mean and std deviation,
如果你把样品从分布和计算样本均值和标准方差,
mean, sigma = a.mean(), a.std()
be careful to note that there is no guarantee that these will equal the populationmean and standard deviation and that we are assumingthe population is normally distributed -- those are not automatic givens!
请注意,不能保证这些将等于总体均值和标准差,并且我们假设总体呈正态分布——这些不是自动给出的!
If you take a sample and want to estimatethe population mean and standard deviation, you should use
如果您抽取样本并想要估计总体均值和标准差,您应该使用
mean, sigma = a.mean(), a.std(ddof=1)
since this value for sigma is the unbiased estimatorfor the population standard deviation.
因为 sigma 的这个值是总体标准差的无偏估计量。
回答by Tom
I tested out your methods using an array with a known confidence interval. numpy.random.normal(mu,std,size) returns an array centered on mu with a standard deviation of std (in the docs, this is defined as Standard deviation (spread or “width”) of the distribution.
).
我使用具有已知置信区间的数组测试了您的方法。numpy.random.normal(mu,std,size) 返回一个以 mu 为中心且标准差为 std (在文档中,这被定义为Standard deviation (spread or “width”) of the distribution.
)的数组。
from scipy import stats
import numpy as np
from numpy import random
a = random.normal(0,1,10000)
mean, sigma = np.mean(a), np.std(a)
conf_int_a = stats.norm.interval(0.68, loc=mean, scale=sigma)
conf_int_b = stats.norm.interval(0.68, loc=mean, scale=sigma / np.sqrt(len(a)))
conf_int_a
(-1.0011149125527312, 1.0059797764202412)
conf_int_b
(-0.0076030415111100983, 0.012467905378619625)
As the sigma value should be -1 to 1, the / np.sqrt(len(a))
method appears to be incorrect.
由于 sigma 值应为 -1 到 1,因此该/ np.sqrt(len(a))
方法似乎不正确。
Edit
编辑
Since I don't have the reputation to comment above I'll clarify how this answer ties into unutbu's thorough answer. If you populate a random array with a normal distribution, 68% of the total will fall within 1-σ of the mean. In the case above, if you check that you see
由于我没有在上面发表评论的声誉,我将澄清这个答案如何与 unutbu 的彻底答案联系起来。如果您使用正态分布填充随机数组,则总数的 68% 将落在均值的 1-σ 范围内。在上述情况下,如果您检查是否看到
b = a[np.where((a>-1)&(a <1))]
len(a)
> 6781
or 68% of the population falls within 1σ. Well, about 68%. As you use a larger and larger array, you will approach 68% (In a trial of 10, 9 were between -1 and 1). That's because the 1-σ is the inherent distribution of the data, and the more data you have the better you can resolve it.
或 68% 的人口落在 1σ 以内。嗯,大约 68%。随着您使用越来越大的阵列,您将接近 68%(在 10 次试验中,9 次在 -1 和 1 之间)。那是因为 1-σ 是数据的固有分布,您拥有的数据越多,解决它的能力就越好。
Basically, my interpretation of your question was If I have a sample of data I want to use to describe the distribution they were drawn from, what is the method to find the standard deviation of that data?whereas unutbu's interpretation appears to be more What is the interval to which I can place the mean with 68% confidence?. Which would mean, for jelly beans, I answered How are they guessingand unutbu answered What do their guesses tell us about the jelly beans.
基本上,我对你的问题的解释是,如果我有一个数据样本,我想用它来描述它们所来自的分布,那么找到该数据的标准偏差的方法是什么?而 unutbu 的解释似乎更多我可以将均值置于 68% 置信度的区间是多少?. 这意味着,对于软糖,我回答他们如何猜测,而 unutbu 回答他们的猜测告诉我们关于软糖的什么。
回答by Ulrich Stern
I just checked how R and GraphPad calculate confidence intervals, and they increase the interval in case of small sample size (n). E.g., more than 6-fold for n=2 compared to a large n. This code (based on shasan's answer) matches their confidence intervals:
我刚刚检查了 R 和 GraphPad 如何计算置信区间,并且在样本量较小 (n) 的情况下它们会增加区间。例如,与较大的 n 相比,n=2 时超过 6 倍。此代码(基于 shasan 的回答)匹配他们的置信区间:
import numpy as np, scipy.stats as st
# returns confidence interval of mean
def confIntMean(a, conf=0.95):
mean, sem, m = np.mean(a), st.sem(a), st.t.ppf((1+conf)/2., len(a)-1)
return mean - m*sem, mean + m*sem
For R, I checked against t.test(a). GraphPad's confidence interval of a meanpage has "user level" info on the sample size dependency.
对于 R,我检查了 t.test(a)。GraphPad 的平均页面置信区间具有关于样本大小依赖性的“用户级别”信息。
Here the output for Gabriel's example:
这是 Gabriel 示例的输出:
In [2]: a = np.array([1,2,3,4,4,4,5,5,5,5,4,4,4,6,7,8])
In [3]: confIntMean(a, 0.68)
Out[3]: (3.9974214366806184, 4.877578563319382)
In [4]: st.norm.interval(0.68, loc=np.mean(a), scale=st.sem(a))
Out[4]: (4.0120010966037407, 4.8629989033962593)
Note that the difference between the confIntMean()
and st.norm.interval()
intervals is relatively small here; len(a) == 16 is not too small.
注意这里confIntMean()
和st.norm.interval()
区间的区别比较小;len(a) == 16 不算太小。