Python: Compute a confidence interval from sample data
Disclaimer: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/15033511/
Compute a confidence interval from sample data
Asked by Bmayer0122
I have sample data which I would like to compute a confidence interval for, assuming a normal distribution.
I have found and installed the numpy and scipy packages and have gotten numpy to return a mean and standard deviation (numpy.mean(data) with data being a list). Any advice on getting a sample confidence interval would be much appreciated.
Accepted answer by shasan
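import numpy as np
import scipy.stats


def mean_confidence_interval(data, confidence=0.95):
    # Convert to a float array so integer input is handled correctly.
    a = 1.0 * np.array(data)
    n = len(a)
    # Sample mean and standard error of the mean.
    m, se = np.mean(a), scipy.stats.sem(a)
    # Half-width from the two-sided Student's t critical value (n-1 degrees of freedom).
    h = se * scipy.stats.t.ppf((1 + confidence) / 2., n-1)
    return m, m-h, m+h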
You can calculate it this way.
Answer by bogatron
Start by looking up the z-value for your desired confidence level in a look-up table. The confidence interval is then mean +/- z*sigma, where sigma is the estimated standard deviation of your sample mean, given by sigma = s / sqrt(n), where s is the standard deviation computed from your sample data and n is your sample size.
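The recipe above can be sketched with the standard library alone. This is a minimal illustration, not the answerer's own code: the function name z_confidence_interval and the sample values are hypothetical, and statistics.NormalDist (Python 3.8+) stands in for the look-up table.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def z_confidence_interval(data, confidence=0.95):
    """Return (low, high) for the mean using mean +/- z * s / sqrt(n)."""
    n = len(data)
    # Two-sided critical value; replaces the look-up table (about 1.96 for 95%).
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    # Estimated standard deviation of the sample mean: s / sqrt(n),
    # where s is the sample standard deviation (n-1 denominator).
    sigma = stdev(data) / sqrt(n)
    m = mean(data)
    return m - z * sigma, m + z * sigma


# Hypothetical sample, for illustration only.
sample = [10, 11, 12, 13]
low, high = z_confidence_interval(sample)
print(low, high)
```

For this four-point sample the result is roughly (10.23, 12.77). Because the z-value ignores the extra uncertainty in s, an interval built this way is best treated as a large-sample approximation.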
Answer by Ulrich Stern
Here is a shortened version of shasan's code, calculating the 95% confidence interval of the mean of array a:
import numpy as np, scipy.stats as st
st.t.interval(0.95, len(a)-1, loc=np.mean(a), scale=st.sem(a))
But using StatsModels' tconfint_mean is arguably even nicer:
import statsmodels.stats.api as sms
sms.DescrStatsW(a).tconfint_mean()
The underlying assumptions for both are that the sample (array a) was drawn independently from a normal distribution with unknown standard deviation (see MathWorld or Wikipedia).
For large sample size n, the sample mean is approximately normally distributed, and one can calculate its confidence interval using st.norm.interval() (as suggested in Jaime's comment). But the above solutions are also correct for small n, where st.norm.interval() gives confidence intervals that are too narrow (i.e., "fake confidence"). See my answer to a similar question for more details (and one of Russ's comments here).
Here is an example where the correct options give (essentially) identical confidence intervals:
In [9]: a = range(10,14)
In [10]: mean_confidence_interval(a)
Out[10]: (11.5, 9.4457397432391215, 13.554260256760879)
In [11]: st.t.interval(0.95, len(a)-1, loc=np.mean(a), scale=st.sem(a))
Out[11]: (9.4457397432391215, 13.554260256760879)
In [12]: sms.DescrStatsW(a).tconfint_mean()
Out[12]: (9.4457397432391197, 13.55426025676088)
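As a cross-check, the numbers above can be reproduced by hand for a = (10, 11, 12, 13). The t critical value of about 3.1824 is the 0.975 quantile of Student's t with 3 degrees of freedom; the remaining values are rounded to four decimals.

```latex
\begin{aligned}
\bar{x} &= \tfrac{10 + 11 + 12 + 13}{4} = 11.5 \\
s &= \sqrt{\tfrac{(-1.5)^2 + (-0.5)^2 + 0.5^2 + 1.5^2}{4 - 1}} = \sqrt{\tfrac{5}{3}} \approx 1.2910 \\
\mathrm{SE} &= \frac{s}{\sqrt{n}} \approx \frac{1.2910}{2} \approx 0.6455 \\
h &= t_{0.975,\,3} \cdot \mathrm{SE} \approx 3.1824 \times 0.6455 \approx 2.0543 \\
\bar{x} \pm h &\approx (9.4457,\ 13.5543)
\end{aligned}
```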
And finally, the incorrect result using st.norm.interval():
In [13]: st.norm.interval(0.95, loc=np.mean(a), scale=st.sem(a))
Out[13]: (10.23484868811834, 12.76515131188166)
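One way to see why the normal-based interval is too narrow for small n is a quick simulation. The sketch below is an illustration only, with a hypothetical helper z_interval_coverage and arbitrary choices for the seed, trial counts and true distribution. It draws many samples from a known normal distribution and counts how often the z-based interval actually contains the true mean.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean, stdev


def z_interval_coverage(n, trials=20000, confidence=0.95, seed=0):
    """Fraction of normal-based (z) intervals that contain the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    true_mean, true_sd = 0.0, 1.0  # assumed population, for illustration
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        # Half-width of the z-based interval: z * s / sqrt(n).
        half_width = z * stdev(sample) / sqrt(n)
        if abs(fmean(sample) - true_mean) <= half_width:
            hits += 1
    return hits / trials


small = z_interval_coverage(4)
large = z_interval_coverage(100, trials=2000)
print(f"coverage with n=4:   {small:.3f}")
print(f"coverage with n=100: {large:.3f}")
```

With n=4 the observed coverage should come out noticeably below the nominal 95%, while with n=100 it should be close to it, which is the "fake confidence" effect described above.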
Answer by Xavier Guihot
Starting with Python 3.8, the standard library provides the NormalDist object as part of the statistics module:
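from statistics import NormalDist


def confidence_interval(data, confidence=0.95):
    dist = NormalDist.from_samples(data)
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf((1 + confidence) / 2.)
    # from_samples() already uses the sample (n-1) standard deviation,
    # so the standard error divides by sqrt(n), not sqrt(n-1).
    h = dist.stdev * z / (len(data) ** .5)
    return dist.mean - h, dist.mean + h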
This:
- Creates a NormalDist object from the data sample (NormalDist.from_samples(data)), which gives us access to the sample's mean and standard deviation via NormalDist.mean and NormalDist.stdev.
- Computes the Z-score based on the standard normal distribution (represented by NormalDist()) for the given confidence using the inverse of the cumulative distribution function (inv_cdf).
- Produces the confidence interval based on the sample's standard deviation and mean.
This assumes the sample size is big enough (let's say more than ~100 points) to justify using the standard normal distribution rather than Student's t distribution to compute the z value.

