Original question: http://stackoverflow.com/questions/27600207/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Why does numpy std() give a different result to matlab std()?
Asked by gustavgans
I am trying to convert MATLAB code to NumPy and found that NumPy's std function gives a different result.
In MATLAB:
std([1,3,4,6])
ans = 2.0817
In NumPy:
>>> import numpy as np
>>> np.std([1,3,4,6])
1.8027756377319946
Is this normal? And how should I handle this?
Accepted answer by Alex Riley
The NumPy function np.std takes an optional parameter ddof: "Delta Degrees of Freedom". By default, this is 0. Set it to 1 to get the MATLAB result:
>>> np.std([1,3,4,6], ddof=1)
2.0816659994661326
To add a little more context, in the calculation of the variance (of which the standard deviation is the square root) we typically divide by the number of values we have.
But if we select a random sample of N elements from a larger distribution and calculate the variance, division by N can lead to an underestimate of the actual variance. To fix this, we can lower the number we divide by (the degrees of freedom) to a number less than N (usually N-1). The ddof parameter allows us to change the divisor by the amount we specify.
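The underestimate can be seen numerically. The following is a minimal sketch, assuming a normal distribution with a known variance of 4 and many small samples of size 5; the seed, sample size, and number of repetitions are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

n = 5            # size of each random sample (illustrative choice)
true_var = 4.0   # variance of the underlying normal distribution (std = 2)
samples = rng.normal(loc=0.0, scale=2.0, size=(200_000, n))

# Variance of every sample, dividing by N (ddof=0) and by N - 1 (ddof=1),
# then averaged over all samples.
mean_var_n = samples.var(axis=1, ddof=0).mean()
mean_var_n_minus_1 = samples.var(axis=1, ddof=1).mean()

print(mean_var_n)          # close to (n - 1) / n * true_var = 3.2
print(mean_var_n_minus_1)  # close to true_var = 4.0
```

Averaged over many samples, dividing by N lands noticeably below the true variance, while dividing by N-1 lands close to it.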
Unless told otherwise, NumPy will calculate the biased estimator for the variance (ddof=0, dividing by N). This is what you want if you are working with the entire distribution (and not a subset of values which have been randomly picked from a larger distribution). If the ddof parameter is given, NumPy divides by N - ddof instead.
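To make the divisor explicit, here is a small hand-written version of the calculation compared against np.std. The helper name std_with_ddof is made up for this illustration and is not part of NumPy.

```python
import math
import numpy as np

def std_with_ddof(values, ddof=0):
    """Standard deviation with divisor N - ddof (illustrative helper, not a NumPy API)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_deviations / (n - ddof))

data = [1, 3, 4, 6]
# Each line prints the hand-written result next to NumPy's result.
print(std_with_ddof(data, ddof=0), np.std(data, ddof=0))
print(std_with_ddof(data, ddof=1), np.std(data, ddof=1))
```

For the data in the question the squared deviations sum to 13, so the two divisors give sqrt(13/4) and sqrt(13/3), which are the NumPy and MATLAB values respectively.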
The default behaviour of MATLAB's std is to correct the bias for sample variance by dividing by N-1. This gets rid of some (but probably not all) of the bias in the standard deviation. This is likely to be what you want if you're using the function on a random sample of a larger distribution.
The nice answer by @hbaderts gives further mathematical details.
Answer by hbaderts
The standard deviation is the square root of the variance. The variance of a random variable $X$ is defined as

$$\operatorname{Var}(X) = E\left[(X - E[X])^2\right]$$

An estimator for the variance would therefore be

$$s_n^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

where $\bar{x}$ denotes the sample mean. For randomly selected $x_i$, it can be shown that the expected value of this estimator is not the real variance $\sigma^2$, but

$$E\left[s_n^2\right] = \frac{n-1}{n}\,\sigma^2$$

If you randomly select samples and estimate the sample mean and variance, you will have to use a corrected (unbiased) estimator

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

whose expected value is $\sigma^2$. The correction term $n-1$ is also called Bessel's correction.
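For readers who want the intermediate step, here is a sketch of why dividing by $n$ underestimates the variance. It assumes the $x_i$ are independent draws from a distribution with mean $\mu$ and variance $\sigma^2$ (symbols introduced here for illustration), and uses $E[x_i^2] = \sigma^2 + \mu^2$ and $E[\bar{x}^2] = \sigma^2/n + \mu^2$.

```latex
\begin{aligned}
E\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right]
  &= E\left[\sum_{i=1}^{n} x_i^2\right] - n\,E\left[\bar{x}^2\right] \\
  &= n\left(\sigma^2 + \mu^2\right) - n\left(\frac{\sigma^2}{n} + \mu^2\right) \\
  &= (n - 1)\,\sigma^2
\end{aligned}
```

Dividing the sum by $n$ therefore gives an expected value of $\frac{n-1}{n}\sigma^2$, while dividing by $n-1$ gives exactly $\sigma^2$.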
Now by default, MATLAB's std calculates the unbiased estimator with the correction term n-1. NumPy however (as @ajcr explained) calculates the biased estimator with no correction term by default. The parameter ddof lets you set any correction term n-ddof. By setting it to 1 you get the same result as in MATLAB.
Similarly, MATLAB lets you pass a second parameter w, which specifies the "weighting scheme". The default, w=0, results in the correction term n-1 (unbiased estimator), while for w=1, only n is used as the correction term (biased estimator).
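When porting code, the two conventions can be bridged with a thin wrapper. The sketch below uses a made-up name, matlab_like_std, and only covers the scalar cases w=0 and w=1 on a one-dimensional input; MATLAB also accepts a weight vector for w and handles matrices column by column, neither of which is attempted here.

```python
import numpy as np

def matlab_like_std(x, w=0):
    """Mimic MATLAB's std(x, w) for w in {0, 1} on 1-D input (hypothetical helper)."""
    if w not in (0, 1):
        raise ValueError("only w=0 and w=1 are handled in this sketch")
    # w=0 divides by n - 1 (ddof=1); w=1 divides by n (ddof=0).
    return np.std(np.asarray(x, dtype=float), ddof=1 - w)

print(matlab_like_std([1, 3, 4, 6]))       # about 2.0817, the MATLAB value from the question
print(matlab_like_std([1, 3, 4, 6], w=1))  # about 1.8028, NumPy's default result
```

Keeping MATLAB's default of w=0 in the wrapper means translated calls such as std(x) keep their original meaning without every call site needing an explicit ddof.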
Answer by MJM
For people who aren't great with statistics, a simplistic guide is:
Include ddof=1 if you're calculating np.std() for a sample taken from your full dataset.
Ensure ddof=0 if you're calculating np.std() for the full population.
Setting ddof=1 for samples counterbalances the bias that arises when a variance is estimated from a sample instead of the full population.