How can I calculate the variance of a list in Python?
Source: http://stackoverflow.com/questions/35583302/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Asked by minks
If I have a list like this:
results=[-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]
I want to calculate the variance of this list in Python, which is the average of the squared differences from the mean.
How can I go about this? I am confused about how to access the elements of the list to compute the squared differences.
Accepted answer by Cleb
You can use numpy's built-in function var:
import numpy as np
results = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]
print(np.var(results))
This gives you 28.822364260579157
If, for whatever reason, you cannot use numpy and/or you don't want to use a built-in function for it, you can also calculate it "by hand" using e.g. a list comprehension:
# calculate mean
m = sum(results) / len(results)
# calculate variance using a list comprehension
var_res = sum((xi - m) ** 2 for xi in results) / len(results)
which gives you the identical result.
If you are interested in the standard deviation, you can use numpy.std:
print(np.std(results))
5.36864640860051
@Serge Ballesta explained very well the difference between variance n and n-1. In numpy you can easily set this parameter using the option ddof; its default is 0, so for the n-1 case you can simply do:
np.var(results, ddof=1)
The "by hand" solution is given in @Serge Ballesta's answer.
Both approaches yield 32.024849178421285.
You can also set this parameter for std:
np.std(results, ddof=1)
5.659050201086865
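For readers who cannot use numpy, the effect of ddof can be reproduced in plain Python. The following is a minimal sketch, not numpy's implementation; the names list_variance and list_std are made up for illustration, and ddof is simply the number subtracted from n in the divisor.

```python
import math

def list_variance(values, ddof=0):
    """Variance of a list: ddof=0 divides by n, ddof=1 divides by n-1."""
    n = len(values)
    if n - ddof <= 0:
        raise ValueError("need more than ddof data points")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - ddof)

def list_std(values, ddof=0):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(list_variance(values, ddof))

results = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
           0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]

print(list_variance(results))          # variance n (divide by n)
print(list_variance(results, ddof=1))  # variance n-1 (divide by n-1)
print(list_std(results, ddof=1))       # standard deviation for the n-1 case
```

The printed values should agree with the numpy results quoted above, up to floating-point rounding in the last digits.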
Answer by Serge Ballesta
Well, there are two ways of defining the variance. You have the variance n, which you use when you have a full set, and the variance n-1, which you use when you have a sample.
The difference between the two is whether the value m = sum(xi) / n is the real average or just an approximation of what the average should be.
Example 1: you want to know the average height of the students in a class and its variance. Here the value m = sum(xi) / n is the real average, and the formulas given by Cleb are fine (variance n).
Example 2: you want to know the average hour at which a bus passes the bus stop and its variance. You note the hour for a month and get 30 values. Here the value m = sum(xi) / n is only an approximation of the real average, and that approximation becomes more accurate with more values. In that case the best approximation of the actual variance is the variance n-1:
varRes = sum([(xi - m)**2 for xi in results]) / (len(results) -1)
Ok, it has nothing to do with Python, but it does have an impact on statistical analysis, and the question is tagged statistics and variance.
Note: libraries differ in which variance they compute by default, so check the documentation of the function you call. numpy uses the variance n by default (ddof=0) for both var and std, while Python's statistics module provides separate functions for each convention.
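The reason for dividing by n-1 on a sample can be illustrated with a small simulation. This is only a sketch under stated assumptions: the helper names, the normal distribution with standard deviation 2 (true variance 4), the sample size of 5 and the number of trials are all chosen for illustration and are not part of the original answer.

```python
import random

def variance_n(values):
    """Variance n: divide the squared deviations by n."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / len(values)

def variance_n_minus_1(values):
    """Variance n-1: divide the squared deviations by n - 1."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / (len(values) - 1)

random.seed(0)       # fixed seed so the run is repeatable
sample_size = 5      # small samples make the effect easy to see
trials = 20000

total_n = 0.0
total_n_minus_1 = 0.0
for _ in range(trials):
    # draw a small sample from a distribution whose true variance is 4
    sample = [random.gauss(0, 2) for _ in range(sample_size)]
    total_n += variance_n(sample)
    total_n_minus_1 += variance_n_minus_1(sample)

avg_n = total_n / trials
avg_n_minus_1 = total_n_minus_1 / trials
print("average variance n:  ", avg_n)
print("average variance n-1:", avg_n_minus_1)
```

Averaged over many samples, the variance n should come out near 4 * (5 - 1) / 5 = 3.2, while the variance n-1 should come out near the true value of 4.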
Answer by roadrunner66
Numpy is indeed the most elegant and fast way to do it.
I think the actual question was about how to access the individual elements of a list to do such a calculation yourself, so below is an example:
import numpy as np

results = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
           0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]

# printed values are rounded to 10 decimal places for display only
print('numpy variance:', round(float(np.var(results)), 10))

# without numpy by hand
# there are two ways of calculating the variance
# - 1. direct as central 2nd order moment (https://en.wikipedia.org/wiki/Moment_(mathematics)) divided by the length of the vector
# - 2. "mean of square minus square of mean" (see https://en.wikipedia.org/wiki/Variance)

# calculate mean
n = len(results)
total = 0  # named total so the built-in sum() is not shadowed
for i in range(n):
    total = total + results[i]
mean = total / n
print('mean:', round(mean, 10))

# calculate the central moment
sum2 = 0
for i in range(n):
    sum2 = sum2 + (results[i] - mean) ** 2
myvar1 = sum2 / n
print('my variance1:', round(myvar1, 10))

# calculate the mean of square minus square of mean
sum3 = 0
for i in range(n):
    sum3 = sum3 + results[i] ** 2
myvar2 = sum3 / n - mean ** 2
print('my variance2:', round(myvar2, 10))
gives you:
numpy variance: 28.8223642606
mean: -3.731599805
my variance1: 28.8223642606
my variance2: 28.8223642606
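The two formulas agree on this data, but the "mean of square minus square of mean" form subtracts two large, nearly equal numbers and can lose precision when the mean is large compared with the spread. The sketch below illustrates this on made-up data; the function names and the shift of 1e9 are assumptions chosen for the demonstration.

```python
def variance_two_pass(values):
    """Central moment: average squared deviation from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

def variance_mean_of_squares(values):
    """Mean of squares minus square of mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(x ** 2 for x in values) / n - mean ** 2

small = [4.0, 7.0, 13.0, 16.0]
shifted = [x + 1e9 for x in small]  # same spread, much larger mean

print(variance_two_pass(small), variance_mean_of_squares(small))
print(variance_two_pass(shifted), variance_mean_of_squares(shifted))
```

Shifting every value by a constant does not change the variance, so both lines would ideally print 22.5 twice. The squares of the shifted values exceed what a double can represent exactly, so the second formula is likely to drift on the shifted data, while the central-moment version stays at 22.5.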
Answer by Xavier Guihot
Starting with Python 3.4, the standard library comes with the variance function (sample variance or variance n-1) as part of the statistics module:
from statistics import variance
data = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439, 0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]
variance(data)
# 32.024849178421285
The population variance (or variance n) can be obtained using the pvariance function:
from statistics import pvariance
data = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439, 0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]
pvariance(data)
# 28.822364260579157
Also note that if you already know the mean of your list, the variance and pvariance functions take a second argument (respectively xbar and mu) in order to spare recomputing the mean of the sample (which is part of the variance computation).
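A short sketch of passing a precomputed mean as the second argument might look like this. The variable names are illustrative; note that the statistics documentation warns that passing a value other than the actual mean can give invalid results.

```python
from statistics import mean, variance, pvariance

data = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
        0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]

m = mean(data)  # computed once, then reused below

sample_var = variance(data, m)       # second argument is xbar
population_var = pvariance(data, m)  # second argument is mu

print(sample_var)
print(population_var)
```

The two printed values should match the results of variance(data) and pvariance(data) shown above.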
Answer by Mark Lakata
The correct answer is to use one of the packages like NumPy, but if you want to roll your own, and you want to do it incrementally, there is a good algorithm that has higher accuracy. See this link: https://www.johndcook.com/blog/standard_deviation/
I ported my Perl implementation to Python. Please point out issues in the comments.
Mklast = 0
Mk = 0
Sk = 0
k = 0
for xi in results:
    k = k + 1
    Mk = Mklast + (xi - Mklast) / k
    Sk = Sk + (xi - Mklast) * (xi - Mk)
    Mklast = Mk
var = Sk / (k - 1)
print(round(var, 10))  # rounded to 10 decimal places for display
The answer is
>>> print(round(var, 10))
32.0248491784
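Because the update only needs the previous mean and the running sum, the same recurrence can report a variance after every new value. The generator below is a sketch of that idea; the name running_variance is hypothetical, and it yields the sample variance (n-1) starting from the second value, since one value alone has no sample variance.

```python
def running_variance(values):
    """Yield the sample variance (n-1) after each value, from the second on."""
    mean = 0.0
    sq_dev_sum = 0.0  # running sum of squared deviations (Sk in the code above)
    for k, x in enumerate(values, start=1):
        previous_mean = mean
        mean = previous_mean + (x - previous_mean) / k
        sq_dev_sum += (x - previous_mean) * (x - mean)
        if k > 1:
            yield sq_dev_sum / (k - 1)

results = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
           0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]

# k is the number of values seen so far
for k, v in enumerate(running_variance(results), start=2):
    print(k, v)
```

The last value printed should match the n-1 variance of the whole list, and the data never has to be held in memory twice or traversed a second time.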
Answer by sim
import numpy as np

def get_variance(xs):
    mean = np.mean(xs)
    summed = 0
    for x in xs:
        summed += (x - mean)**2
    return summed / (len(xs))

print(get_variance([1,2,3,4,5]))
Output: 2.0
a = [1,2,3,4,5]
variance = np.var(a, ddof=1)
print(variance)
Answer by Shushiro
Without imports, I would use the following python3 script:
#!/usr/bin/env python3

def createData():
    data1=[12,54,60,3,15,6,36]
    data2=[1,2,3,4,5]
    data3=[100,30000,1567,3467,20000,23457,400,1,15]

    dataset=[]
    dataset.append(data1)
    dataset.append(data2)
    dataset.append(data3)

    return dataset

def calculateMean(data):
    means=[]
    # one list of the nested list
    for oneDataset in data:
        sum=0
        mean=0
        # one datapoint in one inner list
        for number in oneDataset:
            # summing up
            sum+=number
        # mean for one inner list
        mean=sum/len(oneDataset)
        # adding a tuple of the original data and their mean to
        # a list of tuples
        item=(oneDataset, mean)
        means.append(item)

    return means

# to do: subtract mean from each element and square the result
# sum up the square results and divide by number of elements
def calculateVariance(meanData):
    variances=[]
    # meanData is the list of tuples
    # pair is one tuple
    for pair in meanData:
        # pair[0] is the original data
        interResult=0
        squareSum=0
        for element in pair[0]:
            interResult=(element-pair[1])**2
            squareSum+=interResult
        variance=squareSum/len(pair[0])
        variances.append((pair[0], pair[1], variance))

    return variances

def main():
    my_data=createData()
    my_means=calculateMean(my_data)
    my_variances=calculateVariance(my_means)
    print(my_variances)

if __name__ == "__main__":
    main()
Here you get a printout of the original data, their mean and the variance. I know this approach covers a list of several datasets, yet I think you can adapt it quickly for your purpose ;)