Original question: http://stackoverflow.com/questions/45483890/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How to correctly use scipy's skew and kurtosis functions?
Asked by Alf
The skewness is a parameter that measures the symmetry of a data set, and the kurtosis measures how heavy its tails are compared to a normal distribution, see for example here.
scipy.stats provides an easy way to calculate these two quantities, see scipy.stats.kurtosis and scipy.stats.skew.
In my understanding, the skewness and kurtosis of a normal distribution should both be 0 using the functions just mentioned. That is, however, not the case with my code:
import numpy as np
from scipy.stats import kurtosis
from scipy.stats import skew
x = np.linspace( -5, 5, 1000 )
y = 1./(np.sqrt(2.*np.pi)) * np.exp( -.5*(x)**2 ) # normal distribution
print( 'excess kurtosis of normal distribution (should be 0): {}'.format( kurtosis(y) ))
print( 'skewness of normal distribution (should be 0): {}'.format( skew(y) ))
The output is:
excess kurtosis of normal distribution (should be 0): -0.307393087742
skewness of normal distribution (should be 0): 1.11082371392
What am I doing wrong?
The versions I am using are
python: 2.7.6
scipy : 0.17.1
numpy : 1.12.1
Answered by MSeifert
These functions calculate moments of the probability density distribution (that's why they take only one parameter) and don't care about the "functional form" of the values.
These are meant for "random datasets" (think of them as measures like mean, standard deviation, variance):
import numpy as np
from scipy.stats import kurtosis, skew
x = np.random.normal(0, 2, 10000) # create random values based on a normal distribution
print( 'excess kurtosis of normal distribution (should be 0): {}'.format( kurtosis(x) ))
print( 'skewness of normal distribution (should be 0): {}'.format( skew(x) ))
which gives:
excess kurtosis of normal distribution (should be 0): -0.024291887786943356
skewness of normal distribution (should be 0): 0.009666157036010928
Increasing the number of random values improves the accuracy:
x = np.random.normal(0, 2, 10000000)
Leading to:
excess kurtosis of normal distribution (should be 0): -0.00010309478605163847
skewness of normal distribution (should be 0): -0.0006751744848755031
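The trend above can be checked across several sample sizes in one run. The following is a minimal sketch, not part of the original answer: the seed, the sample sizes, and the `results` dictionary are arbitrary choices for illustration. The printed standard errors use the large-sample approximations sqrt(6/n) for skewness and sqrt(24/n) for excess kurtosis, which hold for normally distributed data.

```python
import numpy as np
from scipy.stats import kurtosis, skew

# Fixed seed (arbitrary choice) so the run is repeatable.
rng = np.random.RandomState(0)

results = {}
for n in (100, 10000, 1000000):
    sample = rng.normal(0, 2, n)  # n random values from a normal distribution
    results[n] = (skew(sample), kurtosis(sample))
    # Approximate standard errors of the estimates for normal data.
    se_skew = np.sqrt(6.0 / n)
    se_kurt = np.sqrt(24.0 / n)
    print('n = {:>7}: skew = {: .4f} (std. error ~ {:.4f}), '
          'excess kurtosis = {: .4f} (std. error ~ {:.4f})'.format(
              n, results[n][0], se_skew, results[n][1], se_kurt))
```

The estimates should scatter around 0 with a spread comparable to the printed standard errors, so each 100-fold increase in sample size tightens them by roughly a factor of 10.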
In your case the function "assumes" that each value has the same "probability" (because the values are equally distributed and each value occurs only once), so from the point of view of skew and kurtosis it's dealing with a non-Gaussian probability density (not sure what exactly this is), which explains why the resulting values aren't even close to 0:
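import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import kurtosis, skew
x_random = np.random.normal(0, 2, 10000) # random samples drawn from a normal distribution
x = np.linspace( -5, 5, 10000 )
y = 1./(np.sqrt(2.*np.pi)) * np.exp( -.5*(x)**2 ) # normal distribution (density values, not samples)
# compare the histogram of the samples with the histogram of the density values
f, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(x_random, bins='auto')
ax1.set_title('probability density (random)')
ax2.hist(y, bins='auto')
ax2.set_title('(your dataset)')
plt.tight_layout()
plt.show() # display the figure when run as a script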
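If the goal is to get the skewness and kurtosis of a density that is only available as values on a grid (as in the question), one option is to treat the density values as weights and integrate the moments numerically. This is a sketch under stated assumptions, not something from the original answer: the helper `weighted_moment` is a hypothetical name, and a plain Riemann sum on the evenly spaced grid stands in for a proper quadrature rule.

```python
import numpy as np

# Evaluate the standard normal density on an evenly spaced grid,
# as in the question.
x = np.linspace(-5, 5, 1000)
pdf = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)
dx = x[1] - x[0]


def weighted_moment(values, weights, step):
    """Riemann-sum approximation of the integral of values * weights."""
    return np.sum(values * weights) * step


# Normalise explicitly because the grid truncates the tails at +/- 5.
norm = weighted_moment(np.ones_like(x), pdf, dx)
mean = weighted_moment(x, pdf, dx) / norm
var = weighted_moment((x - mean) ** 2, pdf, dx) / norm

# Standardised third and fourth central moments.
skewness = weighted_moment((x - mean) ** 3, pdf, dx) / norm / var**1.5
excess_kurtosis = weighted_moment((x - mean) ** 4, pdf, dx) / norm / var**2 - 3.0

print('skewness from the density: {}'.format(skewness))
print('excess kurtosis from the density: {}'.format(excess_kurtosis))
```

Both values come out close to 0. The small remaining deviation in the excess kurtosis comes from cutting the tails off at -5 and 5.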
Answered by Juan Leni
You are using as data the "shape" of the density function. These functions are meant to be used with data sampled from a distribution. If you sample from the distribution, you will obtain sample statistics that will approach the correct value as you increase the sample size. To plot the data, I would recommend a histogram.
# %matplotlib inline (IPython/Jupyter magic; uncomment when running in a notebook)
import numpy as np
from scipy.stats import kurtosis
from scipy.stats import skew
import matplotlib.pyplot as plt
plt.style.use('ggplot')
data = np.random.normal(0, 1, 10000000) # 10 million samples from a standard normal distribution
plt.hist(data, bins=60)
print("mean : ", np.mean(data))
print("var : ", np.var(data))
print("skew : ",skew(data))
print("kurt : ",kurtosis(data))
plt.show() # display the histogram when run as a script
Output:
mean : 0.000410213500847
var : 0.999827716979
skew : 0.00012294118186476907
kurt : 0.0033554829466604374
Unless you are dealing with an analytical expression, it is extremely unlikely that you will obtain a zero when using data.
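For the analytical case, scipy.stats distribution objects can report their theoretical moments directly, with no sampling involved. A minimal sketch, assuming the same normal distribution as in the first answer (mean 0, standard deviation 2); the variable names are chosen here for illustration:

```python
from scipy.stats import norm

# Theoretical mean, variance, skewness and excess kurtosis
# ('mvsk') of a normal distribution with loc=0 and scale=2.
mean, var, skewness, excess_kurtosis = norm.stats(loc=0, scale=2, moments='mvsk')

print('mean : {}'.format(float(mean)))
print('var : {}'.format(float(var)))
print('skew : {}'.format(float(skewness)))
print('kurt : {}'.format(float(excess_kurtosis)))
```

Here the skewness and excess kurtosis are exactly 0, because they come from the closed-form expressions for the distribution, not from data.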