Python: How to correctly use scipy's skew and kurtosis functions?

Question

Asked by Alf

The skewness is a parameter that measures the symmetry of a data set, and the kurtosis measures how heavy its tails are compared to a normal distribution.

scipy.stats provides an easy way to calculate these two quantities, see scipy.stats.kurtosis and scipy.stats.skew.

In my understanding, the skewness and kurtosis of a normal distribution should both be 0 using the functions just mentioned. That is, however, not the case with my code:

import numpy as np
from scipy.stats import kurtosis
from scipy.stats import skew

x = np.linspace( -5, 5, 1000 )
y = 1./(np.sqrt(2.*np.pi)) * np.exp( -.5*(x)**2  )  # normal distribution

print( 'excess kurtosis of normal distribution (should be 0): {}'.format( kurtosis(y) ))
print( 'skewness of normal distribution (should be 0): {}'.format( skew(y) ))

The output is:

excess kurtosis of normal distribution (should be 0): -0.307393087742
skewness of normal distribution (should be 0): 1.11082371392

What am I doing wrong?

The versions I am using are:

python: 2.7.6
scipy : 0.17.1
numpy : 1.12.1

Answer 1

Answered by MSeifert

These functions calculate moments of the probability density distribution of the values you pass in (that's why they take only one array argument) and don't care about the "functional form" of those values.

These are meant for "random datasets" (think of them as measures like mean, standard deviation, variance):

import numpy as np
from scipy.stats import kurtosis, skew

x = np.random.normal(0, 2, 10000)   # create random values based on a normal distribution

print( 'excess kurtosis of normal distribution (should be 0): {}'.format( kurtosis(x) ))
print( 'skewness of normal distribution (should be 0): {}'.format( skew(x) ))

which gives:

excess kurtosis of normal distribution (should be 0): -0.024291887786943356
skewness of normal distribution (should be 0): 0.009666157036010928
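
A side note on the "should be 0" part: scipy.stats.kurtosis returns the excess (Fisher) kurtosis by default, and its fisher=False option returns the Pearson definition, for which a normal distribution gives about 3. A minimal sketch of the difference, where the seed and the sample size are arbitrary choices made only for this illustration:

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.RandomState(0)        # fixed seed, chosen only so the run is repeatable
x = rng.normal(0, 2, 100000)          # random values based on a normal distribution

excess = kurtosis(x)                  # fisher=True is the default: about 0 for normal data
pearson = kurtosis(x, fisher=False)   # Pearson definition: about 3 for normal data

print('excess kurtosis : {}'.format(excess))
print('Pearson kurtosis: {}'.format(pearson))
print('difference      : {}'.format(pearson - excess))
```

The two values differ by exactly 3, so it is worth checking which definition a reference value uses before comparing against it.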

Increasing the number of random values improves the accuracy:

x = np.random.normal(0, 2, 10000000)

Leading to:

excess kurtosis of normal distribution (should be 0): -0.00010309478605163847
skewness of normal distribution (should be 0): -0.0006751744848755031
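
To see the trend rather than two isolated runs, the same experiment can be repeated over several sample sizes. This is only a sketch: the seed and the sizes are arbitrary, and the sqrt(6/n) and sqrt(24/n) figures are the usual large-sample rules of thumb for the standard errors of skewness and excess kurtosis of normal data, not something scipy reports.

```python
import numpy as np
from scipy.stats import kurtosis, skew

rng = np.random.RandomState(42)   # arbitrary seed so the run is repeatable

results = {}
for n in (100, 10000, 1000000):
    sample = rng.normal(0, 2, n)
    results[n] = (skew(sample), kurtosis(sample))
    # rough large-sample standard errors for normal data (rule of thumb)
    se_skew = np.sqrt(6.0 / n)
    se_kurt = np.sqrt(24.0 / n)
    print('n={:>8d}  skew={:+.4f} (~+/-{:.4f})  kurtosis={:+.4f} (~+/-{:.4f})'.format(
        n, results[n][0], se_skew, results[n][1], se_kurt))
```

Both statistics scatter around 0 with a spread that shrinks roughly like 1/sqrt(n), which is why a hundred times more values buys only about one more decimal digit.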

In your case the function "assumes" that each value has the same "probability" (because the x values are equally spaced and each resulting y value occurs only once), so from the point of view of skew and kurtosis it's dealing with a non-Gaussian probability density (not sure what exactly this is), which explains why the resulting values aren't even close to 0:

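import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import kurtosis, skew

# values drawn at random from a normal distribution
x_random = np.random.normal(0, 2, 10000)

# values of the density curve on an evenly spaced grid (the data set from the question)
x = np.linspace( -5, 5, 10000 )
y = 1./(np.sqrt(2.*np.pi)) * np.exp( -.5*(x)**2  )  # normal distribution

# histogram of each data set: this is the distribution that skew and kurtosis "see"
f, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(x_random, bins='auto')
ax1.set_title('probability density (random)')
ax2.hist(y, bins='auto')
ax2.set_title('(your dataset)')
plt.tight_layout()
plt.show()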
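
If you do want to start from the density curve itself, one option is to use the curve values as weights on x instead of passing them to skew and kurtosis as data. The following is a rough sketch of that idea using plain weighted sums in numpy (scipy's skew and kurtosis take no weights argument, so they are not used here); the variable names are made up for this example.

```python
import numpy as np

x = np.linspace(-5, 5, 10000)
pdf = 1. / np.sqrt(2. * np.pi) * np.exp(-.5 * x**2)   # same curve as y in the question

w = pdf / pdf.sum()                     # normalise the grid values into weights
mean = np.sum(w * x)                    # first moment
var = np.sum(w * (x - mean)**2)         # second central moment
skewness = np.sum(w * (x - mean)**3) / var**1.5
excess_kurtosis = np.sum(w * (x - mean)**4) / var**2 - 3.0

print('skewness from density        : {}'.format(skewness))
print('excess kurtosis from density : {}'.format(excess_kurtosis))
```

Both results come out close to 0. A small residual remains in the kurtosis because the grid is cut off at -5 and 5, so the far tails are missing.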

Answer 2

Answered by Juan Leni

You are using as data the "shape" of the density function. These functions are meant to be used with data sampled from a distribution. If you sample from the distribution, you will obtain sample statistics that will approach the correct value as you increase the sample size. To plot the data, I would recommend a histogram.

%matplotlib inline
import numpy as np
from scipy.stats import kurtosis
from scipy.stats import skew

import matplotlib.pyplot as plt

plt.style.use('ggplot')

# sample from a standard normal distribution
data = np.random.normal(0, 1, 10000000)

plt.hist(data, bins=60)

print("mean : ", np.mean(data))
print("var  : ", np.var(data))
print("skew : ",skew(data))
print("kurt : ",kurtosis(data))

Output:

mean :  0.000410213500847
var  :  0.999827716979
skew :  0.00012294118186476907
kurt :  0.0033554829466604374

Unless you are dealing with an analytical expression, it is extremely unlikely that you will obtain exactly zero when using sampled data.
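
For the analytical case, scipy's distribution objects can report the theoretical moments directly, with no sampling involved. A short sketch using scipy.stats.norm.stats, where the 'k' entry is the excess (Fisher) kurtosis:

```python
from scipy.stats import norm

# theoretical mean, variance, skewness and excess kurtosis of the standard normal
mean, var, skewness, excess_kurtosis = norm.stats(loc=0, scale=1, moments='mvsk')

print("mean : ", float(mean))
print("var  : ", float(var))
print("skew : ", float(skewness))
print("kurt : ", float(excess_kurtosis))
```

Here the skewness and kurtosis are exactly 0, because they come from the closed-form expression rather than from data.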
