Original question: http://stackoverflow.com/questions/31248913/
This content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
How to estimate density function and calculate its peaks?
Asked by Yasmin
I have started using Python for data analysis. I would like to do the following:
- Get the distribution of the dataset
- Get the peaks of this distribution
I used gaussian_kde from scipy.stats to estimate the kernel density function. Does gaussian_kde make any assumptions about the data? My data change over time, so if they follow one distribution now (e.g. Gaussian), they could follow a different one later. Does gaussian_kde have any drawbacks in this scenario? Another question suggested fitting the data against every candidate distribution in order to find the distribution of the data. What is the difference between using gaussian_kde and the approach in that answer? I used the code below. Is gaussian_kde a good way to estimate the pdf when the data change over time? I know one advantage of gaussian_kde is that it chooses the bandwidth automatically by a rule of thumb. Also, how can I get the peaks of the estimated density?
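import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# A raw string keeps the backslash in the Windows path literal
df = pd.read_csv(r'D:\dataset.csv')
# gaussian_kde expects a 1-D array; the first column is assumed to hold the data
data_column = df.iloc[:, 0].dropna().to_numpy()
pdf = gaussian_kde(data_column)
# Evaluate the estimated density on a grid slightly wider than the data range
x = np.linspace(data_column.min() - 1, data_column.max() + 1, len(data_column))
y = pdf(x)
plt.plot(x, y, color='r')
plt.hist(data_column, density=True)  # density=True replaces the removed normed=True
plt.show(block=True)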
Answered by Jianxun Li
I think you need to distinguish a non-parametric density estimate (the one implemented by gaussian_kde in scipy.stats) from a parametric one (the approach in the Stack Overflow question you mention). To illustrate the difference between the two, try the following code.
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt

# Simulate a mixture of two Gaussians
np.random.seed(0)
gaussian1 = -6 + 3 * np.random.randn(1700)
gaussian2 = 4 + 1.5 * np.random.randn(300)
gaussian_mixture = np.hstack([gaussian1, gaussian2])
df = pd.DataFrame(gaussian_mixture, columns=['data'])
values = df['data'].to_numpy()

# non-parametric pdf: kernel density estimate evaluated on a grid
kde = stats.gaussian_kde(values)
x = np.linspace(-20, 10, 200)
nparam_density = kde(x)

# parametric fit: assume normal distribution
loc_param, scale_param = stats.norm.fit(values)
param_density = stats.norm.pdf(x, loc=loc_param, scale=scale_param)

fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(values, bins=30, density=True)  # density=True replaces the removed normed=True
ax.plot(x, nparam_density, 'r-', label='non-parametric density (smoothed by Gaussian kernel)')
ax.plot(x, param_density, 'k--', label='parametric density')
ax.set_ylim([0, 0.15])
ax.legend(loc='best')
plt.show()


From the graph, we see that the non-parametric density is nothing but a smoothed version of the histogram. In a histogram, a particular observation x = x0 is represented by a bar (all of its probability mass is placed at that single point x = x0 and zero elsewhere), whereas in non-parametric density estimation it is represented by a bell-shaped curve (the Gaussian kernel) that spreads its mass over the neighbourhood of x0. Adding up these curves gives a smooth density curve. This internal Gaussian kernel has nothing to do with any distributional assumption about the underlying data x; its sole purpose is smoothing.
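The "one bell-shaped curve per observation" idea can be checked by hand. The sketch below is only an illustration, not part of the original answer: it rebuilds the estimate as an average of Gaussian bumps and compares it with gaussian_kde. It assumes the default bandwidth rule, where each kernel's standard deviation is kde.factor times the sample standard deviation; the names data, h, and manual_density are made up for this example.

```python
import numpy as np
from scipy import stats

# Illustrative two-component sample (same shape as the mixture in the answer)
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-6, 3, 1700), rng.normal(4, 1.5, 300)])
x = np.linspace(-20, 10, 200)

kde = stats.gaussian_kde(data)

# Kernel standard deviation: bandwidth factor times the sample std (ddof=1)
h = kde.factor * data.std(ddof=1)

# One Gaussian bump centred on every observation, averaged over all observations
bumps = stats.norm.pdf(x[:, None], loc=data[None, :], scale=h)
manual_density = bumps.mean(axis=1)

# The hand-built curve should agree with gaussian_kde up to floating-point error
print(np.allclose(manual_density, kde(x)))
```

Nothing in this construction refers to the shape of the data's distribution; the Gaussian only controls how each observation is smeared out.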
To get the mode of a non-parametric density, we need to do an exhaustive search, because the density is not guaranteed to be unimodal. As the example above shows, if a quasi-Newton optimization algorithm starts somewhere in [5, 10], it is very likely to end up at a local optimum rather than the global one.
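The risk of getting stuck can be illustrated without a real optimizer. The sketch below is a simplification and not the quasi-Newton method itself: hill_climb is a hypothetical helper that walks uphill on the evaluation grid from a chosen start, which is enough to show how a purely local search that begins inside [5, 10] settles on the smaller bump.

```python
import numpy as np
from scipy import stats

# Same simulated mixture and grid as in the answer's example
np.random.seed(0)
data = np.hstack([-6 + 3 * np.random.randn(1700), 4 + 1.5 * np.random.randn(300)])
x = np.linspace(-20, 10, 200)
density = stats.gaussian_kde(data)(x)

def hill_climb(values, start):
    """Move to the higher neighbouring grid point until no neighbour is higher."""
    i = int(start)
    while True:
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(values)]
        best = max(neighbours, key=lambda j: values[j])
        if values[best] <= values[i]:
            return i
        i = best

start = np.argmin(np.abs(x - 7.0))           # grid point closest to 7, inside [5, 10]
local_mode = x[hill_climb(density, start)]   # where the local search stops
global_mode = x[np.argmax(density)]          # exhaustive search over the grid
print(local_mode, global_mode)
```

With this sample the local search should stop near the minor component around 4, while the grid search finds the main mode near -6.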
# get mode: exhaustive search over the evaluation grid
mode = x[np.argmax(nparam_density)]
print(mode)
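The grid search above returns only the single highest point. Since the question asks for the peaks (plural), one possible extension, which is not part of the original answer, is to run scipy.signal.find_peaks over the same gridded density so that every local maximum is reported. The prominence threshold of 5% of the maximum density is an arbitrary choice made for this sketch to suppress tiny bumps in the tails; it would need tuning for real data.

```python
import numpy as np
from scipy import stats
from scipy.signal import find_peaks

# Same simulated mixture and grid as in the answer's example
np.random.seed(0)
data = np.hstack([-6 + 3 * np.random.randn(1700), 4 + 1.5 * np.random.randn(300)])
x = np.linspace(-20, 10, 200)
density = stats.gaussian_kde(data)(x)

# Indices of local maxima on the grid; the prominence cut-off is an assumption
peak_idx, _ = find_peaks(density, prominence=0.05 * density.max())
peak_locations = x[peak_idx]

for loc, height in zip(peak_locations, density[peak_idx]):
    print(f"peak at x = {loc:.2f}, density = {height:.4f}")
```

The accuracy of each reported location is limited by the grid spacing, so a finer np.linspace grid gives finer peak positions at the cost of more density evaluations.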

