Calculate the Cumulative Distribution Function (CDF) in Python

Notice: this page is a side-by-side Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverFlow original: http://stackoverflow.com/questions/24788200/

Date: 2020-08-19 05:11:25  Source: igfitidea

Tags: python, numpy, machine-learning, statistics, scipy

Asked by wizbcn

How can I calculate the cumulative distribution function (CDF) in Python?

I want to calculate it from an array of points I have (discrete distribution), not with the continuous distributions that, for example, scipy has.

Accepted answer by DrV

(It is possible that my interpretation of the question is wrong. If the question is how to get from a discrete PDF to a discrete CDF, then np.cumsum divided by a suitable constant will do if the samples are equispaced. If the array is not equispaced, then np.cumsum of the array multiplied by the distances between the points will do.)
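
A minimal sketch of both cases, assuming a hypothetical PDF sampled from a standard normal density on a grid. The grid, the variable names, and the choice of the sum of the PDF values as the "suitable constant" are illustrative assumptions, not part of the original answer:

```python
import numpy as np

# Hypothetical discrete PDF sampled on an equispaced grid.
x = np.linspace(-4.0, 4.0, 801)
pdf = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

# Equispaced case: cumulative sum divided by a constant (here the total sum,
# so the result ends at 1).
cdf_equal = np.cumsum(pdf) / np.sum(pdf)

# Non-equispaced case: weight each PDF value by the distance to the previous
# point before accumulating (the first point gets width 0).
x_uneven = np.sort(np.random.default_rng(0).uniform(-4.0, 4.0, 2000))
pdf_uneven = np.exp(-0.5 * x_uneven**2) / np.sqrt(2.0 * np.pi)
widths = np.diff(x_uneven, prepend=x_uneven[0])
cdf_uneven = np.cumsum(pdf_uneven * widths)
```

In the non-equispaced case the last value is only approximately 1, because the sum is a rectangle-rule estimate of the integral over the sampled range.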

If you have a discrete array of samples, and you would like to know the CDF of the sample, then you can just sort the array. If you look at the sorted result, you'll realize that the smallest value represents 0 %, and the largest value represents 100 %. If you want to know the value at 50 % of the distribution, just look at the array element which is in the middle of the sorted array.
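
The idea can be sketched as follows, assuming a hypothetical sample of 1001 normally distributed values. The helper name ecdf_at is made up for illustration, and it uses the "fraction of samples less than or equal to x" convention, so the smallest sample maps to 1/n rather than exactly 0 %:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=1001)   # hypothetical sample, odd length
data_sorted = np.sort(data)

# Value at 50 % of the distribution: the middle element of the sorted array.
median_estimate = data_sorted[len(data_sorted) // 2]

def ecdf_at(x):
    """Fraction of samples less than or equal to x (hypothetical helper)."""
    return np.searchsorted(data_sorted, x, side="right") / len(data_sorted)

print(median_estimate, ecdf_at(0.0))
```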

Let us have a closer look at this with a simple example:

import matplotlib.pyplot as plt
import numpy as np

# create some randomly distributed data:
data = np.random.randn(10000)

# sort the data:
data_sorted = np.sort(data)

# calculate the proportional values of samples
p = 1. * np.arange(len(data)) / (len(data) - 1)

# plot the sorted data:
fig = plt.figure()
ax1 = fig.add_subplot(121)
ax1.plot(p, data_sorted)
ax1.set_xlabel('$p$')
ax1.set_ylabel('$x$')

ax2 = fig.add_subplot(122)
ax2.plot(data_sorted, p)
ax2.set_xlabel('$x$')
ax2.set_ylabel('$p$')

# display the figure
plt.show()

This gives the following plot, where the right-hand-side plot is the traditional cumulative distribution function. It should reflect the CDF of the process behind the points, but naturally it is not exact as long as the number of points is finite.

[Figure: cumulative distribution function]

This function is easy to invert, and it depends on your application which form you need.
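
As a rough sketch of the two forms, assuming the same sorted-sample construction as above: linear interpolation with np.interp gives the CDF when x is the input, and its inverse (the quantile function) when p is the input. The function names cdf and quantile are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
data_sorted = np.sort(rng.normal(size=10000))
p = np.arange(len(data_sorted)) / (len(data_sorted) - 1)

def cdf(x):
    """Empirical CDF: interpolate p as a function of x."""
    return np.interp(x, data_sorted, p)

def quantile(q):
    """Inverse CDF: interpolate x as a function of p."""
    return np.interp(q, p, data_sorted)

print(cdf(0.0), quantile(0.5))
```

With this choice of p, quantile should agree with np.quantile under its default linear interpolation.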

Answer by PyRsquared

Assuming you know how your data is distributed (i.e. you know the pdf of your data), then scipy does support discrete data when calculating CDFs:

import numpy as np
import scipy.stats  # import the stats submodule explicitly
import matplotlib.pyplot as plt
import seaborn as sns

x = np.random.randn(10000) # generate samples from normal distribution (discrete data)
norm_cdf = scipy.stats.norm.cdf(x) # calculate the cdf - also discrete

# plot the cdf
sns.lineplot(x=x, y=norm_cdf)
plt.show()

We can even print the first few values of the cdf to show they are discrete:

print(norm_cdf[:10])
>>> array([0.39216484, 0.09554546, 0.71268696, 0.5007396 , 0.76484329,
       0.37920836, 0.86010018, 0.9191937 , 0.46374527, 0.4576634 ])

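The same call also accepts multi-dimensional arrays, which we illustrate with 2d data below. Note that scipy.stats.norm.cdf is applied elementwise, so the result is the univariate standard normal CDF of each coordinate, not the joint CDF of the 2d distribution: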

mu = np.zeros(2) # mean vector
cov = np.array([[1,0.6],[0.6,1]]) # covariance matrix
# generate 2d normally distributed samples using 0 mean and the covariance matrix above
x = np.random.multivariate_normal(mean=mu, cov=cov, size=1000) # 1000 samples
norm_cdf = scipy.stats.norm.cdf(x)
print(norm_cdf.shape)
>>> (1000, 2)
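
For comparison, here is a hedged sketch of the joint CDF for the same kind of data, using scipy.stats.multivariate_normal. This is an addition for illustration, not part of the original answer; the sample size is reduced to 200 only to keep the numerical integration quick:

```python
import numpy as np
from scipy import stats

mu = np.zeros(2)                           # mean vector
cov = np.array([[1.0, 0.6], [0.6, 1.0]])   # covariance matrix
x = np.random.default_rng(0).multivariate_normal(mean=mu, cov=cov, size=200)

# Elementwise standard normal CDF of each coordinate: shape (200, 2).
marginal_cdf = stats.norm.cdf(x)

# Joint CDF P(X1 <= a, X2 <= b) evaluated at each sample: shape (200,).
mvn = stats.multivariate_normal(mean=mu, cov=cov)
joint_cdf = mvn.cdf(x)

print(marginal_cdf.shape, joint_cdf.shape)
```

The joint value at a point can never exceed either of the two marginal values there, which is a quick sanity check on the output.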

In the above examples, I had prior knowledge that my data was normally distributed, which is why I used scipy.stats.norm(); scipy supports many other distributions. But again, you need to know how your data is distributed beforehand to use such functions. If you don't know how your data is distributed and you just use any distribution to calculate the cdf, you will most likely get incorrect results.
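
A small sketch of what "incorrect results" can look like, assuming hypothetical exponentially distributed data. It compares the empirical CDF of the sample against a matching model (scipy.stats.expon) and a mismatched one (scipy.stats.norm); the variable names are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data_sorted = np.sort(rng.exponential(scale=1.0, size=5000))

# Empirical CDF: fraction of samples less than or equal to each sorted value.
ecdf = np.arange(1, len(data_sorted) + 1) / len(data_sorted)

# Largest gap between the empirical CDF and each candidate model CDF.
gap_expon = np.max(np.abs(ecdf - stats.expon.cdf(data_sorted)))
gap_norm = np.max(np.abs(ecdf - stats.norm.cdf(data_sorted)))

print(f"exponential model: {gap_expon:.3f}, standard normal model: {gap_norm:.3f}")
```

The gap for the mismatched normal model should come out far larger, since the standard normal CDF is already about 0.5 near zero, where almost none of the exponential sample has accumulated yet.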
