Calculating the Cumulative Distribution Function (CDF) in Python

Question

Asked by wizbcn

How can I calculate the cumulative distribution function (CDF) in Python?

I want to calculate it from an array of points I have (a discrete distribution), not from one of the continuous distributions that, for example, SciPy provides.

Answer 1

Accepted answer by DrV

(It is possible that my interpretation of the question is wrong. If the question is how to get from a discrete PDF to a discrete CDF, then np.cumsum divided by a suitable constant will do if the samples are equispaced. If the array is not equispaced, then np.cumsum of the array multiplied by the distances between the points will do.)
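As a rough sketch of the two np.cumsum cases mentioned above: the grids, the helper normal_pdf, and the choice of a left Riemann sum for the uneven case are all assumptions made for illustration, not part of the original answer.

```python
import numpy as np

def normal_pdf(x):
    """Standard normal density, used here only as illustrative PDF values."""
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

# Equispaced grid: cumulative sum divided by a normalising constant.
x_even = np.linspace(-4.0, 4.0, 81)
pdf_even = normal_pdf(x_even)
cdf_even = np.cumsum(pdf_even) / np.sum(pdf_even)

# Non-equispaced grid: weight each PDF value by the distance to the next point
# (a left Riemann sum), then normalise so the last value is 1.
x_uneven = np.array([-4.0, -2.5, -1.5, -1.0, -0.6, -0.2, 0.0, 0.3, 0.8, 1.4, 2.2, 4.0])
pdf_uneven = normal_pdf(x_uneven)
cdf_uneven = np.concatenate(([0.0], np.cumsum(pdf_uneven[:-1] * np.diff(x_uneven))))
cdf_uneven /= cdf_uneven[-1]

print(cdf_even[-1], cdf_uneven[-1])
```

With a coarse or uneven grid the result is only an approximation of the underlying CDF; a trapezoidal weighting would be another reasonable choice.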

If you have a discrete array of samples and you would like to know the CDF of the sample, you can just sort the array. If you look at the sorted result, you'll realize that the smallest value represents 0% and the largest value represents 100%. If you want to know the value at 50% of the distribution, just look at the array element in the middle of the sorted array.

Let us have a closer look at this with a simple example:

import matplotlib.pyplot as plt
import numpy as np

# create some randomly distributed data:
data = np.random.randn(10000)

# sort the data:
data_sorted = np.sort(data)

# calculate the proportional values of samples (0 for the smallest, 1 for the largest)
p = np.arange(len(data)) / (len(data) - 1)

# plot the sorted data against the proportions (the inverse CDF):
fig = plt.figure()
ax1 = fig.add_subplot(121)
ax1.plot(p, data_sorted)
ax1.set_xlabel('$p$')
ax1.set_ylabel('$x$')

# plot the proportions against the sorted data (the CDF):
ax2 = fig.add_subplot(122)
ax2.plot(data_sorted, p)
ax2.set_xlabel('$x$')
ax2.set_ylabel('$p$')

plt.show()

This gives the following plot, where the right-hand plot is the traditional cumulative distribution function. It should reflect the CDF of the process behind the points, but naturally it is not exactly the same as long as the number of points is finite.

[Figure: cumulative distribution function]

This function is easy to invert, and it depends on your application which form you need.
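A minimal sketch of both forms, assuming linear interpolation between the sorted samples is acceptable. The helper names cdf and inverse_cdf and the fixed seed are hypothetical choices for illustration; np.interp is used with the same p grid as in the example.

```python
import numpy as np

# Illustrative data in the same spirit as the example (the seed is arbitrary).
rng = np.random.default_rng(0)
data_sorted = np.sort(rng.standard_normal(10000))
p = np.arange(len(data_sorted)) / (len(data_sorted) - 1)

def cdf(x):
    """Empirical CDF: proportion of the sample at value x (linear interpolation)."""
    return np.interp(x, data_sorted, p)

def inverse_cdf(q):
    """Inverse CDF (quantile function): value x at proportion q."""
    return np.interp(q, p, data_sorted)

median = inverse_cdf(0.5)
print(median, cdf(median))
```

Swapping the two array arguments of np.interp is all it takes to go from one form to the other, which is the sense in which the function is easy to invert.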

Answer 2

Answered by PyRsquared

Assuming you know how your data is distributed (i.e. you know the PDF of your data), then SciPy does support discrete data when calculating CDFs:

import numpy as np
import scipy.stats  # import the submodule explicitly so scipy.stats.norm resolves
import matplotlib.pyplot as plt
import seaborn as sns

x = np.random.randn(10000) # generate samples from normal distribution (discrete data)
norm_cdf = scipy.stats.norm.cdf(x) # calculate the cdf - also discrete

# plot the cdf
sns.lineplot(x=x, y=norm_cdf)
plt.show()

We can even print the first few values of the CDF to show that they are discrete:

print(norm_cdf[:10])
# Example output (values vary from run to run):
# [0.39216484 0.09554546 0.71268696 0.5007396  0.76484329
#  0.37920836 0.86010018 0.9191937  0.46374527 0.4576634 ]

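The same call also accepts multidimensional arrays; we use 2-D data below to illustrate. Note that scipy.stats.norm.cdf is applied element-wise, so the result holds the standard normal CDF of each coordinate (here, the marginal CDFs), not the joint CDF of the 2-D distribution: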

mu = np.zeros(2) # mean vector
cov = np.array([[1,0.6],[0.6,1]]) # covariance matrix
# generate 2d normally distributed samples using 0 mean and the covariance matrix above
x = np.random.multivariate_normal(mean=mu, cov=cov, size=1000) # 1000 samples
norm_cdf = scipy.stats.norm.cdf(x) # evaluated element-wise, one value per coordinate
print(norm_cdf.shape)
# (1000, 2)
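If the joint CDF of the 2-D distribution is what is needed, one option (not part of the original answer) is scipy.stats.multivariate_normal. The sketch below, which assumes the same mean vector and covariance matrix and an arbitrarily chosen evaluation point, contrasts it with the element-wise result:

```python
import numpy as np
from scipy import stats

mu = np.zeros(2)                          # mean vector
cov = np.array([[1.0, 0.6], [0.6, 1.0]])  # covariance matrix
point = np.array([0.0, 0.0])              # illustrative evaluation point

# Joint CDF: P(X1 <= 0 and X2 <= 0) under the correlated bivariate normal.
joint = stats.multivariate_normal.cdf(point, mean=mu, cov=cov)

# Element-wise CDFs: P(X1 <= 0) and P(X2 <= 0), each taken separately.
marginals = stats.norm.cdf(point)

print(joint)      # a single probability
print(marginals)  # one probability per coordinate
```

Because the two coordinates are positively correlated, the joint probability is larger than the product of the two marginal values.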

In the above examples, I had prior knowledge that my data was normally distributed, which is why I used scipy.stats.norm; SciPy supports many other distributions. But again, you need to know how your data is distributed beforehand to use such functions. If you don't know how your data is distributed and you just use an arbitrary distribution to calculate the CDF, you will most likely get incorrect results.
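To illustrate that last point, here is a small sketch under assumed conditions: the samples are drawn from an exponential distribution (an arbitrary choice), and the variable names and the maximum-gap comparison are illustrative only. It compares the empirical CDF of the samples with a wrongly assumed normal CDF and with the matching exponential CDF.

```python
import numpy as np
from scipy import stats

# Illustrative samples from an exponential distribution (seed is arbitrary).
rng = np.random.default_rng(1)
samples_sorted = np.sort(rng.exponential(scale=1.0, size=5000))

# Empirical CDF: fraction of samples less than or equal to each sorted value.
ecdf = np.arange(1, len(samples_sorted) + 1) / len(samples_sorted)

# CDF under a wrongly assumed normal model versus the matching exponential model.
wrong = stats.norm.cdf(samples_sorted, loc=samples_sorted.mean(), scale=samples_sorted.std())
right = stats.expon.cdf(samples_sorted, scale=samples_sorted.mean())

# Largest gap between each model CDF and the empirical CDF.
err_wrong = np.max(np.abs(ecdf - wrong))
err_right = np.max(np.abs(ecdf - right))
print(err_wrong, err_right)
```

The gap for the wrongly assumed model is much larger, which is the kind of incorrect result the paragraph above warns about.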
