Cumulative distribution plots python

Question

Asked by akhiljain

I am doing a project using Python where I have two arrays of data. Let's call them pc and pnc. I am required to plot a cumulative distribution of both of these on the same graph. For pc it is supposed to be a "less than" plot, i.e. at (x,y), y points in pc must have a value less than x. For pnc it is to be a "more than" plot, i.e. at (x,y), y points in pnc must have a value more than x.

I have tried using the histogram function pyplot.hist. Is there a better and easier way to do what I want? Also, it has to be plotted with a logarithmic scale on the x-axis.

Answer 1

Accepted answer by EnricoGiampieri

You were close. You should use numpy.histogram rather than plt.hist: it gives you both the values and the bin edges, and then you can plot the cumulative with ease:

import numpy as np
import matplotlib.pyplot as plt

# some fake data
data = np.random.randn(1000)
# evaluate the histogram
values, base = np.histogram(data, bins=40)
#evaluate the cumulative
cumulative = np.cumsum(values)
# plot the cumulative function
plt.plot(base[:-1], cumulative, c='blue')
#plot the survival function
plt.plot(base[:-1], len(data)-cumulative, c='green')

plt.show()
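The question also asks for a logarithmic x-axis, with a "less than" curve for pc and a "more than" curve for pnc. A minimal sketch of how the histogram approach above could be adapted, assuming all data values are strictly positive; the helper names `binned_cumulative_counts` and `plot_binned_cumulatives` are made up for illustration, and the counts are plotted against bin edges so that each point is the number of values to the left (or right) of that edge:

```python
import numpy as np


def binned_cumulative_counts(data, n_bins=40):
    """Cumulative counts on logarithmically spaced bins (data must be > 0)."""
    data = np.asarray(data, dtype=float)
    # Log-spaced bin edges; pin both ends so that no data point falls outside the bins.
    edges = np.geomspace(data.min(), data.max(), n_bins + 1)
    edges[0], edges[-1] = data.min(), data.max()
    values, _ = np.histogram(data, bins=edges)
    # less_than[i] is the number of points in the bins to the left of edges[i].
    less_than = np.concatenate([[0], np.cumsum(values)])
    # more_than[i] is the number of points in the bins to the right of edges[i].
    more_than = data.size - less_than
    return edges, less_than, more_than


def plot_binned_cumulatives(pc, pnc, n_bins=40):
    # Imported here so that the counting helper above needs only NumPy.
    import matplotlib.pyplot as plt

    pc_edges, pc_less, _ = binned_cumulative_counts(pc, n_bins)
    pnc_edges, _, pnc_more = binned_cumulative_counts(pnc, n_bins)
    plt.plot(pc_edges, pc_less, c='blue', label='pc: less than x')
    plt.plot(pnc_edges, pnc_more, c='green', label='pnc: more than x')
    plt.xscale('log')  # logarithmic x-axis, as requested in the question
    plt.legend()
    plt.show()
```

For example, `plot_binned_cumulatives(np.random.lognormal(size=1000), np.random.lognormal(size=1000))` would draw both curves for some fake positive data. The counts are still only as precise as the bin width allows.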

Answer 2

Answer by Eric O Lebigot

Using histograms is really unnecessarily heavy and imprecise (the binning makes the data fuzzy): you can just sort all the x values: the index of each value is the number of values that are smaller. This shorter and simpler solution looks like this:

import numpy as np
import matplotlib.pyplot as plt

# Some fake data:
data = np.random.randn(1000)

sorted_data = np.sort(data)  # Or data.sort(), if data can be modified

# Cumulative counts:
plt.step(sorted_data, np.arange(sorted_data.size))  # From 0 to the number of data points-1
plt.step(sorted_data[::-1], np.arange(sorted_data.size))  # From the number of data points-1 to 0

plt.show()
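Applied to the question's two arrays, the same idea could look like the sketch below. It assumes pc and pnc contain only positive values (needed for the logarithmic x-axis); the helper names are made up for illustration. np.searchsorted is used for the counts so that repeated values are handled with strict "less than" and "more than" semantics:

```python
import numpy as np


def count_less_than(sorted_values, x):
    """Number of entries of sorted_values that are strictly smaller than x."""
    return np.searchsorted(sorted_values, x, side='left')


def count_more_than(sorted_values, x):
    """Number of entries of sorted_values that are strictly larger than x."""
    return sorted_values.size - np.searchsorted(sorted_values, x, side='right')


def plot_less_and_more(pc, pnc):
    # Imported here so that the counting helpers above need only NumPy.
    import matplotlib.pyplot as plt

    pc_sorted = np.sort(pc)
    pnc_sorted = np.sort(pnc)
    # "Less than" curve: the count rises just after each data point.
    plt.step(pc_sorted, count_less_than(pc_sorted, pc_sorted),
             where='pre', label='pc: values < x')
    # "More than" curve: the count drops at each data point.
    plt.step(pnc_sorted, count_more_than(pnc_sorted, pnc_sorted),
             where='post', label='pnc: values > x')
    plt.xscale('log')  # logarithmic x-axis, as requested in the question
    plt.xlabel('x')
    plt.ylabel('count')
    plt.legend()
    plt.show()
```

Calling `plot_less_and_more(pc, pnc)` draws both curves on the same axes. The two counting helpers can also be evaluated at any x, not only at the data points.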

Furthermore, a more appropriate plot style is indeed plt.step() instead of plt.plot(), since the data is in discrete locations.

The resulting plot is more ragged than the output of EnricoGiampieri's answer, but this one is the real cumulative histogram (instead of being an approximate, fuzzier version of it).

PS: As SebastianRaschka noted, the very last point should ideally show the total count (instead of the total count-1). This can be achieved with:

plt.step(np.concatenate([sorted_data, sorted_data[[-1]]]),
         np.arange(sorted_data.size+1))
plt.step(np.concatenate([sorted_data[::-1], sorted_data[[0]]]),
         np.arange(sorted_data.size+1))

There are so many points in data that the effect is not visible without zooming in, but the very last point at the total count does matter when the data contains only a few points.
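To make the effect of the extra point concrete, here is a small sketch with only three data points; the helper name `cumulative_steps` is made up for illustration:

```python
import numpy as np


def cumulative_steps(data):
    """Return x and y arrays for a cumulative step plot that ends at the total count."""
    sorted_data = np.sort(data)
    # Repeat the largest value once so that the curve can reach the total count.
    x = np.concatenate([sorted_data, sorted_data[[-1]]])
    y = np.arange(sorted_data.size + 1)
    return x, y


x, y = cumulative_steps([3.0, 1.0, 2.0])
```

Here x holds the values 1, 2, 3, 3 and y holds 0, 1, 2, 3, so passing them to plt.step(x, y) ends the curve at the total count of 3 rather than at 2.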

Answer 3

Answer by Eric O Lebigot

After conclusive discussion with @EOL, I wanted to post my solution (upper left) using a random Gaussian sample as a summary:

import numpy as np
import matplotlib.pyplot as plt
from math import ceil, floor, sqrt

def pdf(x, mu=0, sigma=1):
    """
    Calculates the normal distribution's probability density 
    function (PDF).  

    """
    term1 = 1.0 / ( sqrt(2*np.pi) * sigma )
    term2 = np.exp( -0.5 * ( (x-mu)/sigma )**2 )
    return term1 * term2


# Drawing sample data points
##################################################

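# Random Gaussian data (mean=0, stdev=5 and mean=2, stdev=7)
data1 = np.random.normal(loc=0, scale=5.0, size=30)
data2 = np.random.normal(loc=2, scale=7.0, size=30)
data1.sort()
data2.sort()

# data1 + data2 would add the arrays element-wise, so take the overall extremes explicitly
min_val = floor(min(data1.min(), data2.min()))
max_val = ceil(max(data1.max(), data2.max()))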

##################################################




fig = plt.gcf()
fig.set_size_inches(12,11)

# Cumulative distributions, stepwise:
plt.subplot(2,2,1)
# Raw strings keep the LaTeX backslashes from being read as escape sequences
plt.step(np.concatenate([data1, data1[[-1]]]), np.arange(data1.size+1), label=r'$\mu=0, \sigma=5$')
plt.step(np.concatenate([data2, data2[[-1]]]), np.arange(data2.size+1), label=r'$\mu=2, \sigma=7$')

plt.title('30 samples from a random Gaussian distribution (cumulative)')
plt.ylabel('Count')
plt.xlabel('X-value')
plt.legend(loc='upper left')
plt.xlim([min_val, max_val])
plt.ylim([0, data1.size+1])
plt.grid()

# Cumulative distributions, smooth:
plt.subplot(2,2,2)

plt.plot(np.concatenate([data1, data1[[-1]]]), np.arange(data1.size+1), label=r'$\mu=0, \sigma=5$')
plt.plot(np.concatenate([data2, data2[[-1]]]), np.arange(data2.size+1), label=r'$\mu=2, \sigma=7$')

plt.title('30 samples from a random Gaussian (cumulative)')
plt.ylabel('Count')
plt.xlabel('X-value')
plt.legend(loc='upper left')
plt.xlim([min_val, max_val])
plt.ylim([0, data1.size+1])
plt.grid()


# Probability density function evaluated at the sample points
plt.subplot(2,2,3)

pdf1 = pdf(data1, mu=0, sigma=5)
pdf2 = pdf(data2, mu=2, sigma=7)
plt.plot(data1, pdf1, label=r'$\mu=0, \sigma=5$')
plt.plot(data2, pdf2, label=r'$\mu=2, \sigma=7$')

plt.title('30 samples from a random Gaussian')
plt.legend(loc='upper left')
plt.xlabel('X-value')
plt.ylabel('probability density')
plt.xlim([min_val, max_val])
plt.grid()


# Probability density function
plt.subplot(2,2,4)

x = np.arange(min_val, max_val, 0.05)

pdf1 = pdf(x, mu=0, sigma=5)
pdf2 = pdf(x, mu=2, sigma=7)
plt.plot(x, pdf1, label=r'$\mu=0, \sigma=5$')
plt.plot(x, pdf2, label=r'$\mu=2, \sigma=7$')

plt.title('PDFs of Gaussian distributions')
plt.legend(loc='upper left')
plt.xlabel('X-value')
plt.ylabel('probability density')
plt.xlim([min_val, max_val])
plt.grid()

plt.show()

Answer 4

Answer by Marine Galantin

In order to add my own contribution to the community, here I share my function for plotting histograms. This is how I understood the question: plotting the histogram and the cumulative histogram at the same time:

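import numpy as np
import matplotlib.pyplot as plt

def hist(data, bins, title, labels, value_range=None):
  # Histogram on the left axis and cumulative histogram on the right axis.
  # The last argument is named value_range so that it does not shadow the built-in range.
  fig, ax = plt.subplots(figsize=(15, 8))
  # density=True replaces normed=True, which is no longer accepted by recent Matplotlib versions
  values, base, _ = ax.hist(data, bins=bins, density=True, alpha=0.5, color="green", range=value_range, label="Histogram")
  # Label the main axes: the x-axis of the twin axes is hidden
  ax.set_xlabel(labels)
  ax.set_ylabel("Proportion")
  ax.set_title(title)
  ax_bis = ax.twinx()
  # Prepend 0 so that the curve starts at 0 on the first bin edge and reaches 1 on the last one
  cumulative = np.concatenate([[0], np.cumsum(values)])
  ax_bis.plot(base, cumulative / cumulative[-1], color='darkorange', marker='o', linestyle='-', markersize=1, label="Cumulative Histogram")
  ax_bis.set_ylabel("Proportion")
  ax.legend(loc="upper left")
  ax_bis.legend(loc="upper right")
  plt.show()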

If anyone wonders what it looks like, please take a look (with seaborn activated):