Notice: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/25577352/

Date: 2020-08-18 20:23:45  Source: igfitidea

Plotting CDF of a pandas series in python

Tags: python, pandas, series, cdf

Asked by wolfsatthedoor

Is there a way to do this? I cannot seem to find an easy way to plot a CDF from a pandas Series.

Accepted answer by Dan Frank

I believe the functionality you're looking for is in the hist method of a Series object, which wraps matplotlib's hist() function.

Here's the relevant documentation

In [10]: import matplotlib.pyplot as plt

In [11]: plt.hist?
...
Plot a histogram.

Compute and draw the histogram of *x*. The return value is a
tuple (*n*, *bins*, *patches*) or ([*n0*, *n1*, ...], *bins*,
[*patches0*, *patches1*,...]) if the input contains multiple
data.
...
cumulative : boolean, optional, default : False
    If `True`, then a histogram is computed where each bin gives the
    counts in that bin plus all bins for smaller values. The last bin
    gives the total number of datapoints.  If `normed` is also `True`
    then the histogram is normalized such that the last bin equals 1.
    If `cumulative` evaluates to less than 0 (e.g., -1), the direction
    of accumulation is reversed.  In this case, if `normed` is also
    `True`, then the histogram is normalized such that the first bin
    equals 1.

...

For example

In [12]: import pandas as pd

In [13]: import numpy as np

In [14]: ser = pd.Series(np.random.normal(size=1000))

In [15]: ser.hist(cumulative=True, density=1, bins=100)
Out[15]: <matplotlib.axes.AxesSubplot at 0x11469a590>

In [16]: plt.show()
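
Outside IPython, the session above can be condensed into a script. This is a minimal sketch, assuming a matplotlib version that accepts the `density` keyword (as in the `ser.hist` call above) in place of the `normed` keyword named in the quoted docstring. The fixed seed and the `Agg` backend are illustrative choices that are not part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, only so the sketch is reproducible
ser = pd.Series(rng.normal(size=1000))

fig, ax = plt.subplots()
# density=True combined with cumulative=True scales the bars so the last one reaches 1
heights, bin_edges, _ = ax.hist(ser, bins=100, density=True, cumulative=True)
ax.set_xlabel("value")
ax.set_ylabel("cumulative fraction")
# call fig.savefig(...) or plt.show() here to look at the result
```

The returned `heights` array holds the cumulative fractions per bin, so the plotted values can also be inspected directly.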

Answer by kadee

A CDF, or cumulative distribution function, plot is basically a graph with the sorted values on the X-axis and the cumulative distribution on the Y-axis. So, I would create a new series with the sorted values as index and the cumulative distribution as values.

First create an example series:

import pandas as pd
import numpy as np
ser = pd.Series(np.random.normal(size=100))

Sort the series:

ser = ser.sort_values()

Now, before proceeding, append the last (and largest) value again. This step is important, especially for small sample sizes, in order to get an unbiased CDF:

ser[len(ser)] = ser.iloc[-1]

Create a new series with the sorted values as index and the cumulative distribution as values:

cum_dist = np.linspace(0.,1.,len(ser))
ser_cdf = pd.Series(cum_dist, index=ser)

Finally, plot the function as steps:

ser_cdf.plot(drawstyle='steps')
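
For comparison, the more conventional empirical CDF gives the k-th smallest of n values the height k/n, with no padding step. The sketch below follows that definition; the helper name `ecdf` is hypothetical, and this is a different convention from the linspace construction above rather than a restatement of it.

```python
import numpy as np
import pandas as pd

def ecdf(ser):
    """Return a Series indexed by the sorted values, holding k/n for the k-th smallest."""
    values = np.sort(ser.to_numpy())
    heights = np.arange(1, len(values) + 1) / len(values)
    return pd.Series(heights, index=values)

ser = pd.Series([3.0, 1.0, 2.0, 5.0])  # tiny illustrative sample
ser_cdf = ecdf(ser)
print(ser_cdf)
```

Plotting it with `ser_cdf.plot(drawstyle='steps-post')` should then draw the right-continuous step function.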

Answer by annon

To me, this seemed like a simple way to do it:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

heights = pd.Series(np.random.normal(size=100))

# empirical CDF
def F(x,data):
    return float(len(data[data <= x]))/len(data)

vF = np.vectorize(F, excluded=['data'])

plt.plot(np.sort(heights),vF(x=np.sort(heights), data=heights))
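
Because `np.vectorize` calls `F` once per point, and each call scans the whole data set, the approach above does quadratic work. A possible alternative, sketched here with the hypothetical helper name `ecdf_at`, sorts once and lets `np.searchsorted` count the values that are less than or equal to each query point.

```python
import numpy as np

def ecdf_at(x, data):
    """Fraction of `data` that is <= each entry of `x`."""
    sorted_data = np.sort(np.asarray(data))
    # side="right" returns the number of elements <= x, matching F(x) = P(X <= x)
    counts = np.searchsorted(sorted_data, x, side="right")
    return counts / len(sorted_data)

heights = np.array([1.0, 2.0, 2.0, 4.0])  # tiny illustrative sample
result = ecdf_at([0.0, 2.0, 3.0, 4.0], heights)
print(result)
```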

Answer by wroscoe

This is the easiest way.

import pandas as pd
df = pd.Series([i for i in range(100)])
df.hist( cumulative = True )

[Image: cumulative histogram]

Answer by tommy.carstensen

I came here looking for a plot with both bars and a CDF line.

It can be achieved like this:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
series = pd.Series(np.random.normal(size=10000))
fig, ax = plt.subplots()
ax2 = ax.twinx()
n, bins, patches = ax.hist(series, bins=100, density=False)
n, bins, patches = ax2.hist(
    series, cumulative=1, histtype='step', bins=100, color='tab:orange')
plt.savefig('test.png')

If you want to remove the vertical line, you could just do:

ax.set_xlim((ax.get_xlim()[0], series.max()))

I also saw an elegant solution for doing this with seaborn.

Answer by jk.

I found another solution in "pure" pandas that does not require specifying the number of bins to use in a histogram:

import pandas as pd
import numpy as np # used only to create example data

series = pd.Series(np.random.normal(size=10000))

cdf = series.value_counts().sort_index().cumsum()
cdf.plot()
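
Note that the cumulative sum above accumulates raw counts, so the curve ends at `len(series)` rather than at 1. If a curve on the 0 to 1 scale is wanted, one option is to pass `normalize=True` to `value_counts`. A minimal sketch with a small made-up sample:

```python
import pandas as pd

series = pd.Series([2, 1, 2, 3, 2, 1, 3, 2])  # small illustrative sample

# normalize=True turns counts into relative frequencies before accumulating
cdf = series.value_counts(normalize=True).sort_index().cumsum()
print(cdf)
```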

Answer by Raphvanns

In case you are also interested in the values, not just the plot.

import pandas as pd
import numpy as np  # needed for the continuous examples

# If you are in jupyter
%matplotlib inline

This will always work (discrete and continuous distributions)

# Define your series
s = pd.Series([9, 5, 3, 5, 5, 4, 6, 5, 5, 8, 7], name = 'value')
df = pd.DataFrame(s)
# Get the frequency, PDF and CDF for each value in the series

# Frequency
stats_df = df \
.groupby('value') \
['value'] \
.agg('count') \
.pipe(pd.DataFrame) \
.rename(columns = {'value': 'frequency'})

# PDF
stats_df['pdf'] = stats_df['frequency'] / sum(stats_df['frequency'])

# CDF
stats_df['cdf'] = stats_df['pdf'].cumsum()
stats_df = stats_df.reset_index()
stats_df

# Plot the discrete Probability Mass Function and CDF.
# Technically, the 'pdf' label in the legend and the table should be 'pmf'
# (Probability Mass Function) since the distribution is discrete.

# If you don't have too many values / usually discrete case
stats_df.plot.bar(x = 'value', y = ['pdf', 'cdf'], grid = True)

Alternative example, for a sample drawn from a continuous distribution or when you have a lot of distinct values:

# Define your series
s = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value')
# ... all the same calculation stuff to get the frequency, PDF, CDF
# Plot
stats_df.plot(x = 'value', y = ['pdf', 'cdf'], grid = True)
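
The calculation elided in the snippet above is the same frequency, PDF and CDF computation used for the discrete case. One way to avoid repeating it is to wrap it in a function; the sketch below does that under the hypothetical name `build_stats`, and the fixed seed is only there for reproducibility.

```python
import numpy as np
import pandas as pd

def build_stats(s):
    """Frequency, relative frequency ('pdf') and cumulative share ('cdf') per distinct value."""
    df = s.to_frame(name="value")
    # Count the occurrences of each distinct value; the result is sorted by value
    stats = df.groupby("value")["value"].agg("count").to_frame(name="frequency")
    stats["pdf"] = stats["frequency"] / stats["frequency"].sum()
    stats["cdf"] = stats["pdf"].cumsum()
    return stats.reset_index()

rng = np.random.default_rng(0)
s = pd.Series(rng.normal(loc=10, scale=0.1, size=1000), name="value")
stats_df = build_stats(s)
print(stats_df.tail())
```

The resulting `stats_df` has the columns `value`, `frequency`, `pdf` and `cdf`, so the plotting call shown above can be applied to it unchanged.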

For continuous distributions only

Please note that if it is reasonable to assume that each value occurs only once in the sample (typically the case for continuous distributions), then the groupby() + agg('count') is not necessary (since the count is always 1).

In this case, a percent rank can be used to get to the cdf directly.

Use your best judgment when taking this kind of shortcut! :)

# Define your series
s = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value')
df = pd.DataFrame(s)
# Get to the CDF directly
df['cdf'] = df['value'].rank(method = 'average', pct = True)
# Sort and plot
df.sort_values('value').plot(x = 'value', y = 'cdf', grid = True)
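
To see why the shortcut needs judgment, it may help to compare the percent rank with the CDF computed straight from its definition on a sample that contains ties. The sketch below reuses the discrete series from earlier in this answer; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([9, 5, 3, 5, 5, 4, 6, 5, 5, 8, 7], name="value")

# Direct definition: share of sample values that are <= x
ecdf = s.map(lambda x: (s <= x).mean())

rank_avg = s.rank(method="average", pct=True)  # tied values share their mean rank
rank_max = s.rank(method="max", pct=True)      # tied values all take the highest rank

diff_max = (rank_max - ecdf).abs().max()
diff_avg = (rank_avg - ecdf).abs().max()
print(diff_max)  # largest gap between the 'max' percent rank and the CDF
print(diff_avg)  # largest gap between the 'average' percent rank and the CDF
```

With unique values the two ranking methods coincide; when ties are present, `method='max'` is the one that tracks the empirical CDF.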
