原文地址: http://stackoverflow.com/questions/25577352/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
Stack Overflow
Plotting CDF of a pandas series in python
Asked by wolfsatthedoor
Is there a way to do this? I cannot seem to find an easy way to interface a pandas Series with plotting a CDF.
Accepted answer by Dan Frank
I believe the functionality you're looking for is in the hist method of a Series object, which wraps the hist() function in matplotlib.
Here's the relevant documentation:
In [10]: import matplotlib.pyplot as plt
In [11]: plt.hist?
...
Plot a histogram.
Compute and draw the histogram of *x*. The return value is a
tuple (*n*, *bins*, *patches*) or ([*n0*, *n1*, ...], *bins*,
[*patches0*, *patches1*,...]) if the input contains multiple
data.
...
cumulative : boolean, optional, default : True
If `True`, then a histogram is computed where each bin gives the
counts in that bin plus all bins for smaller values. The last bin
gives the total number of datapoints. If `normed` is also `True`
then the histogram is normalized such that the last bin equals 1.
If `cumulative` evaluates to less than 0 (e.g., -1), the direction
of accumulation is reversed. In this case, if `normed` is also
`True`, then the histogram is normalized such that the first bin
equals 1.
...
For example
In [12]: import pandas as pd
In [13]: import numpy as np
In [14]: ser = pd.Series(np.random.normal(size=1000))
In [15]: ser.hist(cumulative=True, density=1, bins=100)
Out[15]: <matplotlib.axes.AxesSubplot at 0x11469a590>
In [16]: plt.show()
Answer by kadee
A CDF (cumulative distribution function) plot is basically a graph with the sorted values on the X-axis and the cumulative distribution on the Y-axis. So, I would create a new series with the sorted values as index and the cumulative distribution as values.
First create an example series:
import pandas as pd
import numpy as np
ser = pd.Series(np.random.normal(size=100))
Sort the series:
ser = ser.sort_values()
Now, before proceeding, append the last (and largest) value again. This step is important, especially for small sample sizes, in order to get an unbiased CDF:
ser[len(ser)] = ser.iloc[-1]
Create a new series with the sorted values as index and the cumulative distribution as values:
cum_dist = np.linspace(0.,1.,len(ser))
ser_cdf = pd.Series(cum_dist, index=ser)
Finally, plot the function as steps:
ser_cdf.plot(drawstyle='steps')
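The steps above can be collected into one small helper. This is only a sketch: `ecdf_series` is a hypothetical name (not a pandas function), and it uses the common k/n convention (the i-th smallest of n values maps to i/n) instead of the append-the-largest-value trick with `np.linspace` shown above.

```python
import numpy as np
import pandas as pd


def ecdf_series(ser: pd.Series) -> pd.Series:
    """Return the empirical CDF as a Series indexed by the sorted values.

    Hypothetical helper, not part of pandas. Uses the k/n convention:
    the i-th smallest value (1-based) is mapped to i / n.
    """
    values = np.sort(ser.dropna().to_numpy())
    n = len(values)
    cum_dist = np.arange(1, n + 1) / n
    return pd.Series(cum_dist, index=values, name="cdf")


# Example data, seeded so the result is reproducible.
rng = np.random.default_rng(0)
sample = pd.Series(rng.normal(size=100))
sample_cdf = ecdf_series(sample)

# 'steps-post' holds each level until the next sorted value is reached.
ax = sample_cdf.plot(drawstyle="steps-post")
ax.set_xlabel("value")
ax.set_ylabel("cumulative probability")
```

In a script, call `matplotlib.pyplot.show()` afterwards to display the figure.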
Answer by annon
To me, this seemed like a simple way to do it:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
heights = pd.Series(np.random.normal(size=100))
# empirical CDF
def F(x, data):
    return float(len(data[data <= x])) / len(data)
vF = np.vectorize(F, excluded=['data'])
plt.plot(np.sort(heights),vF(x=np.sort(heights), data=heights))
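`np.vectorize` still calls `F` once per point in Python, which can get slow for large samples. As a possible alternative (an assumption on my part, not part of the answer above), the same empirical CDF can be evaluated with `np.searchsorted`; `ecdf_at` is a hypothetical helper name.

```python
import numpy as np


def ecdf_at(data, x):
    """Fraction of observations in `data` that are <= each value in `x`.

    Hypothetical helper: sorts once, then uses a binary search per query.
    """
    sorted_data = np.sort(np.asarray(data, dtype=float))
    # side="right" counts the elements that are <= x (ties included).
    counts = np.searchsorted(sorted_data, x, side="right")
    return counts / sorted_data.size


heights = np.array([1.0, 2.0, 2.0, 3.0])
print(ecdf_at(heights, [0.5, 2.0, 3.0]))
```

To draw it, pass `np.sort(heights)` and `ecdf_at(heights, np.sort(heights))` to `plt.plot`, as in the snippet above.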
Answer by wroscoe
This is the easiest way.
import pandas as pd
df = pd.Series([i for i in range(100)])
df.hist( cumulative = True )
Answer by tommy.carstensen
I came here looking for a plot like this with bars and a CDF line:

It can be achieved like this:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
series = pd.Series(np.random.normal(size=10000))
fig, ax = plt.subplots()
ax2 = ax.twinx()
n, bins, patches = ax.hist(series, bins=100, density=False)
n, bins, patches = ax2.hist(
    series, cumulative=1, histtype='step', bins=100, color='tab:orange')
plt.savefig('test.png')
If you want to remove the vertical line, the original answer links to an explanation of how to do that. Alternatively, you could just do:
ax.set_xlim((ax.get_xlim()[0], series.max()))
I also saw an elegant solution (linked in the original answer) on how to do it with seaborn.
Answer by jk.
I found another solution in "pure" pandas that does not require specifying the number of bins to use in a histogram:
import pandas as pd
import numpy as np # used only to create example data
series = pd.Series(np.random.normal(size=10000))
cdf = series.value_counts().sort_index().cumsum()
cdf.plot()
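Note that the snippet above yields cumulative counts, so the curve ends at `len(series)` rather than at 1. If a CDF on a 0 to 1 scale is wanted, one option (a sketch, assuming that is the goal) is to pass `normalize=True` to `value_counts`:

```python
import pandas as pd

series = pd.Series([9, 5, 3, 5, 5, 4, 6, 5, 5, 8, 7])

# normalize=True turns counts into relative frequencies,
# so the running sum ends at 1 instead of at len(series).
cdf = series.value_counts(normalize=True).sort_index().cumsum()
print(cdf)
```

The result is indexed by the distinct sorted values, so `cdf.plot()` works the same way as before.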
Answer by Raphvanns
In case you are also interested in the values, not just the plot.
import numpy as np  # used below to draw the continuous example samples
import pandas as pd
# If you are in jupyter
%matplotlib inline
This will always work (discrete and continuous distributions)
# Define your series
s = pd.Series([9, 5, 3, 5, 5, 4, 6, 5, 5, 8, 7], name = 'value')
df = pd.DataFrame(s)
# Get the frequency, PDF and CDF for each value in the series
# Frequency
stats_df = (
    df.groupby('value')['value']
    .agg('count')
    .pipe(pd.DataFrame)
    .rename(columns={'value': 'frequency'})
)
# PDF
stats_df['pdf'] = stats_df['frequency'] / sum(stats_df['frequency'])
# CDF
stats_df['cdf'] = stats_df['pdf'].cumsum()
stats_df = stats_df.reset_index()
stats_df
# Plot the discrete Probability Mass Function and CDF.
# Technically, the 'pdf' label in the legend and the table should be 'pmf'
# (Probability Mass Function) since the distribution is discrete.
# If you don't have too many values / usually discrete case
stats_df.plot.bar(x = 'value', y = ['pdf', 'cdf'], grid = True)
Alternative example, for a sample drawn from a continuous distribution or one with a lot of individual values:
# Define your series
s = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value')
# ... all the same calculation stuff to get the frequency, PDF, CDF
# Plot
stats_df.plot(x = 'value', y = ['pdf', 'cdf'], grid = True)
For continuous distributions only
Please note that if it is very reasonable to assume that each value occurs only once in the sample (typically the case for continuous distributions), then the groupby() + agg('count') is not necessary (since the count is always 1).
In this case, a percent rank can be used to get to the cdf directly.
Use your best judgment when taking this kind of shortcut! :)
# Define your series
s = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value')
df = pd.DataFrame(s)
# Get to the CDF directly
df['cdf'] = df['value'].rank(method = 'average', pct = True)
# Sort and plot
df.sort_values('value').plot(x = 'value', y = 'cdf', grid = True)
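If repeated values cannot be ruled out, the percent-rank shortcut may still be usable with `method='max'`: each tied value then gets the highest rank in its group, so the percent rank equals the fraction of observations less than or equal to that value. The sketch below reuses the discrete example series from earlier in this answer and compares against a direct computation; treat it as an illustration, not as part of the original answer.

```python
import pandas as pd

s = pd.Series([9, 5, 3, 5, 5, 4, 6, 5, 5, 8, 7], name="value")
df = pd.DataFrame(s)

# method="max": tied values all receive the largest rank of their group,
# so rank / n is the share of observations <= that value.
df["cdf"] = df["value"].rank(method="max", pct=True)

# Direct (slower) computation of the same quantity, for comparison.
direct = df["value"].apply(lambda x: (s <= x).mean())

print(df.sort_values("value").drop_duplicates())
```

With `method='average'`, tied values would instead get the mean rank of their group, which no longer matches the empirical CDF at that value.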

