Original source: http://stackoverflow.com/questions/31348737/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
Stack Overflow
How to plot kernel density plot of dates in Pandas?
Asked by bhackinen
I have a pandas DataFrame where each observation has a date (stored as a column of datetime64 values). These dates are spread over a period of about 5 years. I would like to plot a kernel density plot of the dates of all the observations, with the years labelled on the x-axis.
I have figured out how to create a time-delta relative to some reference date and then create a density plot of the number of hours/days/years between each observation and the reference date:
df['relativeDate'].astype('timedelta64[D]').plot(kind='kde')
But this isn't exactly what I want: If I convert to year-deltas, then the x-axis is right but I lose the within-year variation. But if I take a smaller unit of time like hour or day, the x-axis labels are much harder to interpret.
What's the simplest way to make this work in Pandas?
Accepted answer by Jianxun Li
Inspired by @JohnE's answer, an alternative way to convert dates to numeric values is to use .toordinal().
import datetime  # needed below for datetime.datetime.fromordinal
import pandas as pd
import numpy as np
# simulate some artificial data
# ===============================
np.random.seed(0)
dates = pd.date_range('2010-01-01', periods=31, freq='D')
df = pd.DataFrame(np.random.choice(dates,100), columns=['dates'])
# use toordinal() to get datenum
df['ordinal'] = [x.toordinal() for x in df.dates]
print(df)
        dates  ordinal
0  2010-01-13   733785
1  2010-01-16   733788
2  2010-01-22   733794
3  2010-01-01   733773
4  2010-01-04   733776
5  2010-01-28   733800
6  2010-01-04   733776
7  2010-01-08   733780
8  2010-01-10   733782
9  2010-01-20   733792
..        ...      ...
90 2010-01-19   733791
91 2010-01-28   733800
92 2010-01-01   733773
93 2010-01-15   733787
94 2010-01-04   733776
95 2010-01-22   733794
96 2010-01-13   733785
97 2010-01-26   733798
98 2010-01-11   733783
99 2010-01-21   733793
[100 rows x 2 columns]    
# plot non-parametric kde on numeric datenum
ax = df['ordinal'].plot(kind='kde')
# rename the xticks with labels
x_ticks = ax.get_xticks()
ax.set_xticks(x_ticks[::2])
xlabels = [datetime.datetime.fromordinal(int(x)).strftime('%Y-%m-%d') for x in x_ticks[::2]]
ax.set_xticklabels(xlabels)
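As a possible variation on the tick relabelling above, the ticks could be placed on 1 January of each year so that the axis reads in years, as the question asks. This is only a sketch: `year_ticks` is a hypothetical helper (not part of pandas), and the sample dates are made up for illustration.

```python
import datetime
import pandas as pd

def year_ticks(dates):
    """Hypothetical helper: ordinal positions and labels for each 1 January
    from the first year in the data to the year after the last one."""
    first_year = dates.min().year
    last_year = dates.max().year + 1
    years = range(first_year, last_year + 1)
    positions = [datetime.date(y, 1, 1).toordinal() for y in years]
    labels = [str(y) for y in years]
    return positions, labels

# made-up sample dates spanning roughly five years
dates = pd.Series(pd.to_datetime(['2011-03-15', '2013-08-24', '2015-11-02']))
positions, labels = year_ticks(dates)
print(labels)
```

Assuming `ax` is the axes returned by `df['ordinal'].plot(kind='kde')`, calling `ax.set_xticks(positions)` followed by `ax.set_xticklabels(labels)` should then label each tick with its year.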


Answer by JohnE
I imagine there is a better, more automatic way to do this, but if not, this ought to be a decent workaround. First, let's set up some sample data:
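import numpy as np
import pandas as pd
np.random.seed(479)
start_date = '2011-1-1'
df = pd.DataFrame({ 'date':np.random.choice( 
                    pd.date_range(start_date, periods=365*5, freq='D'), 50) })
df['rel'] = df['date'] - pd.to_datetime(start_date)
# .dt.days returns whole days as integers; recent pandas versions
# reject astype('timedelta64[D]')
df['rel'] = df['rel'].dt.days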
        date   rel
0 2014-06-06  1252
1 2011-10-26   298
2 2013-08-24   966
3 2014-09-25  1363
4 2011-12-23   356
As you can see, 'rel' is just the number of days since the starting day. It's essentially an integer, so all you really need to do is normalize it with respect to the starting date.
df['year_as_float'] = pd.to_datetime(start_date).year + df.rel / 365.
        date   rel  year_as_float
0 2014-06-06  1252    2014.430137
1 2011-10-26   298    2011.816438
2 2013-08-24   966    2013.646575
3 2014-09-25  1363    2014.734247
4 2011-12-23   356    2011.975342
You'd need to adjust that slightly for a start date other than Jan 1. This also ignores leap years, which isn't a practical issue if you're just producing a KDE plot over 5 years, but it could matter depending on what else you might want to do.
Here's the plot:
df['year_as_float'].plot(kind='kde')
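If the leap-year and start-date caveats do matter, one possible alternative is to compute the fractional year directly from each date rather than from a day count. The sketch below assumes a hypothetical helper named `year_as_float` (it is not a pandas function) and uses made-up sample dates.

```python
import pandas as pd

def year_as_float(dates):
    """Hypothetical helper: fractional year that accounts for leap years
    and does not depend on a reference start date."""
    dates = pd.Series(pd.to_datetime(dates))
    # 366 days in a leap year, 365 otherwise
    days_in_year = 365 + dates.dt.is_leap_year.astype(int)
    # dayofyear is 1-based, so 1 January maps to exactly .0
    return dates.dt.year + (dates.dt.dayofyear - 1) / days_in_year

# made-up sample dates
sample = pd.Series(pd.to_datetime(['2011-01-01', '2012-07-02', '2014-06-06']))
result = year_as_float(sample)
print(result)
```

With this approach, 2 July 2012 (day 184 of a 366-day year) maps to exactly 2012.5. The resulting column can be passed to `.plot(kind='kde')` in the same way as `df['year_as_float']` above.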



