Original question: http://stackoverflow.com/questions/30056399/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Interpolate and fill pandas dataframe with datetime index
Asked by Delta_Fore
Hi, I'm trying to interpolate a DataFrame that has a DatetimeIndex.
Here's the data:
res = pd.DataFrame(cursor.execute("SELECT DATETIME,VALUE FROM {} WHERE DATETIME > ? AND DATETIME < ?".format(table),[start,end]).fetchall(),columns=['date','value'])
res.set_index('date',inplace=True)
which produces:
2013-01-31 00:00:00 517
2012-12-31 00:00:00 263
2012-11-30 00:00:00 1917
2012-10-31 00:00:00 391
2012-09-30 00:00:00 782
2012-08-31 00:00:00 700
2012-07-31 00:00:00 799
2012-06-30 00:00:00 914
2012-05-31 00:00:00 141
2012-04-30 00:00:00 342
2012-03-31 00:00:00 199
2012-02-29 00:00:00 533
2012-01-31 00:00:00 1393
2011-12-31 00:00:00 497
2011-11-30 00:00:00 1457
2011-10-31 00:00:00 997
2011-09-30 00:00:00 533
2011-08-31 00:00:00 626
2011-07-31 00:00:00 1933
2011-06-30 00:00:00 4248
2011-05-31 00:00:00 1248
2011-04-30 00:00:00 904
2011-03-31 00:00:00 3280
2011-02-28 00:00:00 390
2011-01-31 00:00:00 601
2010-12-31 00:00:00 423
2010-11-30 00:00:00 748
2010-10-31 00:00:00 433
2010-09-30 00:00:00 734
2010-08-31 00:00:00 845
2010-07-31 00:00:00 1693
2010-06-30 00:00:00 2742
2010-05-31 00:00:00 669
The dates are not contiguous. I want a daily value, so I want to fill in the missing values using some kind of interpolation.
I first tried to reindex to a daily index and then interpolate:
new_index = pd.date_range(date(2010,1,1),date(2014,1,31),freq='D')
df2 = res.reindex(new_index) # This returns NaN
df2.interpolate('cubic') # Fails with error TypeError: Cannot interpolate with all NaNs.
What I hope to get back is a DataFrame with a row for every date between 2010 and 2014, each with an interpolated value calculated from the points surrounding it.
There is probably a simple way to do this, but I'm not sure what it is.
Accepted answer by Zero
Here's one way to do it.
First, build a new daily index spanning the min and max of the df.index dates, and reindex to it:
In [152]: df_reindexed = df.reindex(pd.date_range(start=df.index.min(),
                                                  end=df.index.max(),
                                                  freq='1D'))
Then use interpolate(method='linear') on the reindexed frame to fill in the values:
In [153]: df_reindexed.interpolate(method='linear')
Out[153]:
Value
2010-05-31 669.000000
2010-06-01 738.100000
2010-06-02 807.200000
2010-06-03 876.300000
2010-06-04 945.400000
2010-06-05 1014.500000
...
2013-01-25 467.838710
2013-01-26 476.032258
2013-01-27 484.225806
2013-01-28 492.419355
2013-01-29 500.612903
2013-01-30 508.806452
2013-01-31 517.000000
[977 rows x 1 columns]
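For a self-contained version of the two steps above, here is a minimal sketch on a three-row subset of the question's data. It assumes the dates arrive as plain strings (as can happen with rows fetched from a database cursor). That would be one plausible reason the question's reindex produced only NaN, since string labels never match the Timestamp labels generated by pd.date_range. The frame name `res` and the pd.to_datetime conversion step are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Three month-end rows from the question, in the same descending order.
# Assumption: the dates come back from the database as plain strings.
res = pd.DataFrame(
    {"value": [1693.0, 2742.0, 669.0]},
    index=["2010-07-31 00:00:00", "2010-06-30 00:00:00", "2010-05-31 00:00:00"],
)

# Convert the labels to real timestamps and sort them ascending.
res.index = pd.to_datetime(res.index)
res = res.sort_index()

# Build a daily index spanning the first and last observation.
daily_index = pd.date_range(start=res.index.min(), end=res.index.max(), freq="D")

# Reindexing keeps the original rows and inserts NaN rows for the missing days.
df_reindexed = res.reindex(daily_index)

# Linear interpolation fills each NaN from its neighbouring known points.
df_daily = df_reindexed.interpolate(method="linear")

print(df_daily.head())
```

If the index already holds datetimes, the conversion is harmless, so it may be worth keeping as a guard whenever the data comes from an external source.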
Answer by JohnE
As an add-on to @JohnGalt's answer, you could also use resample, which is slightly more convenient than reindex here:
df.resample('D').interpolate('cubic')
value
date
2010-05-31 669.000000
2010-06-01 830.400272
2010-06-02 983.988431
2010-06-03 1129.919466
2010-06-04 1268.348368
2010-06-05 1399.430127
2010-06-06 1523.319734
...
2010-06-25 2716.850752
2010-06-26 2729.445324
2010-06-27 2738.102544
2010-06-28 2742.977403
2010-06-29 2744.224892
2010-06-30 2742.000000
2010-07-01 2736.454249
2010-07-02 2727.725284
2010-07-03 2715.947277
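As a rough sketch of the resample route on the same kind of data: resample needs a DatetimeIndex, and interpolate('cubic') relies on SciPy being installed, so the example below uses method='linear', which needs no extra dependency. The three sample rows and the name `res` are illustrative, and the comparison against the explicit reindex is only there to show that the two approaches agree on this data.

```python
import pandas as pd

# Three month-end observations taken from the question's data.
res = pd.DataFrame(
    {"value": [669.0, 2742.0, 1693.0]},
    index=pd.to_datetime(["2010-05-31", "2010-06-30", "2010-07-31"]),
)

# Upsample to daily frequency and interpolate in one chain.
daily = res.resample("D").interpolate(method="linear")

# The explicit reindex over a daily date_range, for comparison.
explicit = res.reindex(
    pd.date_range(res.index.min(), res.index.max(), freq="D")
).interpolate(method="linear")

print(daily.loc["2010-06-28":"2010-07-02"])
```

With SciPy available, replacing 'linear' with 'cubic' should give a smooth curve through the known points like the output shown in the answer.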

