Original question: http://stackoverflow.com/questions/47231496/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverflow
pandas fill missing dates in time series
Asked by Alter
I have a dataframe that holds aggregated data for some days. I want to add the missing days.
I was following another post, "Add missing dates to pandas dataframe". Unfortunately, it overwrote my results (maybe the functionality has changed slightly?). The code is below:
import random
import datetime as dt
import numpy as np
import pandas as pd
def generate_row(year, month, day):
    while True:
        date = dt.datetime(year=year, month=month, day=day)
        data = np.random.random(size=4)
        yield [date] + list(data)
# days I have data for
dates = [(2000, 1, 1), (2000, 1, 2), (2000, 2, 4)]
generators = [generate_row(*date) for date in dates]
# get 5 data points for each
data = [next(generator) for generator in generators for _ in range(5)]
df = pd.DataFrame(data, columns=['date'] + ['f'+str(i) for i in range(1,5)])
# df
groupby_day = df.groupby(pd.PeriodIndex(data=df.date, freq='D'))
results = groupby_day.sum()
idx = pd.date_range(min(df.date), max(df.date))
results.reindex(idx, fill_value=0)
Answered by Andy Hayden
You need to use period_range rather than date_range:
In [11]: idx = pd.period_range(min(df.date), max(df.date))
...: results.reindex(idx, fill_value=0)
...:
Out[11]:
f1 f2 f3 f4
2000-01-01 2.049157 1.962635 2.756154 2.224751
2000-01-02 2.675899 2.587217 1.540823 1.606150
2000-01-03 0.000000 0.000000 0.000000 0.000000
2000-01-04 0.000000 0.000000 0.000000 0.000000
2000-01-05 0.000000 0.000000 0.000000 0.000000
2000-01-06 0.000000 0.000000 0.000000 0.000000
2000-01-07 0.000000 0.000000 0.000000 0.000000
2000-01-08 0.000000 0.000000 0.000000 0.000000
2000-01-09 0.000000 0.000000 0.000000 0.000000
2000-01-10 0.000000 0.000000 0.000000 0.000000
2000-01-11 0.000000 0.000000 0.000000 0.000000
2000-01-12 0.000000 0.000000 0.000000 0.000000
2000-01-13 0.000000 0.000000 0.000000 0.000000
2000-01-14 0.000000 0.000000 0.000000 0.000000
2000-01-15 0.000000 0.000000 0.000000 0.000000
2000-01-16 0.000000 0.000000 0.000000 0.000000
2000-01-17 0.000000 0.000000 0.000000 0.000000
2000-01-18 0.000000 0.000000 0.000000 0.000000
2000-01-19 0.000000 0.000000 0.000000 0.000000
2000-01-20 0.000000 0.000000 0.000000 0.000000
2000-01-21 0.000000 0.000000 0.000000 0.000000
2000-01-22 0.000000 0.000000 0.000000 0.000000
2000-01-23 0.000000 0.000000 0.000000 0.000000
2000-01-24 0.000000 0.000000 0.000000 0.000000
2000-01-25 0.000000 0.000000 0.000000 0.000000
2000-01-26 0.000000 0.000000 0.000000 0.000000
2000-01-27 0.000000 0.000000 0.000000 0.000000
2000-01-28 0.000000 0.000000 0.000000 0.000000
2000-01-29 0.000000 0.000000 0.000000 0.000000
2000-01-30 0.000000 0.000000 0.000000 0.000000
2000-01-31 0.000000 0.000000 0.000000 0.000000
2000-02-01 0.000000 0.000000 0.000000 0.000000
2000-02-02 0.000000 0.000000 0.000000 0.000000
2000-02-03 0.000000 0.000000 0.000000 0.000000
2000-02-04 1.856158 2.892620 2.986166 2.793448
This is because your groupby uses a PeriodIndex rather than a datetime index:
df.groupby(pd.PeriodIndex(data=df.date, freq='D'))
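To make the point concrete, here is a minimal sketch on a small made-up frame (the three rows and the single column f1 are assumptions for illustration, not the question's data). It groups by daily periods, then fills the gap in two ways: by reindexing with period labels, and by first converting the index to timestamps so that a date_range lines up. As far as I understand it, the original attempt failed because Period labels and Timestamp labels are different kinds of keys, so none of the existing rows matched.

```python
import pandas as pd

# Hypothetical data: two rows on Jan 1, one row on Jan 4, nothing in between.
df = pd.DataFrame({
    "date": pd.to_datetime(["2000-01-01", "2000-01-01", "2000-01-04"]),
    "f1": [1.0, 2.0, 3.0],
})

# Group by daily periods, so the result is indexed by a PeriodIndex.
results = df.groupby(df["date"].dt.to_period("D"))[["f1"]].sum()

# Option 1: reindex with period labels. Existing rows match, gaps become 0.
period_idx = pd.period_range(df["date"].min(), df["date"].max(), freq="D")
filled = results.reindex(period_idx, fill_value=0)

# Option 2: convert the index to timestamps, then a date_range lines up.
as_dates = results.copy()
as_dates.index = as_dates.index.to_timestamp()
date_idx = pd.date_range(df["date"].min(), df["date"].max(), freq="D")
filled_dates = as_dates.reindex(date_idx, fill_value=0)

print(filled)
print(filled_dates)
```

Either way the rule is the same: the index passed to reindex has to be of the same kind as the index being reindexed.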
You could have used a pd.Grouper instead:
df.groupby(pd.Grouper(key="date", freq='D'))
which would have given a datetime index.
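A short sketch of the pd.Grouper route, again on made-up data (the frame below is an assumption for illustration). In my experience a frequency-based Grouper bins over the whole span between the first and last date, so the empty days already show up with a sum of 0. A reindex with date_range is then only needed to pad beyond that span.

```python
import pandas as pd

# Hypothetical data: two rows on Jan 1, one row on Jan 4.
df = pd.DataFrame({
    "date": pd.to_datetime(["2000-01-01", "2000-01-01", "2000-01-04"]),
    "f1": [1.0, 2.0, 3.0],
})

# Grouping on the "date" column at daily frequency gives a DatetimeIndex.
daily = df.groupby(pd.Grouper(key="date", freq="D")).sum()

# A date_range now matches the index, e.g. to extend two days past the data.
full = pd.date_range("2000-01-01", "2000-01-06", freq="D")
padded = daily.reindex(full, fill_value=0)

print(daily)
print(padded)
```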
Answered by Alter
From c???s????'s hints in the comments:
resample fits well here.
Resample: Convenience method for frequency conversion and resampling of time series. Object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or pass datetime-like values to the on or level keyword.
import random
import datetime as dt
import numpy as np
import pandas as pd
def generate_row(year, month, day):
    while True:
        date = dt.datetime(year=year, month=month, day=day)
        data = np.random.random(size=4)
        yield [date] + list(data)
# days I have data for
dates = [(2000, 1, 1), (2000, 1, 2), (2000, 2, 4)]
generators = [generate_row(*date) for date in dates]
# get 5 points for each
data = [next(generator) for generator in generators for _ in range(5)]
# make dataframe
df = pd.DataFrame(data, columns=['date'] + ['f'+str(i) for i in range(1,5)])
# using the resample method
# move 'date' into the index so it is not left behind as a column to be summed
df = df.set_index('date')
df = df.resample('D').sum().fillna(0)
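The documentation excerpt above mentions the on keyword, which avoids touching the index at all. The sketch below uses made-up data (the frame and the column names f1 and f2 are assumptions for illustration) and also contrasts sum with mean, since that affects whether fillna(0) has anything to do. With current pandas defaults, I would expect sum to give 0 for an empty day, while mean gives NaN.

```python
import pandas as pd

# Hypothetical data: two rows on Jan 1, one row on Jan 4.
df = pd.DataFrame({
    "date": pd.to_datetime(["2000-01-01", "2000-01-01", "2000-01-04"]),
    "f1": [1.0, 2.0, 3.0],
    "f2": [0.5, 0.5, 1.0],
})

# Resample on the "date" column directly. Missing days appear as rows.
daily_sum = df.resample("D", on="date")[["f1", "f2"]].sum()

# With mean, empty days have no value, so they come back as NaN
# and fillna(0) is what turns them into zeros.
daily_mean = df.resample("D", on="date")[["f1", "f2"]].mean()
daily_mean_filled = daily_mean.fillna(0)

print(daily_sum)
print(daily_mean_filled)
```

Whether 0 is the right fill for a day without observations depends on what the numbers mean. For counts or totals it usually is. For averages, leaving NaN may be more honest.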