Python: add missing dates to a pandas DataFrame

Notice: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/19324453/

Time: 2020-08-19 13:26:35  Source: igfitidea

Add missing dates to pandas dataframe

Tags: python, date, plot, pandas, dataframe

Asked by KHibma

My data can have multiple events on a given date or NO events on a date. I take these events, get a count by date and plot them. However, when I plot them, my two series don't always match.

idx = pd.date_range(df['simpleDate'].min(), df['simpleDate'].max())
s = df.groupby(['simpleDate']).size()

In the above code, idx becomes a range of, say, 30 dates: 09-01-2013 to 09-30-2013. However, s may only have 25 or 26 days because no events happened on some dates. I then get an AssertionError, as the sizes don't match, when I try to plot:

fig, ax = plt.subplots()    
ax.bar(idx.to_pydatetime(), s, color='green')

What's the proper way to tackle this? Do I want to remove dates with no values from idx, or (which I'd rather do) add the missing dates to the series with a count of 0? I'd rather have a full graph of 30 days with 0 values. If this approach is right, any suggestions on how to get started? Do I need some sort of dynamic reindex function?

Here's a snippet of s (df.groupby(['simpleDate']).size()); notice there are no entries for 04 and 05.

09-02-2013     2
09-03-2013    10
09-06-2013     5
09-07-2013     1

Accepted answer by unutbu

You could use Series.reindex:

import pandas as pd

idx = pd.date_range('09-01-2013', '09-30-2013')

s = pd.Series({'09-02-2013': 2,
               '09-03-2013': 10,
               '09-06-2013': 5,
               '09-07-2013': 1})
s.index = pd.DatetimeIndex(s.index)

s = s.reindex(idx, fill_value=0)
print(s)

yields

2013-09-01     0
2013-09-02     2
2013-09-03    10
2013-09-04     0
2013-09-05     0
2013-09-06     5
2013-09-07     1
2013-09-08     0
...
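
Applied to the question's own pipeline, the same idea might look like the sketch below. The sample events are made up for illustration; only the simpleDate column name comes from the question.

```python
import pandas as pd

# Made-up events: two on 09-02, none on 09-04 or 09-05.
df = pd.DataFrame({
    'simpleDate': pd.to_datetime(
        ['2013-09-02', '2013-09-02', '2013-09-03', '2013-09-06', '2013-09-07']),
})

# Count events per date, then reindex onto the full daily range.
idx = pd.date_range(df['simpleDate'].min(), df['simpleDate'].max())
s = df.groupby('simpleDate').size().reindex(idx, fill_value=0)

# idx and s now have the same length, so they can be plotted together.
print(s)
```

Since s is aligned with idx after the reindex, a call such as ax.bar(idx.to_pydatetime(), s) receives sequences of equal length.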

Answer by eiTan LaVi

Here's a nice method to fill in missing dates in a dataframe, with your choice of fill_value, the number of days_back to fill in, and the sort order (date_order) by which to sort the dataframe:

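from datetime import datetime, timedelta

import pandas as pd

def fill_in_missing_dates(df, date_col_name='date', date_order='asc', fill_value=0, days_back=30):
    # Index by date without mutating the caller's dataframe.
    df = df.set_index(date_col_name, drop=True)
    df.index = pd.DatetimeIndex(df.index)
    # Build a daily range covering the last `days_back` days up to today.
    d = datetime.now().date()
    d2 = d - timedelta(days=days_back)
    idx = pd.date_range(d2, d, freq="D")
    # Dates absent from the data get `fill_value`; rows outside the range are dropped.
    df = df.reindex(idx, fill_value=fill_value)
    df[date_col_name] = pd.DatetimeIndex(df.index)
    # Apply the requested sort order (the original accepted date_order but never used it).
    df = df.sort_index(ascending=(date_order == 'asc'))
    return df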
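
Because the function above anchors its range to today's date, its output changes from day to day. A hypothetical variant that takes explicit start and end dates is easier to try out. The name fill_missing_dates_between, its parameters, and the sample data are illustrative assumptions, not part of the answer.

```python
import pandas as pd

def fill_missing_dates_between(df, start, end, date_col_name='date', fill_value=0):
    """Hypothetical helper: return df with one row per day from start to end."""
    out = df.set_index(date_col_name)
    out.index = pd.DatetimeIndex(out.index)
    # Dates missing from the data are added with fill_value.
    idx = pd.date_range(start, end, freq='D')
    out = out.reindex(idx, fill_value=fill_value)
    # Restore the date column from the index.
    out.index.name = date_col_name
    return out.reset_index()

# Made-up sample data for illustration.
events = pd.DataFrame({'date': ['2013-09-02', '2013-09-03', '2013-09-06'],
                       'count': [2, 10, 5]})
filled = fill_missing_dates_between(events, '2013-09-01', '2013-09-07')
print(filled)
```

Like the original, this sketch relies on reindex, so it assumes each date appears at most once in the input.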

Answer by Nick Edgar

One issue is that reindex will fail if there are duplicate values. Say we're working with timestamped data, which we want to index by date:

df = pd.DataFrame({
    'timestamps': pd.to_datetime(
        ['2016-11-15 1:00','2016-11-16 2:00','2016-11-16 3:00','2016-11-18 4:00']),
    'values':['a','b','c','d']})
df.index = pd.DatetimeIndex(df['timestamps']).floor('D')
df

yields

            timestamps             values
2016-11-15  "2016-11-15 01:00:00"  a
2016-11-16  "2016-11-16 02:00:00"  b
2016-11-16  "2016-11-16 03:00:00"  c
2016-11-18  "2016-11-18 04:00:00"  d

Due to the duplicate 2016-11-16 date, an attempt to reindex:

all_days = pd.date_range(df.index.min(), df.index.max(), freq='D')
df.reindex(all_days)

fails with:

...
ValueError: cannot reindex from a duplicate axis

(This means the index contains duplicate entries, not that the index is itself a duplicate of another one.)

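Instead, we can use .loc to look up entries for all dates in range (this relies on older pandas behavior; since pandas 1.0, .loc raises a KeyError when a requested label is missing from the index):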

df.loc[all_days]

yields

            timestamps             values
2016-11-15  "2016-11-15 01:00:00"  a
2016-11-16  "2016-11-16 02:00:00"  b
2016-11-16  "2016-11-16 03:00:00"  c
2016-11-17  NaN                    NaN
2016-11-18  "2016-11-18 04:00:00"  d

fillna can be used on the column series to fill blanks if needed.
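
One alternative that tolerates duplicate dates and does not depend on .loc accepting missing labels is a left join against an empty frame indexed by the full date range. This is a sketch of a substitute technique, not the answer's own method; the 'missing' placeholder is an arbitrary choice.

```python
import pandas as pd

df = pd.DataFrame({
    'timestamps': pd.to_datetime(
        ['2016-11-15 01:00', '2016-11-16 02:00', '2016-11-16 03:00', '2016-11-18 04:00']),
    'values': ['a', 'b', 'c', 'd']})
df.index = pd.DatetimeIndex(df['timestamps']).floor('D')
df.index.name = None  # keep the index from sharing a name with the 'timestamps' column

all_days = pd.date_range(df.index.min(), df.index.max(), freq='D')

# Left-join the data onto an empty frame that has one row per day.
# Duplicate dates on the right side simply produce several rows.
filled = pd.DataFrame(index=all_days).join(df)

# Fill the blanks in one column.
filled['values'] = filled['values'].fillna('missing')
print(filled)
```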

Answer by Brad Solomon

A quicker workaround is to use .asfreq(). This doesn't require creation of a new index to call within .reindex().

# "broken" (staggered) dates
dates = pd.Index([pd.Timestamp('2012-05-01'), 
                  pd.Timestamp('2012-05-04'), 
                  pd.Timestamp('2012-05-06')])
s = pd.Series([1, 2, 3], dates)

print(s.asfreq('D'))
2012-05-01    1.0
2012-05-02    NaN
2012-05-03    NaN
2012-05-04    2.0
2012-05-05    NaN
2012-05-06    3.0
Freq: D, dtype: float64
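
If zeros are wanted instead of NaN, asfreq also accepts a fill_value argument. A minimal sketch with the same dates:

```python
import pandas as pd

dates = pd.DatetimeIndex(['2012-05-01', '2012-05-04', '2012-05-06'])
s = pd.Series([1, 2, 3], index=dates)

# fill_value applies only to the dates that asfreq adds,
# not to NaNs already present in the data.
filled = s.asfreq('D', fill_value=0)
print(filled)
```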

Answer by JohnE

An alternative approach is resample, which can handle duplicate dates in addition to missing dates. For example:

df.resample('D').mean()

resample is a deferred operation like groupby, so you need to follow it with another operation. In this case mean works well, but you can also use many other pandas methods like max, sum, etc.

Here is the original data, but with an extra entry for '2013-09-03':

             val
date           
2013-09-02     2
2013-09-03    10
2013-09-03    20    <- duplicate date added to OP's data
2013-09-06     5
2013-09-07     1

And here are the results:

             val
date            
2013-09-02   2.0
2013-09-03  15.0    <- mean of original values for 2013-09-03
2013-09-04   NaN    <- NaN b/c date not present in orig
2013-09-05   NaN    <- NaN b/c date not present in orig
2013-09-06   5.0
2013-09-07   1.0

I left the missing dates as NaNs to make it clear how this works, but you can add fillna(0) to replace NaNs with zeroes as requested by the OP, or alternatively use something like interpolate() to fill with non-zero values based on the neighboring rows.
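
As a runnable sketch of the above, the frame below rebuilds the sample data with the duplicate 2013-09-03 entry and shows two ways to end up with zeros on the missing days:

```python
import pandas as pd

df = pd.DataFrame(
    {'val': [2, 10, 20, 5, 1]},
    index=pd.DatetimeIndex(
        ['2013-09-02', '2013-09-03', '2013-09-03', '2013-09-06', '2013-09-07'],
        name='date'))

# Mean per day, with the missing days filled with zero afterwards.
daily_mean = df.resample('D').mean().fillna(0)

# Sum per day; days with no rows sum to 0 without a separate fillna.
daily_sum = df.resample('D').sum()

print(daily_mean)
print(daily_sum)
```

For event counts like the OP's, sum is probably the closer fit, since the mean would average duplicate rows rather than add them up.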