Pandas Grouper by frequency with a completeness requirement

Question

Asked by Michael Delgado

I have monthly time series data that is both missing some entries and has scattered NaN values for other reasons. I need to aggregate the data into Quarterly and Annual series, but I do not want to report data for quarters/years with missing data. For example, in the data below, I do not want to report data for Q1 2014 because I am missing January of that year.

import pandas as pd, numpy as np

df = pd.DataFrame([
  ('Monthly','2014-02-1', 529.1),
  ('Monthly','2014-03-1',  67.1),
  ('Monthly','2014-04-1', np.nan), 
  ('Monthly','2014-05-1', 146.8),
  ('Monthly','2014-06-1', 469.7),
  ('Monthly','2014-07-1',  82.9),
  ('Monthly','2014-08-1', 636.9),
  ('Monthly','2014-09-1', 520.9),
  ('Monthly','2014-10-1', 217.4),
  ('Monthly','2014-11-1', 776.6),
  ('Monthly','2014-12-1',  18.4),
  ('Monthly','2015-01-1', 376.7),
  ('Monthly','2015-02-1', 266.5),
  ('Monthly','2015-03-1', np.nan),
  ('Monthly','2015-04-1', 144.1), 
  ('Monthly','2015-05-1', 385.0),
  ('Monthly','2015-06-1', 527.1),
  ('Monthly','2015-07-1', 748.5),
  ('Monthly','2015-08-1', 518.2)],
  columns=['Frequency','Date','Value'])

df['Date'] = pd.to_datetime(df['Date'])
df.set_index(['Frequency','Date'],inplace=True)
df

                      Value
Frequency Date
          2014-02-01  529.1
          2014-03-01   67.1
          2014-04-01    NaN
          2014-05-01  146.8
          2014-06-01  469.7
          2014-07-01   82.9
          2014-08-01  636.9
          2014-09-01  520.9
          2014-10-01  217.4
          2014-11-01  776.6
          2014-12-01   18.4
          2015-01-01  376.7
          2015-02-01  266.5
          2015-03-01    NaN
          2015-04-01  144.1
          2015-05-01  385.0
          2015-06-01  527.1
          2015-07-01  748.5
          2015-08-01  518.2

I have tried using pd.Grouper, but the groupby sum skips NaN values, and as far as I can tell Grouper does not enforce time series completeness:

df.groupby(pd.Grouper(level='Date', freq='Q')).sum()

             Value
Date
2014-03-31  1571.2
2014-06-30   616.5
2014-09-30  1240.7
2014-12-31  1012.4
2015-03-31   643.2
2015-06-30  1056.2
2015-09-30  1266.7

What I would like to see is this:

             Value
Date
2014-03-31     NaN  # Because of missing 2014-01-01
2014-06-30     NaN  # Because of NaN in 2014-04-01
2014-09-30  1240.7
2014-12-31  1012.4
2015-03-31     NaN  # Because of NaN in 2015-03-01
2015-06-30  1056.2
2015-09-30     NaN  # Because of missing 2015-09-01

What's a good way to do this?

Answer 1

Answered by CT Zhu

You may want to write your own aggregate function: (1) if the group contains any NaN, return NaN; (2) if the period is too short (fewer than 3 months for a quarter), also return NaN; (3) otherwise, return the sum:

gpy = df.groupby(pd.Grouper(level='Date', freq='Q'))

# NaN if the quarter has any NaN or fewer than 3 rows; otherwise the sum
print(gpy.agg(lambda x: np.nan if (np.isnan(x).any() or len(x) < 3) else x.sum()))

             Value
Date              
2014-03-31     NaN
2014-06-30     NaN
2014-09-30  1240.7
2014-12-31  1012.4
2015-03-31     NaN
2015-06-30  1056.2
2015-09-30     NaN
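
The same rule can be extended to the annual series the question also asks for. The following is only a sketch under stated assumptions: the helper name `complete_sum` and its `months_required` parameter are hypothetical, the data is rebuilt as a plain Series with a monthly DatetimeIndex rather than the MultiIndex frame above, and groups are keyed with `to_period` instead of `pd.Grouper`, so the result is labelled by period (for example `2014Q3`) rather than by quarter-end date.

```python
import numpy as np
import pandas as pd

# Rebuild the question's monthly data as a Series (Feb 2014 through Aug 2015).
dates = pd.date_range('2014-02-01', '2015-08-01', freq='MS')
values = pd.Series(
    [529.1, 67.1, np.nan, 146.8, 469.7, 82.9, 636.9, 520.9, 217.4, 776.6,
     18.4, 376.7, 266.5, np.nan, 144.1, 385.0, 527.1, 748.5, 518.2],
    index=dates, name='Value')


def complete_sum(series, period, months_required):
    """Hypothetical helper: sum per period, NaN unless every month is present and non-NaN."""
    keys = series.index.to_period(period)  # e.g. 'Q' for quarters, 'Y' for years

    def agg(x):
        # Same rule as the lambda above, with the required length as a parameter.
        if x.isna().any() or len(x) < months_required:
            return np.nan
        return x.sum()

    return series.groupby(keys).agg(agg)


quarterly = complete_sum(values, 'Q', 3)
annual = complete_sum(values, 'Y', 12)
print(quarterly)
print(annual)
```

With this data both years come out as NaN, since 2014 is missing January and each year contains a NaN month.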

Answer 2

Answered by unutbu

You could create a boolean mask which is True for each group that does not have exactly 3 non-NaN values (count() ignores NaN):

mask = (df.groupby(pd.Grouper(level='Date', freq='Q'))['Value'].count() != 3).values

and then simply set the corresponding rows to NaN.

grouped = df.groupby(pd.Grouper(level='Date', freq='Q'))
result = grouped.sum()
mask = (grouped['Value'].count() != 3).values
result.loc[mask, 'Value'] = np.nan

yields

             Value
Date              
2014-03-31     NaN
2014-06-30     NaN
2014-09-30  1240.7
2014-12-31  1012.4
2015-03-31     NaN
2015-06-30  1056.2
2015-09-30     NaN
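
A related shortcut, not part of the answer above, is the `min_count` argument of the groupby `sum`: a group with fewer than `min_count` non-NaN values sums to NaN. Because a quarter holds at most 3 months, `min_count=3` should give the same result as the count mask. This is a minimal sketch, assuming a pandas version where `sum` accepts `min_count`, and again using a plain Series keyed with `to_period`, so the labels are periods rather than quarter-end dates.

```python
import numpy as np
import pandas as pd

# The question's monthly data as a Series (Feb 2014 through Aug 2015).
dates = pd.date_range('2014-02-01', '2015-08-01', freq='MS')
s = pd.Series(
    [529.1, 67.1, np.nan, 146.8, 469.7, 82.9, 636.9, 520.9, 217.4, 776.6,
     18.4, 376.7, 266.5, np.nan, 144.1, 385.0, 527.1, 748.5, 518.2],
    index=dates, name='Value')

# Group by calendar quarter; require 3 non-NaN months for a sum to be reported.
quarters = s.index.to_period('Q')
result = s.groupby(quarters).sum(min_count=3)
print(result)
```

For annual totals the same idea would use `to_period('Y')` with `min_count=12`.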
