Python: Resampling Within a Pandas MultiIndex

Note: this page is a translated copy of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not the translator). Original: http://stackoverflow.com/questions/15799162/

Tags: python, pandas, time-series, hierarchical-data

Asked by Snakes McGee

I have some hierarchical data which bottoms out into time series data and looks something like this:

df = pandas.DataFrame(
    {'value_a': values_a, 'value_b': values_b},
    index=[states, cities, dates])
df.index.names = ['State', 'City', 'Date']
df

                               value_a  value_b
State   City       Date                        
Georgia Atlanta    2012-01-01        0       10
                   2012-01-02        1       11
                   2012-01-03        2       12
                   2012-01-04        3       13
        Savanna    2012-01-01        4       14
                   2012-01-02        5       15
                   2012-01-03        6       16
                   2012-01-04        7       17
Alabama Mobile     2012-01-01        8       18
                   2012-01-02        9       19
                   2012-01-03       10       20
                   2012-01-04       11       21
        Montgomery 2012-01-01       12       22
                   2012-01-02       13       23
                   2012-01-03       14       24
                   2012-01-04       15       25

I'd like to perform time resampling per city, so something like

df.resample("2D", how="sum")

would output

                             value_a  value_b
State   City       Date                        
Georgia Atlanta    2012-01-01        1       21
                   2012-01-03        5       25
        Savanna    2012-01-01        9       29
                   2012-01-03       13       33
Alabama Mobile     2012-01-01       17       37
                   2012-01-03       21       41
        Montgomery 2012-01-01       25       45
                   2012-01-03       29       49

As is, df.resample('2D', how='sum') gets me

TypeError: Only valid with DatetimeIndex or PeriodIndex

Fair enough, but I'd sort of expect this to work:

>>> df.swaplevel('Date', 'State').resample('2D', how='sum')
TypeError: Only valid with DatetimeIndex or PeriodIndex

at which point I'm really running out of ideas... is there some way stack and unstack might be able to help me?

Accepted answer by unutbu

pd.Grouper allows you to specify a "groupby instruction for a target object". In particular, you can use it to group by dates even if df.index is not a DatetimeIndex:

df.groupby(pd.Grouper(freq='2D', level=-1))

The level=-1 tells pd.Grouper to look for the dates in the last level of the MultiIndex. Moreover, you can use this in conjunction with other level values from the index:

level_values = df.index.get_level_values
result = (df.groupby([level_values(i) for i in [0,1]]
                      +[pd.Grouper(freq='2D', level=-1)]).sum())
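
As an aside, more recent pandas versions let you refer to index levels by name rather than position when grouping. A sketch of the same grouping under that assumption (older versions required the get_level_values workaround above):

# Equivalent spelling with level names (reasonably modern pandas):
df.groupby(['State', 'City', pd.Grouper(freq='2D', level='Date')]).sum()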

It looks a bit awkward, but using_Grouper turns out to be much faster than my original suggestion, using_reset_index:

import numpy as np
import pandas as pd
import datetime as DT

def using_Grouper(df):
    level_values = df.index.get_level_values
    return (df.groupby([level_values(i) for i in [0,1]]
                       +[pd.Grouper(freq='2D', level=-1)]).sum())

def using_reset_index(df):
    df = df.reset_index(level=[0, 1])
    return df.groupby(['State','City']).resample('2D').sum()

def using_stack(df):
    # http://stackoverflow.com/a/15813787/190597
    return (df.unstack(level=[0,1])
              .resample('2D').sum()
              .stack(level=[2,1])
              .swaplevel(2,0))

def make_orig():
    values_a = range(16)
    values_b = range(10, 26)
    states = ['Georgia']*8 + ['Alabama']*8
    cities = ['Atlanta']*4 + ['Savanna']*4 + ['Mobile']*4 + ['Montgomery']*4
    dates = pd.DatetimeIndex([DT.date(2012,1,1)+DT.timedelta(days = i) for i in range(4)]*4)
    df = pd.DataFrame(
        {'value_a': values_a, 'value_b': values_b},
        index = [states, cities, dates])
    df.index.names = ['State', 'City', 'Date']
    return df

def make_df(N):
    dates = pd.date_range('2000-1-1', periods=N)
    states = np.arange(50)
    cities = np.arange(10)
    index = pd.MultiIndex.from_product([states, cities, dates], 
                                       names=['State', 'City', 'Date'])
    df = pd.DataFrame(np.random.randint(10, size=(len(index),2)), index=index,
                      columns=['value_a', 'value_b'])
    return df

df = make_orig()
print(using_Grouper(df))

yields

                               value_a  value_b
State   City       Date                        
Alabama Mobile     2012-01-01       17       37
                   2012-01-03       21       41
        Montgomery 2012-01-01       25       45
                   2012-01-03       29       49
Georgia Atlanta    2012-01-01        1       21
                   2012-01-03        5       25
        Savanna    2012-01-01        9       29
                   2012-01-03       13       33


Here is a benchmark comparing using_Grouper, using_reset_index, and using_stack on a 5000-row DataFrame:

In [30]: df = make_df(10)

In [34]: len(df)
Out[34]: 5000

In [32]: %timeit using_Grouper(df)
100 loops, best of 3: 6.03 ms per loop

In [33]: %timeit using_stack(df)
10 loops, best of 3: 22.3 ms per loop

In [31]: %timeit using_reset_index(df)
1 loop, best of 3: 659 ms per loop

Answered by user1827356

An alternative using stack/unstack

df.unstack(level=[0,1]).resample('2D', how='sum').stack(level=[2,1]).swaplevel(2,0)

                               value_a  value_b
State   City       Date
Georgia Atlanta    2012-01-01        1       21
Alabama Mobile     2012-01-01       17       37
        Montgomery 2012-01-01       25       45
Georgia Savanna    2012-01-01        9       29
        Atlanta    2012-01-03        5       25
Alabama Mobile     2012-01-03       21       41
        Montgomery 2012-01-03       29       49
Georgia Savanna    2012-01-03       13       33

Notes:

  1. No idea about performance comparison
  2. Possible pandas bug - stack(level=[2,1]) worked, but stack(level=[1,2]) failed
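
A note on newer pandas: the how= keyword was deprecated in pandas 0.18.0 and removed in later versions, so this answer (like the other how='sum' examples on this page) would spell the aggregation as a method call instead. A sketch of the equivalent:

# pandas >= 0.18 spelling, with the aggregation as a method call:
df.unstack(level=[0,1]).resample('2D').sum().stack(level=[2,1]).swaplevel(2,0)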

Answered by Kamil Sindi

This works:

df.groupby(level=[0,1]).apply(lambda x: x.set_index('Date').resample('2D', how='sum'))

                               value_a  value_b
State   City       Date
Alabama Mobile     2012-01-01       17       37
                   2012-01-03       21       41
        Montgomery 2012-01-01       25       45
                   2012-01-03       29       49
Georgia Atlanta    2012-01-01        1       21
                   2012-01-03        5       25
        Savanna    2012-01-01        9       29
                   2012-01-03       13       33

If the Date column contains strings, convert to datetime beforehand:

df['Date'] = pd.to_datetime(df['Date'])
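
Under the same how= caveat noted above (the keyword was removed in later pandas versions), the modern spelling of this answer would be the sketch below; like the original, it assumes Date is a column while State and City are index levels:

# pandas >= 0.18 spelling of the same approach:
df.groupby(level=[0, 1]).apply(lambda x: x.set_index('Date').resample('2D').sum())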

Answered by Geoff

I know this question is a few years old, but I had the same problem and came to a simpler solution that requires one line:

>>> import pandas as pd
>>> ts = pd.read_pickle('time_series.pickle')
>>> ts
xxxxxx1  yyyyyyyyyyyyyyyyyyyyyy1  2012-07-01     1
                                  2012-07-02    13
                                  2012-07-03     1
                                  2012-07-04     1
                                  2012-07-05    10
                                  2012-07-06     4
                                  2012-07-07    47
                                  2012-07-08     0
                                  2012-07-09     3
                                  2012-07-10    22
                                  2012-07-11     3
                                  2012-07-12     0
                                  2012-07-13    22
                                  2012-07-14     1
                                  2012-07-15     2
                                  2012-07-16     2
                                  2012-07-17     8
                                  2012-07-18     0
                                  2012-07-19     1
                                  2012-07-20    10
                                  2012-07-21     0
                                  2012-07-22     3
                                  2012-07-23     0
                                  2012-07-24    35
                                  2012-07-25     6
                                  2012-07-26     1
                                  2012-07-27     0
                                  2012-07-28     6
                                  2012-07-29    23
                                  2012-07-30     0
                                                ..
xxxxxxN  yyyyyyyyyyyyyyyyyyyyyyN  2014-06-02     0
                                  2014-06-03     1
                                  2014-06-04     0
                                  2014-06-05     0
                                  2014-06-06     0
                                  2014-06-07     0
                                  2014-06-08     2
                                  2014-06-09     0
                                  2014-06-10     0
                                  2014-06-11     0
                                  2014-06-12     0
                                  2014-06-13     0
                                  2014-06-14     0
                                  2014-06-15     0
                                  2014-06-16     0
                                  2014-06-17     0
                                  2014-06-18     0
                                  2014-06-19     0
                                  2014-06-20     0
                                  2014-06-21     0
                                  2014-06-22     0
                                  2014-06-23     0
                                  2014-06-24     0
                                  2014-06-25     4
                                  2014-06-26     0
                                  2014-06-27     1
                                  2014-06-28     0
                                  2014-06-29     0
                                  2014-06-30     1
                                  2014-07-01     0
dtype: int64
>>> ts.unstack().T.resample('W', how='sum').T.stack()
xxxxxx1  yyyyyyyyyyyyyyyyyyyyyy1  2012-06-25/2012-07-01      1
                                  2012-07-02/2012-07-08     76
                                  2012-07-09/2012-07-15     53
                                  2012-07-16/2012-07-22     24
                                  2012-07-23/2012-07-29     71
                                  2012-07-30/2012-08-05     38
                                  2012-08-06/2012-08-12    258
                                  2012-08-13/2012-08-19    144
                                  2012-08-20/2012-08-26    184
                                  2012-08-27/2012-09-02    323
                                  2012-09-03/2012-09-09    198
                                  2012-09-10/2012-09-16    348
                                  2012-09-17/2012-09-23    404
                                  2012-09-24/2012-09-30    380
                                  2012-10-01/2012-10-07    367
                                  2012-10-08/2012-10-14    163
                                  2012-10-15/2012-10-21    338
                                  2012-10-22/2012-10-28    252
                                  2012-10-29/2012-11-04    197
                                  2012-11-05/2012-11-11    336
                                  2012-11-12/2012-11-18    234
                                  2012-11-19/2012-11-25    143
                                  2012-11-26/2012-12-02    204
                                  2012-12-03/2012-12-09    296
                                  2012-12-10/2012-12-16    146
                                  2012-12-17/2012-12-23     85
                                  2012-12-24/2012-12-30    198
                                  2012-12-31/2013-01-06    214
                                  2013-01-07/2013-01-13    229
                                  2013-01-14/2013-01-20    192
                                                          ...
xxxxxxN  yyyyyyyyyyyyyyyyyyyyyyN  2013-12-09/2013-12-15      3
                                  2013-12-16/2013-12-22      0
                                  2013-12-23/2013-12-29      0
                                  2013-12-30/2014-01-05      1
                                  2014-01-06/2014-01-12      3
                                  2014-01-13/2014-01-19      6
                                  2014-01-20/2014-01-26     11
                                  2014-01-27/2014-02-02      0
                                  2014-02-03/2014-02-09      1
                                  2014-02-10/2014-02-16      4
                                  2014-02-17/2014-02-23      3
                                  2014-02-24/2014-03-02      1
                                  2014-03-03/2014-03-09      4
                                  2014-03-10/2014-03-16      0
                                  2014-03-17/2014-03-23      0
                                  2014-03-24/2014-03-30      9
                                  2014-03-31/2014-04-06      1
                                  2014-04-07/2014-04-13      1
                                  2014-04-14/2014-04-20      1
                                  2014-04-21/2014-04-27      2
                                  2014-04-28/2014-05-04      8
                                  2014-05-05/2014-05-11      7
                                  2014-05-12/2014-05-18      5
                                  2014-05-19/2014-05-25      2
                                  2014-05-26/2014-06-01      8
                                  2014-06-02/2014-06-08      3
                                  2014-06-09/2014-06-15      0
                                  2014-06-16/2014-06-22      0
                                  2014-06-23/2014-06-29      5
                                  2014-06-30/2014-07-06      1
dtype: int64

ts.unstack().T.resample('W', how='sum').T.stack() is all it took! Very easy and seems quite performant. The pickle I'm reading in is 331M, so this is a pretty beefy data structure; the resampling takes just a couple of seconds on my MacBook Pro.
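
On pandas versions where how= is no longer accepted, the equivalent one-liner would presumably be:

# pandas >= 0.18 spelling of the same idea:
ts.unstack().T.resample('W').sum().T.stack()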

Answered by Josh D

I had the same issue and was breaking my head over it for a while, but then I read the documentation of the .resample function in the 0.19.2 docs, and I saw there's a new kwarg called "level" that you can use to specify a level in a MultiIndex.

Edit: more details in the "What's New" section.
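
Against the example DataFrame from the question, a minimal sketch of that keyword (with one caveat: level= alone resamples only the Date level, aggregating across State and City; to keep the other levels you still need groupby, as in the accepted answer):

# Resample directly on a named MultiIndex level (pandas >= 0.19):
# Caveat: State and City are collapsed into the date bins.
df.resample('2D', level='Date').sum()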

Answered by LondonRob

I haven't checked the efficiency of this, but my instinctive way of performing datetime operations on a MultiIndex is a kind of manual "split-apply-combine" process using a dictionary comprehension.

Assuming your DataFrame is unindexed (you can do .reset_index() first), this works as follows:

  1. Group by the non-date columns
  2. Set "Date" as index and resample each chunk
  3. Reassemble using pd.concat

The final code looks like:

# assuming df has been reset_index()-ed, so State, City and Date are columns
pd.concat({g: x.set_index("Date").resample("2D").mean()
           for g, x in df.groupby(["State", "City"])})

Answered by fpersyn

You need the groupby() method, providing it with a pd.Grouper for each level of your MultiIndex that you wish to maintain in the resulting DataFrame. You can then apply an operation of choice.

To resample date or timestamp levels, you need to set the freq argument with the frequency of choice; a similar approach using pd.TimeGrouper() is deprecated in favour of pd.Grouper() with the freq argument set.

This should give you the DataFrame you need:

df.groupby(
    [pd.Grouper(level='State'),
     pd.Grouper(level='City'),
     pd.Grouper(level='Date', freq='2D')]
).sum()

The Time Series Guide in the pandas documentation describes resample() as "a time-based groupby, followed by a reduction method on each of its groups". Hence, using groupby() should technically be the same operation as using .resample() on a DataFrame with a single index.

The same paragraph points to the cookbook section on resampling for more advanced examples, where the 'Grouping using a MultiIndex' entry is highly relevant to this question. Hope that helps.