Python: Resampling Within a Pandas MultiIndex

Note: this page is a translated copy of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not the translator). Original: http://stackoverflow.com/questions/15799162/

Tags: python, pandas, time-series, hierarchical-data

Asked by Snakes McGee

I have some hierarchical data which bottoms out into time series data and looks something like this:

df = pandas.DataFrame(
    {'value_a': values_a, 'value_b': values_b},
    index=[states, cities, dates])
df.index.names = ['State', 'City', 'Date']
df

                               value_a  value_b
State   City       Date                        
Georgia Atlanta    2012-01-01        0       10
                   2012-01-02        1       11
                   2012-01-03        2       12
                   2012-01-04        3       13
        Savanna    2012-01-01        4       14
                   2012-01-02        5       15
                   2012-01-03        6       16
                   2012-01-04        7       17
Alabama Mobile     2012-01-01        8       18
                   2012-01-02        9       19
                   2012-01-03       10       20
                   2012-01-04       11       21
        Montgomery 2012-01-01       12       22
                   2012-01-02       13       23
                   2012-01-03       14       24
                   2012-01-04       15       25

I'd like to perform time resampling per city, so something like

df.resample("2D", how="sum")

would output

                             value_a  value_b
State   City       Date                        
Georgia Atlanta    2012-01-01        1       21
                   2012-01-03        5       25
        Savanna    2012-01-01        9       29
                   2012-01-03       13       33
Alabama Mobile     2012-01-01       17       37
                   2012-01-03       21       41
        Montgomery 2012-01-01       25       45
                   2012-01-03       29       49

As is, df.resample('2D', how='sum') gets me

TypeError: Only valid with DatetimeIndex or PeriodIndex

Fair enough, but I'd sort of expect this to work:

>>> df.swaplevel('Date', 'State').resample('2D', how='sum')
TypeError: Only valid with DatetimeIndex or PeriodIndex

at which point I'm really running out of ideas... is there some way stack and unstack might be able to help me?

Accepted answer by unutbu

pd.Grouper allows you to specify a "groupby instruction for a target object". In particular, you can use it to group by dates even if df.index is not a DatetimeIndex:

df.groupby(pd.Grouper(freq='2D', level=-1))

The level=-1 tells pd.Grouper to look for the dates in the last level of the MultiIndex. Moreover, you can use this in conjunction with other level values from the index:

level_values = df.index.get_level_values
result = (df.groupby([level_values(i) for i in [0,1]]
                      +[pd.Grouper(freq='2D', level=-1)]).sum())
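
As an aside, more recent pandas versions let you refer to index levels by name rather than position when grouping. A sketch of the same grouping under that assumption (older versions required the get_level_values workaround above):

# Equivalent spelling with level names (reasonably modern pandas):
df.groupby(['State', 'City', pd.Grouper(freq='2D', level='Date')]).sum()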

It looks a bit awkward, but using_Grouper turns out to be much faster than my original suggestion, using_reset_index:

import numpy as np
import pandas as pd
import datetime as DT

def using_Grouper(df):
    level_values = df.index.get_level_values
    return (df.groupby([level_values(i) for i in [0,1]]
                       +[pd.Grouper(freq='2D', level=-1)]).sum())

def using_reset_index(df):
    df = df.reset_index(level=[0, 1])
    return df.groupby(['State','City']).resample('2D').sum()

def using_stack(df):
    # http://stackoverflow.com/a/15813787/190597
    return (df.unstack(level=[0,1])
              .resample('2D').sum()
              .stack(level=[2,1])
              .swaplevel(2,0))

def make_orig():
    values_a = range(16)
    values_b = range(10, 26)
    states = ['Georgia']*8 + ['Alabama']*8
    cities = ['Atlanta']*4 + ['Savanna']*4 + ['Mobile']*4 + ['Montgomery']*4
    dates = pd.DatetimeIndex([DT.date(2012,1,1)+DT.timedelta(days = i) for i in range(4)]*4)
    df = pd.DataFrame(
        {'value_a': values_a, 'value_b': values_b},
        index = [states, cities, dates])
    df.index.names = ['State', 'City', 'Date']
    return df

def make_df(N):
    dates = pd.date_range('2000-1-1', periods=N)
    states = np.arange(50)
    cities = np.arange(10)
    index = pd.MultiIndex.from_product([states, cities, dates], 
                                       names=['State', 'City', 'Date'])
    df = pd.DataFrame(np.random.randint(10, size=(len(index),2)), index=index,
                      columns=['value_a', 'value_b'])
    return df

df = make_orig()
print(using_Grouper(df))

yields

                               value_a  value_b
State   City       Date                        
Alabama Mobile     2012-01-01       17       37
                   2012-01-03       21       41
        Montgomery 2012-01-01       25       45
                   2012-01-03       29       49
Georgia Atlanta    2012-01-01        1       21
                   2012-01-03        5       25
        Savanna    2012-01-01        9       29
                   2012-01-03       13       33


Here is a benchmark comparing using_Grouper, using_reset_index, and using_stack on a 5000-row DataFrame:

In [30]: df = make_df(10)

In [34]: len(df)
Out[34]: 5000

In [32]: %timeit using_Grouper(df)
100 loops, best of 3: 6.03 ms per loop

In [33]: %timeit using_stack(df)
10 loops, best of 3: 22.3 ms per loop

In [31]: %timeit using_reset_index(df)
1 loop, best of 3: 659 ms per loop

Answered by user1827356

An alternative using stack/unstack

df.unstack(level=[0,1]).resample('2D', how='sum').stack(level=[2,1]).swaplevel(2,0)

                               value_a  value_b
State   City       Date
Georgia Atlanta    2012-01-01        1       21
Alabama Mobile     2012-01-01       17       37
        Montgomery 2012-01-01       25       45
Georgia Savanna    2012-01-01        9       29
        Atlanta    2012-01-03        5       25
Alabama Mobile     2012-01-03       21       41
        Montgomery 2012-01-03       29       49
Georgia Savanna    2012-01-03       13       33

Notes:

  1. No idea about performance comparison
  2. Possible pandas bug - stack(level=[2,1]) worked, but stack(level=[1,2]) failed
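
A note on newer pandas: the how= keyword was deprecated in pandas 0.18.0 and removed in later versions, so this answer (like the other how='sum' examples on this page) would spell the aggregation as a method call instead. A sketch of the equivalent:

# pandas >= 0.18 spelling, with the aggregation as a method call:
df.unstack(level=[0,1]).resample('2D').sum().stack(level=[2,1]).swaplevel(2,0)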

Answered by Kamil Sindi

This works:

df.groupby(level=[0,1]).apply(lambda x: x.set_index('Date').resample('2D', how='sum'))

                               value_a  value_b
State   City       Date
Alabama Mobile     2012-01-01       17       37
                   2012-01-03       21       41
        Montgomery 2012-01-01       25       45
                   2012-01-03       29       49
Georgia Atlanta    2012-01-01        1       21
                   2012-01-03        5       25
        Savanna    2012-01-01        9       29
                   2012-01-03       13       33

If the Date column contains strings, convert to datetime beforehand:

df['Date'] = pd.to_datetime(df['Date'])
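
Under the same how= caveat noted above (the keyword was removed in later pandas versions), the modern spelling of this answer would be the sketch below; like the original, it assumes Date is a column while State and City are index levels:

# pandas >= 0.18 spelling of the same approach:
df.groupby(level=[0, 1]).apply(lambda x: x.set_index('Date').resample('2D').sum())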

Answered by Geoff

I know this question is a few years old, but I had the same problem and came to a simpler solution that requires one line:

>>> import pandas as pd
>>> ts = pd.read_pickle('time_series.pickle')
>>> ts
xxxxxx1  yyyyyyyyyyyyyyyyyyyyyy1  2012-07-01     1
                                  2012-07-02    13
                                  2012-07-03     1
                                  2012-07-04     1
                                  2012-07-05    10
                                  2012-07-06     4
                                  2012-07-07    47
                                  2012-07-08     0
                                  2012-07-09     3
                                  2012-07-10    22
                                  2012-07-11     3
                                  2012-07-12     0
                                  2012-07-13    22
                                  2012-07-14     1
                                  2012-07-15     2
                                  2012-07-16     2
                                  2012-07-17     8
                                  2012-07-18     0
                                  2012-07-19     1
                                  2012-07-20    10
                                  2012-07-21     0
                                  2012-07-22     3
                                  2012-07-23     0
                                  2012-07-24    35
                                  2012-07-25     6
                                  2012-07-26     1
                                  2012-07-27     0
                                  2012-07-28     6
                                  2012-07-29    23
                                  2012-07-30     0
                                                ..
xxxxxxN  yyyyyyyyyyyyyyyyyyyyyyN  2014-06-02     0
                                  2014-06-03     1
                                  2014-06-04     0
                                  2014-06-05     0
                                  2014-06-06     0
                                  2014-06-07     0
                                  2014-06-08     2
                                  2014-06-09     0
                                  2014-06-10     0
                                  2014-06-11     0
                                  2014-06-12     0
                                  2014-06-13     0
                                  2014-06-14     0
                                  2014-06-15     0
                                  2014-06-16     0
                                  2014-06-17     0
                                  2014-06-18     0
                                  2014-06-19     0
                                  2014-06-20     0
                                  2014-06-21     0
                                  2014-06-22     0
                                  2014-06-23     0
                                  2014-06-24     0
                                  2014-06-25     4
                                  2014-06-26     0
                                  2014-06-27     1
                                  2014-06-28     0
                                  2014-06-29     0
                                  2014-06-30     1
                                  2014-07-01     0
dtype: int64
>>> ts.unstack().T.resample('W', how='sum').T.stack()
xxxxxx1  yyyyyyyyyyyyyyyyyyyyyy1  2012-06-25/2012-07-01      1
                                  2012-07-02/2012-07-08     76
                                  2012-07-09/2012-07-15     53
                                  2012-07-16/2012-07-22     24
                                  2012-07-23/2012-07-29     71
                                  2012-07-30/2012-08-05     38
                                  2012-08-06/2012-08-12    258
                                  2012-08-13/2012-08-19    144
                                  2012-08-20/2012-08-26    184
                                  2012-08-27/2012-09-02    323
                                  2012-09-03/2012-09-09    198
                                  2012-09-10/2012-09-16    348
                                  2012-09-17/2012-09-23    404
                                  2012-09-24/2012-09-30    380
                                  2012-10-01/2012-10-07    367
                                  2012-10-08/2012-10-14    163
                                  2012-10-15/2012-10-21    338
                                  2012-10-22/2012-10-28    252
                                  2012-10-29/2012-11-04    197
                                  2012-11-05/2012-11-11    336
                                  2012-11-12/2012-11-18    234
                                  2012-11-19/2012-11-25    143
                                  2012-11-26/2012-12-02    204
                                  2012-12-03/2012-12-09    296
                                  2012-12-10/2012-12-16    146
                                  2012-12-17/2012-12-23     85
                                  2012-12-24/2012-12-30    198
                                  2012-12-31/2013-01-06    214
                                  2013-01-07/2013-01-13    229
                                  2013-01-14/2013-01-20    192
                                                          ...
xxxxxxN  yyyyyyyyyyyyyyyyyyyyyyN  2013-12-09/2013-12-15      3
                                  2013-12-16/2013-12-22      0
                                  2013-12-23/2013-12-29      0
                                  2013-12-30/2014-01-05      1
                                  2014-01-06/2014-01-12      3
                                  2014-01-13/2014-01-19      6
                                  2014-01-20/2014-01-26     11
                                  2014-01-27/2014-02-02      0
                                  2014-02-03/2014-02-09      1
                                  2014-02-10/2014-02-16      4
                                  2014-02-17/2014-02-23      3
                                  2014-02-24/2014-03-02      1
                                  2014-03-03/2014-03-09      4
                                  2014-03-10/2014-03-16      0
                                  2014-03-17/2014-03-23      0
                                  2014-03-24/2014-03-30      9
                                  2014-03-31/2014-04-06      1
                                  2014-04-07/2014-04-13      1
                                  2014-04-14/2014-04-20      1
                                  2014-04-21/2014-04-27      2
                                  2014-04-28/2014-05-04      8
                                  2014-05-05/2014-05-11      7
                                  2014-05-12/2014-05-18      5
                                  2014-05-19/2014-05-25      2
                                  2014-05-26/2014-06-01      8
                                  2014-06-02/2014-06-08      3
                                  2014-06-09/2014-06-15      0
                                  2014-06-16/2014-06-22      0
                                  2014-06-23/2014-06-29      5
                                  2014-06-30/2014-07-06      1
dtype: int64

ts.unstack().T.resample('W', how='sum').T.stack() is all it took! Very easy and seems quite performant. The pickle I'm reading in is 331M, so this is a pretty beefy data structure; the resampling takes just a couple of seconds on my MacBook Pro.
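
On pandas versions where how= is no longer accepted, the equivalent one-liner would presumably be:

# pandas >= 0.18 spelling of the same idea:
ts.unstack().T.resample('W').sum().T.stack()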

Answered by Josh D

I had the same issue and was breaking my head over it for a while, but then I read the documentation of the .resample function in the 0.19.2 docs, and I saw there's a new kwarg called "level" that you can use to specify a level in a MultiIndex.

Edit: more details in the "What's New" section.
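
Against the example DataFrame from the question, a minimal sketch of that keyword (with one caveat: level= alone resamples only the Date level, aggregating across State and City; to keep the other levels you still need groupby, as in the accepted answer):

# Resample directly on a named MultiIndex level (pandas >= 0.19):
# Caveat: State and City are collapsed into the date bins.
df.resample('2D', level='Date').sum()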

Answered by LondonRob

I haven't checked the efficiency of this, but my instinctive way of performing datetime operations on a MultiIndex is a kind of manual "split-apply-combine" process using a dictionary comprehension.

Assuming your DataFrame is unindexed (you can do .reset_index() first), this works as follows:

  1. Group by the non-date columns
  2. Set "Date" as index and resample each chunk
  3. Reassemble using pd.concat

The final code looks like:

# assuming df has been reset_index()-ed, so State, City and Date are columns
pd.concat({g: x.set_index("Date").resample("2D").mean()
           for g, x in df.groupby(["State", "City"])})

Answered by fpersyn

You need the groupby() method, providing it with a pd.Grouper for each level of your MultiIndex that you wish to maintain in the resulting DataFrame. You can then apply an operation of choice.

To resample date or timestamp levels, you need to set the freq argument with the frequency of choice; a similar approach using pd.TimeGrouper() is deprecated in favour of pd.Grouper() with the freq argument set.

This should give you the DataFrame you need:

df.groupby(
    [pd.Grouper(level='State'),
     pd.Grouper(level='City'),
     pd.Grouper(level='Date', freq='2D')]
).sum()

The Time Series Guide in the pandas documentation describes resample() as "a time-based groupby, followed by a reduction method on each of its groups". Hence, using groupby() should technically be the same operation as using .resample() on a DataFrame with a single index.

The same paragraph points to the cookbook section on resampling for more advanced examples, where the 'Grouping using a MultiIndex' entry is highly relevant to this question. Hope that helps.