Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/27062652/

Time: 2020-09-13 22:41:36  Source: igfitidea

pandas rolling sum of last five minutes

python pandas sum time-series

Asked by Sajith Dilshan

Assume I have the data frame below:

Date, A
2014-11-21 11:00:00, 1
2014-11-21 11:03:00, 4
2014-11-21 11:04:00, 1
2014-11-21 11:05:00, 2
2014-11-21 11:07:00, 4
2014-11-21 11:08:00, 1
2014-11-21 11:12:00, 1
2014-11-21 11:13:00, 2

The first column is a datetime object and the second column is an integer. What I want is to calculate, for each row, the sum of column 'A' over the last five minutes.

For example, for the row 2014-11-21 11:12:00, 1 the sum of column 'A' would be 2 (1+1), and for the row 2014-11-21 11:05:00, 2 it would be 7 (2+1+4). The important thing is that the number of past rows in the time window (5 minutes) is not the same for each row, because the time series is irregular.

How can I get the sum of last five minutes for column 'A' using the rolling_sum method in pandas? Thanks in advance.

Answered by unutbu

In general, if the dates are completely arbitrary, I think you would be forced to use a Python for-loop over the rows or use df.apply (which, under the hood, also uses a Python loop).
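
For completely arbitrary timestamps, the explicit loop mentioned above could look like the following minimal sketch. It rebuilds the sample data inline (an assumption made so the snippet runs on its own) and treats the window as the half-open interval (t - 5 min, t], which matches the sums expected in the question. It is quadratic in the number of rows, so it is meant only to illustrate the idea.

```python
import pandas as pd

# Sample data from the question, rebuilt inline for a self-contained example.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2014-11-21 11:00:00", "2014-11-21 11:03:00",
        "2014-11-21 11:04:00", "2014-11-21 11:05:00",
        "2014-11-21 11:07:00", "2014-11-21 11:08:00",
        "2014-11-21 11:12:00", "2014-11-21 11:13:00",
    ]),
    "A": [1, 4, 1, 2, 4, 1, 1, 2],
})

window = pd.Timedelta(minutes=5)
sums = []
for t in df["Date"]:
    # Select rows whose timestamp lies in the interval (t - 5 min, t].
    in_window = (df["Date"] > t - window) & (df["Date"] <= t)
    sums.append(int(df.loc[in_window, "A"].sum()))

df["rolling_sum"] = sums
print(df)
```

For the sample data this gives the sums 1, 5, 6, 7, 11, 8, 2, 3.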

However, if your dates share a common frequency, as is the case above, then there is a trick which should be much quicker than using df.apply: expand the time series according to the common frequency (in this case, 1 minute), fill in the NaNs with zeros, and then call rolling_sum:

In [279]: pd.rolling_sum(df.set_index(['Date']).asfreq('1T').fillna(0), window=5, min_periods=1).reindex(df['Date'])
Out[279]: 
                      A
Date                   
2014-11-21 11:00:00   1
2014-11-21 11:03:00   5
2014-11-21 11:04:00   6
2014-11-21 11:05:00   7
2014-11-21 11:07:00  11
2014-11-21 11:08:00   8
2014-11-21 11:12:00   2
2014-11-21 11:13:00   3
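
Note that pd.rolling_sum belongs to an older pandas API and was removed from later releases. As a sketch, assuming a recent pandas version, the same trick can be written with the .rolling(...) method, and an offset-based window such as '5min' on a sorted DatetimeIndex avoids the expansion step altogether. The inline data frame and the variable names below are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data from the question, rebuilt inline for a self-contained example.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2014-11-21 11:00:00", "2014-11-21 11:03:00",
        "2014-11-21 11:04:00", "2014-11-21 11:05:00",
        "2014-11-21 11:07:00", "2014-11-21 11:08:00",
        "2014-11-21 11:12:00", "2014-11-21 11:13:00",
    ]),
    "A": [1, 4, 1, 2, 4, 1, 1, 2],
})

# Method-style equivalent of the asfreq trick: expand to a 1-minute grid,
# fill the gaps with zeros, sum over 5 rows, then keep the original rows.
expanded = df.set_index("Date").asfreq("1min").fillna(0)
via_asfreq = expanded.rolling(window=5, min_periods=1).sum().reindex(df["Date"])

# Offset-based window: each row sums the values in (t - 5 min, t].
via_offset = df.set_index("Date")["A"].rolling("5min").sum()

asfreq_sums = via_asfreq["A"].round().astype(int).tolist()
offset_sums = via_offset.round().astype(int).tolist()
print(asfreq_sums)
print(offset_sums)
```

Both lists should agree with the output shown above; the offset-based form also works when the timestamps do not share a common frequency.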

Of course, any time series has a common frequency if you are willing to accept a small enough granularity, but the required size of df.asfreq(...) may make this trick impractical.
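
To make the size concern concrete, here is a small sketch with made-up data: two observations a little over a day apart whose only common frequency is one second. The timestamps and values are hypothetical and chosen only to show how quickly the expanded series grows.

```python
import pandas as pd

# Hypothetical series: two observations, one day and one second apart.
idx = pd.to_datetime(["2014-11-21 11:00:00", "2014-11-22 11:00:01"])
s = pd.Series([1, 2], index=idx)

# Expanding to the common 1-second grid inserts a NaN row for every missing second.
expanded = s.asfreq("1s")

print(len(s), "rows before,", len(expanded), "rows after asfreq")
print(int(expanded.notna().sum()), "of them carry real data")
```

Two rows turn into tens of thousands, almost all of them filler, which is why the trick only pays off when the data is reasonably dense at its common frequency.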



Here is an example of the more general approach using df.apply. Note that calling searchsorted relies on df['Date'] being in sorted order.

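import numpy as np
import pandas as pd

# A regex separator needs the Python parsing engine; the raw string keeps \s intact.
df = pd.read_csv('data', parse_dates=[0], sep=r',\s*', engine='python')

# For each row, find the first row whose Date is later than (Date - 5 minutes).
start_dates = df['Date'] - pd.Timedelta(minutes=5)
df['start_index'] = df['Date'].values.searchsorted(start_dates, side='right')
# The window ends at the current row (inclusive).
df['end_index'] = np.arange(len(df))

def sum_window(row):
    # Sum column A over the positional slice [start_index, end_index].
    return df['A'].iloc[row['start_index']:row['end_index']+1].sum()

df['rolling_sum'] = df.apply(sum_window, axis=1)

print(df[['Date', 'A', 'rolling_sum']])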

yields

                 Date  A  rolling_sum
0 2014-11-21 11:00:00  1            1
1 2014-11-21 11:03:00  4            5
2 2014-11-21 11:04:00  1            6
3 2014-11-21 11:05:00  2            7
4 2014-11-21 11:07:00  4           11
5 2014-11-21 11:08:00  1            8
6 2014-11-21 11:12:00  1            2
7 2014-11-21 11:13:00  2            3
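
The same start and end positions can also be turned into sums without a per-row Python call. The following is a sketch of one possible alternative, not part of the original answer: it combines searchsorted with a prefix sum, so each window sum becomes a difference of two cumulative totals. Like the version above, it assumes the dates are sorted; the inline data and names are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question, rebuilt inline for a self-contained example.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2014-11-21 11:00:00", "2014-11-21 11:03:00",
        "2014-11-21 11:04:00", "2014-11-21 11:05:00",
        "2014-11-21 11:07:00", "2014-11-21 11:08:00",
        "2014-11-21 11:12:00", "2014-11-21 11:13:00",
    ]),
    "A": [1, 4, 1, 2, 4, 1, 1, 2],
})

dates = df["Date"].to_numpy()    # must be sorted in ascending order
values = df["A"].to_numpy()

# Position of the first row strictly later than (t - 5 minutes), for every row t.
start = np.searchsorted(dates, dates - np.timedelta64(5, "m"), side="right")

# prefix[i] is the sum of the first i values, so a slice sum is a difference.
prefix = np.concatenate(([0], np.cumsum(values)))
end = np.arange(len(df)) + 1     # exclusive end, so the current row is included

df["rolling_sum"] = prefix[end] - prefix[start]
print(df)
```

For the sample data this reproduces the rolling_sum column shown above.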


Here is a benchmark comparing the df.asfreq trick versus calling df.apply:

import numpy as np
import pandas as pd
df = pd.read_csv('data', parse_dates=[0], sep=r',\s*', engine='python')

def big_df(df):
    df = df.copy()
    for i in range(7):
        dates = df['Date'] + pd.Timedelta(df.iloc[-1]['Date']-df.iloc[0]['Date']) + pd.Timedelta('1 minute')
        df2 = pd.DataFrame({'Date': dates, 'A': df['A']})
        df = pd.concat([df, df2])
    df = df.reset_index(drop=True)
    return df

def using_apply():
    start_dates = df['Date'] - pd.Timedelta(minutes=5)
    df['start_index'] = df['Date'].values.searchsorted(start_dates, side='right')
    df['end_index'] = np.arange(len(df))

    def sum_window(row):
        return df['A'].iloc[row['start_index']:row['end_index']+1].sum()

    df['rolling_sum'] = df.apply(sum_window, axis=1)
    return df[['Date', 'rolling_sum']]

def using_asfreq():
    result = (pd.rolling_sum(
        df.set_index(['Date']).asfreq('1T').fillna(0), 
        window=5, min_periods=1).reindex(df['Date']))
    return result


In [364]: df = big_df(df)

In [367]: %timeit using_asfreq()
1000 loops, best of 3: 1.21 ms per loop

In [368]: %timeit using_apply()
1 loops, best of 3: 208 ms per loop
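
Outside IPython, a comparison along the same lines can be run with the standard timeit module. The sketch below is an adaptation rather than the original benchmark: it uses the method-style .rolling(...) API, builds the 1024-row frame by tiling the sample 128 times (the same size big_df produces), and checks that both approaches agree before timing them. Function names and the repeat count are illustrative, and absolute timings will differ by machine and pandas version.

```python
import timeit
import numpy as np
import pandas as pd

base = pd.DataFrame({
    "Date": pd.to_datetime([
        "2014-11-21 11:00:00", "2014-11-21 11:03:00",
        "2014-11-21 11:04:00", "2014-11-21 11:05:00",
        "2014-11-21 11:07:00", "2014-11-21 11:08:00",
        "2014-11-21 11:12:00", "2014-11-21 11:13:00",
    ]),
    "A": [1, 4, 1, 2, 4, 1, 1, 2],
})

# Tile the sample end to end: each copy is shifted by the sample's span plus one minute.
span = base["Date"].iloc[-1] - base["Date"].iloc[0] + pd.Timedelta(minutes=1)
df = pd.concat(
    [base.assign(Date=base["Date"] + k * span) for k in range(128)],
    ignore_index=True,
)

def using_asfreq(df):
    # Expand to a 1-minute grid, zero-fill, and sum over 5 consecutive rows.
    expanded = df.set_index("Date").asfreq("1min").fillna(0)
    return expanded["A"].rolling(window=5, min_periods=1).sum().reindex(df["Date"])

def using_apply(df):
    # Locate each window with searchsorted, then sum it row by row via apply.
    start = df["Date"].to_numpy().searchsorted(
        (df["Date"] - pd.Timedelta(minutes=5)).to_numpy(), side="right")
    bounds = pd.DataFrame({"start": start, "end": np.arange(len(df)) + 1})
    sums = bounds.apply(lambda r: df["A"].iloc[r["start"]:r["end"]].sum(), axis=1)
    sums.index = df["Date"]
    return sums

# Both approaches should produce the same sums before their speed is compared.
same = np.allclose(using_asfreq(df).to_numpy(), using_apply(df).to_numpy())
print("results agree:", same)

for fn in (using_asfreq, using_apply):
    seconds = timeit.timeit(lambda: fn(df), number=5) / 5
    print(f"{fn.__name__}: {seconds * 1000:.2f} ms per call")
```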