Notice: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/14631139/

Time: 2020-09-13 20:38:01  Source: igfitidea

Pandas Rolling Computations on Sliding Windows (Unevenly spaced)

python pandas

Asked by radikalus

Suppose you've got some unevenly spaced time series data:

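import random as randy
import pandas as pd

# One million candidate timestamps, 1 microsecond apart; keep a random 1000 of them.
# random.sample needs a sequence, so the DatetimeIndex is turned into a list first.
candidates = pd.date_range('2013-02-01 09:00:00.000000', periods=10**6, freq='us')
ts = pd.Series(range(1000), index=randy.sample(list(candidates), 1000)).sort_index()
print(ts.head())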


2013-02-01 09:00:00.002895    995
2013-02-01 09:00:00.003765    499
2013-02-01 09:00:00.003838    797
2013-02-01 09:00:00.004727    295
2013-02-01 09:00:00.006287    253

Let's say I wanted to do the rolling sum over a 1ms window to get this:

2013-02-01 09:00:00.002895    995
2013-02-01 09:00:00.003765    499 + 995
2013-02-01 09:00:00.003838    797 + 499 + 995
2013-02-01 09:00:00.004727    295 + 797 + 499
2013-02-01 09:00:00.006287    253

Currently, I cast everything back to longs and do this in cython, but is this possible in pure pandas? I'm aware that you can do something like .asfreq('U') and then fill and use the traditional functions but this doesn't scale once you've got more than a toy # of rows.
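
For readers who have not seen the dense-grid workaround mentioned above, here is a minimal sketch of it on the five sample rows, assuming a fixed count-based window of 1000 microsecond slots stands in for the 1 ms window. The variable names (dense, dense_sum, result) are made up for illustration. It also shows why the approach does not scale: the grid needs one row per microsecond between the first and last observation.

```python
import pandas as pd

# The five sample rows from the question.
ts = pd.Series(
    [995, 499, 797, 295, 253],
    index=pd.to_datetime([
        '2013-02-01 09:00:00.002895',
        '2013-02-01 09:00:00.003765',
        '2013-02-01 09:00:00.003838',
        '2013-02-01 09:00:00.004727',
        '2013-02-01 09:00:00.006287',
    ]),
)

# Expand to a regular 1-microsecond grid; slots with no observation become 0.
dense = ts.asfreq('us').fillna(0)

# 1000 consecutive microsecond slots cover the 1 ms ending at each timestamp.
dense_sum = dense.rolling(window=1000, min_periods=1).sum()

# Keep only the original timestamps.
result = dense_sum.loc[ts.index]
print(len(dense), 'grid rows for', len(ts), 'observations')
print(result)
```

Note that a 1000-slot window excludes a row lying exactly 1 ms back; a window of 1001 slots would include it.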

As a point of reference, here's a hackish, not fast Cython version:

%%cython
import numpy as np
cimport cython
cimport numpy as np

ctypedef np.double_t DTYPE_t

def rolling_sum_cython(np.ndarray[long,ndim=1] times, np.ndarray[double,ndim=1] to_add, long window_size):
    cdef long t_len = times.shape[0], s_len = to_add.shape[0], i =0, win_size = window_size, t_diff, j, window_start
    cdef np.ndarray[DTYPE_t, ndim=1] res = np.zeros(t_len, dtype=np.double)
    assert(t_len==s_len)
    for i in range(0,t_len):
        window_start = times[i] - win_size
        j = i
        while j >= 0 and times[j] >= window_start:
            res[i] += to_add[j]
            j-=1
    return res   
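
If Cython is not at hand, the same loop can be mirrored in plain Python with NumPy arrays, which is handy for checking results on small inputs. This is only a sketch: rolling_sum_reference is a hypothetical name, and the sample times are given as microseconds after 09:00:00 rather than the nanosecond integers the question passes in.

```python
import numpy as np

def rolling_sum_reference(times, to_add, window_size):
    """Sum the values whose time lies in [t - window_size, t] for each row t."""
    times = np.asarray(times, dtype=np.int64)
    to_add = np.asarray(to_add, dtype=np.float64)
    assert times.shape[0] == to_add.shape[0]
    res = np.zeros(times.shape[0], dtype=np.float64)
    for i in range(times.shape[0]):
        window_start = times[i] - window_size
        j = i
        # Check j first so the index never runs off the front of the array.
        while j >= 0 and times[j] >= window_start:
            res[i] += to_add[j]
            j -= 1
    return res

# The question's five sample rows, with times in microseconds (an assumption
# made here for readability) and a 1000-microsecond window.
times_us = [2895, 3765, 3838, 4727, 6287]
values = [995, 499, 797, 295, 253]
reference = rolling_sum_reference(times_us, values, 1000)
print(reference)
```

It is far slower than the Cython version and is meant only as a correctness check.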

Demonstrating this on a slightly larger series:

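import numpy as np
# Sample 100,000 distinct microsecond offsets instead of building all 1e8 candidate timestamps.
offsets = randy.sample(range(10**8), 100000)
index = pd.Timestamp('2013-02-01 09:00:00') + pd.to_timedelta(offsets, unit='us')
ts = pd.Series(range(100000), index=index).sort_index()

%%timeit
res2 = rolling_sum_cython(ts.index.astype(np.int64), ts.values.astype(np.float64), int(1e6))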
1000 loops, best of 3: 1.56 ms per loop

Accepted answer by signalseeker

You can solve most problems of this sort with cumsum and binary search.

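from datetime import timedelta

import numpy as np
import pandas as pd

def msum(s, lag_in_ms):
    # Start of each window: the row's own timestamp minus the lag.
    lag = s.index - timedelta(milliseconds=lag_in_ms)
    # Position of the first row at or after each window start (the index must be sorted).
    inds = np.searchsorted(s.index.astype(np.int64), lag.astype(np.int64))
    cs = s.cumsum().values
    # Sum of rows inds..i, using positional indexing on the raw arrays.
    return pd.Series(cs - cs[inds] + s.values[inds], index=s.index)

res = msum(ts, 100)
print(pd.DataFrame({'a': ts, 'a_msum_100': res}))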


                            a  a_msum_100
2013-02-01 09:00:00.073479  5           5
2013-02-01 09:00:00.083717  8          13
2013-02-01 09:00:00.162707  1          14
2013-02-01 09:00:00.171809  6          20
2013-02-01 09:00:00.240111  7          14
2013-02-01 09:00:00.258455  0          14
2013-02-01 09:00:00.336564  2           9
2013-02-01 09:00:00.536416  3           3
2013-02-01 09:00:00.632439  4           7
2013-02-01 09:00:00.789746  9           9

[10 rows x 2 columns]

You need a way of handling NaNs, and depending on your application you may or may not need the prevailing value as of the lagged time (i.e. the difference between using kdb+ bin and np.searchsorted).
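
One possible way to deal with the NaN point, as a sketch only: treat missing values as 0 before taking the cumulative sum. The function name msum_nan_safe and the include_window_start flag are hypothetical and not part of the answer above. The flag only controls whether a row lying exactly on the window start is counted (via the side argument of np.searchsorted); it is not an implementation of kdb+ bin. It also assumes lag_in_ms is an integer.

```python
import numpy as np
import pandas as pd

def msum_nan_safe(s, lag_in_ms, include_window_start=True):
    """Windowed sum over the last lag_in_ms milliseconds, treating NaN as 0."""
    values = s.fillna(0).to_numpy(dtype=np.float64)
    stamps = s.index.to_numpy()  # datetime64 array, assumed sorted
    starts = stamps - np.timedelta64(lag_in_ms, 'ms')
    # 'left' keeps a row exactly at the window start, 'right' drops it.
    side = 'left' if include_window_start else 'right'
    first = np.searchsorted(stamps, starts, side=side)
    # Prefix sums with a leading 0, so sum(values[a:b]) == cs[b] - cs[a].
    cs = np.concatenate(([0.0], np.cumsum(values)))
    last = np.arange(1, len(values) + 1)
    return pd.Series(cs[last] - cs[first], index=s.index)

idx = pd.to_datetime([
    '2013-02-01 09:00:00.002895',
    '2013-02-01 09:00:00.003765',
    '2013-02-01 09:00:00.003838',
    '2013-02-01 09:00:00.004727',
    '2013-02-01 09:00:00.006287',
])
s = pd.Series([995, np.nan, 797, 295, 253], index=idx)
nan_safe = msum_nan_safe(s, 1)
print(nan_safe)
```

Whether a NaN should count as 0, be skipped, or poison the whole window depends on the application, so the fillna(0) choice here is just one option.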

Hope this helps.

Answer by Kevin Wang

This is an old question, but for those who stumble upon this from Google: in pandas 0.19 this is built in as time-aware rolling:

http://pandas.pydata.org/pandas-docs/stable/computation.html#time-aware-rolling

So to get 1 ms windows it looks like you get a Rolling object by doing

dft.rolling('1ms')

and the sum would be

dft.rolling('1ms').sum()
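
Applied to the five sample rows from the question, that would look roughly like the sketch below (ts stands in for dft here). One thing to watch: by default a time-based window is open on the left, so a row lying exactly 1 ms back is not counted, whereas the Cython loop in the question counts it. The closed argument, available in pandas 0.20 and later, lets you choose.

```python
import pandas as pd

ts = pd.Series(
    [995, 499, 797, 295, 253],
    index=pd.to_datetime([
        '2013-02-01 09:00:00.002895',
        '2013-02-01 09:00:00.003765',
        '2013-02-01 09:00:00.003838',
        '2013-02-01 09:00:00.004727',
        '2013-02-01 09:00:00.006287',
    ]),
)

# Default window is (t - 1ms, t]: left-open, right-closed.
right_closed = ts.rolling('1ms').sum()

# closed='both' also counts a row exactly 1 ms back, like the >= test in the Cython loop.
both_closed = ts.rolling('1ms', closed='both').sum()

print(right_closed)
print(both_closed)
```

No sample row sits exactly on a window boundary, so both variants give the same sums here. The index has to be sorted for time-based windows to work.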

Answer by Andy Hayden

Perhaps it makes more sense to use rolling_sum:

pd.rolling_sum(ts, window=1, freq='1ms')

Answer by Zelazny7

How about something like this:

Create an offset for 1 ms:

In [1]: ms = pd.tseries.offsets.Milli()

Create a series of index positions the same length as your timeseries:

In [2]: s = pd.Series(range(len(ts)))

Apply a lambda function that indexes the current time from the ts series. The function returns the sum of all ts entries between x - ms and x.

In [3]: s.apply(lambda x: ts.between_time(start_time=ts.index[x]-ms, end_time=ts.index[x]).sum())

In [4]: ts.head()
Out[4]:
2013-02-01 09:00:00.000558    348
2013-02-01 09:00:00.000647    361
2013-02-01 09:00:00.000726    312
2013-02-01 09:00:00.001012    550
2013-02-01 09:00:00.002208    758

Results of the above function:

0     348
1     709
2    1021
3    1571
4     758
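
A variation on the same idea, offered as a sketch rather than as this answer's method: between_time is documented as selecting by time of day, so the version below swaps in label slicing on the sorted DatetimeIndex, which includes both endpoints. The names ms and sums are chosen for illustration.

```python
import pandas as pd

# The five rows shown in ts.head() above.
ts = pd.Series(
    [348, 361, 312, 550, 758],
    index=pd.to_datetime([
        '2013-02-01 09:00:00.000558',
        '2013-02-01 09:00:00.000647',
        '2013-02-01 09:00:00.000726',
        '2013-02-01 09:00:00.001012',
        '2013-02-01 09:00:00.002208',
    ]),
)

ms = pd.Timedelta(milliseconds=1)

# For each timestamp t, sum every entry with a label in [t - 1ms, t].
sums = pd.Series([ts.loc[t - ms:t].sum() for t in ts.index], index=ts.index)
print(sums)
```

Like the apply version, this does one slice per row, so it is simple but slow on large series.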