Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/14300768/

Date: 2020-09-13 20:35:17  Source: igfitidea

pandas rolling computation with window based on values instead of counts

python pandas

Asked by BrenBarn

I'm looking for a way to do something like the various rolling_* functions of pandas, but I want the window of the rolling computation to be defined by a range of values (say, a range of values of a column of the DataFrame), not by the number of rows in the window.

As an example, suppose I have this data:

>>> print d
   RollBasis  ToRoll
0          1       1
1          1       4
2          1      -5
3          2       2
4          3      -4
5          5      -2
6          8       0
7         10     -13
8         12      -2
9         13      -5

If I do something like rolling_sum(d, 5), I get a rolling sum in which each window contains 5 rows. But what I want is a rolling sum in which each window contains a certain range of values of RollBasis. That is, I'd like to be able to do something like d.roll_by(sum, 'RollBasis', 5), and get a result where the first window contains all rows whose RollBasis is between 1 and 5, the second window contains all rows whose RollBasis is between 2 and 6, the third window contains all rows whose RollBasis is between 3 and 7, etc. The windows will not have equal numbers of rows, but the range of RollBasis values selected in each window will be the same. So the output should be like:

>>> d.roll_by(sum, 'RollBasis', 5)
    1    -4    # sum of elements with 1 <= RollBasis <= 5
    2    -4    # sum of elements with 2 <= RollBasis <= 6
    3    -6    # sum of elements with 3 <= RollBasis <= 7
    4    -2    # sum of elements with 4 <= RollBasis <= 8
    # etc.

I can't do this with groupby, because groupby always produces disjoint groups. I can't do it with the rolling functions, because their windows always roll by number of rows, not by values. So how can I do it?

Accepted answer by Zelazny7

I think this does what you want:

In [1]: df
Out[1]:
   RollBasis  ToRoll
0          1       1
1          1       4
2          1      -5
3          2       2
4          3      -4
5          5      -2
6          8       0
7         10     -13
8         12      -2
9         13      -5

In [2]: def f(x):
   ...:     ser = df.ToRoll[(df.RollBasis >= x) & (df.RollBasis < x+5)]
   ...:     return ser.sum()

The function above takes a single value x from RollBasis and selects the ToRoll entries whose RollBasis lies in the half-open range x <= RollBasis < x + 5. The selected series is then summed and returned.

In [3]: df['Rolled'] = df.RollBasis.apply(f)

In [4]: df
Out[4]:
   RollBasis  ToRoll  Rolled
0          1       1      -4
1          1       4      -4
2          1      -5      -4
3          2       2      -4
4          3      -4      -6
5          5      -2      -2
6          8       0     -15
7         10     -13     -20
8         12      -2      -7
9         13      -5      -5
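
For readers who want to run the approach above end to end, here is a minimal self-contained sketch. It assumes a recent pandas and rebuilds the toy frame from a dict; the helper name `window_sum` and its `width` parameter are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Toy data from the question
df = pd.DataFrame({
    "RollBasis": [1, 1, 1, 2, 3, 5, 8, 10, 12, 13],
    "ToRoll": [1, 4, -5, 2, -4, -2, 0, -13, -2, -5],
})

def window_sum(x, width=5):
    # Hypothetical helper: select rows whose RollBasis falls in
    # the half-open window [x, x + width) and sum their ToRoll values
    mask = (df["RollBasis"] >= x) & (df["RollBasis"] < x + width)
    return df.loc[mask, "ToRoll"].sum()

df["Rolled"] = df["RollBasis"].apply(window_sum)
print(df)
```

Each call scans the whole column with a boolean mask, so the cost grows with the number of rows times the number of windows.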

Code for the toy example DataFrame in case someone else wants to try:

In [1]: from pandas import *

In [2]: import io

In [3]: text = """\
   ...:    RollBasis  ToRoll
   ...: 0          1       1
   ...: 1          1       4
   ...: 2          1      -5
   ...: 3          2       2
   ...: 4          3      -4
   ...: 5          5      -2
   ...: 6          8       0
   ...: 7         10     -13
   ...: 8         12      -2
   ...: 9         13      -5
   ...: """

In [4]: df = read_csv(io.StringIO(text), header=0, index_col=0, sep=r'\s+')

Answer by BrenBarn

Based on Zelazny7's answer, I created this more general solution:

def rollBy(what, basis, window, func):
    def applyToWindow(val):
        chunk = what[(val<=basis) & (basis<val+window)]
        return func(chunk)
    return basis.apply(applyToWindow)

>>> rollBy(d.ToRoll, d.RollBasis, 5, sum)
0    -4
1    -4
2    -4
3    -4
4    -6
5    -2
6   -15
7   -20
8    -7
9    -5
Name: RollBasis
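
Because `func` is passed in, the same helper can report things other than sums. The sketch below is a usage example under the assumption of a recent pandas; it repeats the function so that it runs on its own, and uses `len` to show that the value-based windows contain different numbers of rows.

```python
import pandas as pd

def rollBy(what, basis, window, func):
    def applyToWindow(val):
        # Rows whose basis lies in the half-open window [val, val + window)
        chunk = what[(val <= basis) & (basis < val + window)]
        return func(chunk)
    return basis.apply(applyToWindow)

d = pd.DataFrame({
    "RollBasis": [1, 1, 1, 2, 3, 5, 8, 10, 12, 13],
    "ToRoll": [1, 4, -5, 2, -4, -2, 0, -13, -2, -5],
})

sums = rollBy(d.ToRoll, d.RollBasis, 5, sum)
counts = rollBy(d.ToRoll, d.RollBasis, 5, len)  # rows per window
print(pd.DataFrame({"sum": sums, "rows": counts}))
```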

It's still not ideal as it is very slow compared to rolling_apply, but perhaps this is inevitable.
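
For the specific case of a sum, the per-window Python call can be avoided. The following is a sketch of a different technique (prefix sums plus `numpy.searchsorted`), not something from the original answers. It assumes `basis` is sorted ascending, and the name `roll_sum_by` is hypothetical. It does not generalize to arbitrary `func`, and with floating-point data the subtraction of prefix sums may introduce small rounding differences.

```python
import numpy as np
import pandas as pd

def roll_sum_by(what, basis, window):
    # Hypothetical helper; assumes basis is sorted ascending.
    b = basis.to_numpy()
    # Prefix sums with a leading 0, so sum(what[i:j]) == csum[j] - csum[i]
    csum = np.concatenate(([0], np.cumsum(what.to_numpy())))
    # First row with basis >= val
    start = np.searchsorted(b, b, side="left")
    # First row with basis >= val + window (window is half-open)
    stop = np.searchsorted(b, b + window, side="left")
    return pd.Series(csum[stop] - csum[start], index=basis.index)

d = pd.DataFrame({
    "RollBasis": [1, 1, 1, 2, 3, 5, 8, 10, 12, 13],
    "ToRoll": [1, 4, -5, 2, -4, -2, 0, -13, -2, -5],
})
fast = roll_sum_by(d.ToRoll, d.RollBasis, 5)
print(fast)
```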

Answer by Ian Sudbery

Based on BrenBarn's answer, but sped up by using label-based indexing rather than boolean indexing:

import pandas as pd

def rollBy(what, basis, window, func, *args, **kwargs):
    # Note: basis must be sorted for this to work properly.
    indexed_what = pd.Series(what.values, index=basis.values)
    def applyToWindow(val):
        # Using slice_indexer rather than what.loc[val:val+window] allows
        # window limits that are not specifically in the index.
        # Label-based bounds are inclusive at both ends.
        indexer = indexed_what.index.slice_indexer(val, val+window, 1)
        # slice_indexer returns a positional slice, so apply it with .iloc
        chunk = indexed_what.iloc[indexer]
        return func(chunk, *args, **kwargs)
    rolled = basis.apply(applyToWindow)
    return rolled

This is much faster than not using an indexed column:

In [46]: df = pd.DataFrame({"RollBasis":np.random.uniform(0,1000000,100000), "ToRoll": np.random.uniform(0,10,100000)})

In [47]: df = df.sort_values("RollBasis")

In [48]: timeit("rollBy_Ian(df.ToRoll,df.RollBasis,10,sum)",setup="from __main__ import rollBy_Ian,df", number =3)
Out[48]: 67.6615059375763

In [49]: timeit("rollBy_Bren(df.ToRoll,df.RollBasis,10,sum)",setup="from __main__ import rollBy_Bren,df", number =3)
Out[49]: 515.0221037864685

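It's worth noting that the index-based solution is roughly O(n) (strictly, each window lookup is a binary search on the sorted index, plus the cost of func on the chunk), while the logical slicing version is O(n^2) in the average case, since every window scans the whole column (I think).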
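
The lookup that the index-based version relies on can be seen in isolation. This small sketch, assuming a recent pandas, calls `Index.slice_indexer` on a sorted index with duplicates. Note that label-based bounds are inclusive at both ends, so the index-based `rollBy` covers val <= basis <= val+window, while the boolean version excludes val+window; the two can therefore return different results when a row sits exactly on the upper bound.

```python
import pandas as pd

# Sorted index with duplicates, matching the toy RollBasis column
idx = pd.Index([1, 1, 1, 2, 3, 5, 8, 10, 12, 13])

# Neither 6 nor 9 is in the index; the bounds are located
# by searching the sorted index
indexer = idx.slice_indexer(6, 9)
print(indexer, idx[indexer].tolist())

# Both ends are inclusive: the row with basis == 5 is part of [1, 5]
inclusive = idx.slice_indexer(1, 5)
print(idx[inclusive].tolist())
```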

I find it more useful to do this over evenly spaced windows from the min value of basis to the max value of basis, rather than at every value of basis. This means altering the function thus:

import numpy as np
import pandas as pd

def rollBy(what, basis, window, func, *args, **kwargs):
    # Note: basis must be sorted for this to work properly.
    windows_min = basis.min()
    windows_max = basis.max()
    # One window start every `window` units, from min (inclusive) to max (exclusive)
    window_starts = np.arange(windows_min, windows_max, window)
    window_starts = pd.Series(window_starts, index=window_starts)
    indexed_what = pd.Series(what.values, index=basis.values)
    def applyToWindow(val):
        # Using slice_indexer rather than what.loc[val:val+window] allows
        # window limits that are not specifically in the index.
        indexer = indexed_what.index.slice_indexer(val, val+window, 1)
        # slice_indexer returns a positional slice, so apply it with .iloc
        chunk = indexed_what.iloc[indexer]
        return func(chunk, *args, **kwargs)
    rolled = window_starts.apply(applyToWindow)
    return rolled
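
As a usage sketch for the evenly spaced variant, the block below applies it to the toy data from the question with a window of 5. It assumes a recent pandas and NumPy; the function is restated under the hypothetical name `roll_by_even` so the example runs on its own. The window starts come out as 1, 6 and 11, and the result is indexed by those starts.

```python
import numpy as np
import pandas as pd

def roll_by_even(what, basis, window, func, *args, **kwargs):
    # Hypothetical restatement of the function above; assumes basis is sorted.
    window_starts = np.arange(basis.min(), basis.max(), window)
    window_starts = pd.Series(window_starts, index=window_starts)
    indexed_what = pd.Series(what.values, index=basis.values)

    def apply_to_window(val):
        # Positional slice covering labels in [val, val + window], ends inclusive
        indexer = indexed_what.index.slice_indexer(val, val + window, 1)
        return func(indexed_what.iloc[indexer], *args, **kwargs)

    return window_starts.apply(apply_to_window)

d = pd.DataFrame({
    "RollBasis": [1, 1, 1, 2, 3, 5, 8, 10, 12, 13],
    "ToRoll": [1, 4, -5, 2, -4, -2, 0, -13, -2, -5],
})
result = roll_by_even(d.ToRoll, d.RollBasis, 5, sum)
print(result)
```

Because both window ends are inclusive, a row whose basis equals a window boundary would be counted in two adjacent windows; the toy data has no such row.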