Note: this page reproduces a popular Stack Overflow question and its answer under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/45254174/

How do pandas Rolling objects work?

Tags: python, pandas, numpy, dataframe, cython

Asked by Brad Solomon

Edit: I condensed this question, given that it was probably too involved to begin with. The meat of the question is in bold below.

I'd like to know more about the object that is actually created when using DataFrame.rolling or Series.rolling:

print(type(df.rolling(3)))  # call rolling() with a window; df.rolling alone is just a bound method
<class 'pandas.core.window.Rolling'>

Some background: consider the oft-used alternative with np.lib.stride_tricks.as_strided. The code snippet itself isn't important, but its result is my reference point in asking this question.

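import numpy as np

def rwindows(a, window):
    # Treat 1d input as a single-column 2d array
    if a.ndim == 1:
        a = a.reshape(-1, 1)
    # One block per window position: (n_windows, window, n_columns)
    shape = a.shape[0] - window + 1, window, a.shape[-1]
    # Moving to the next block advances by one row of `a`
    strides = (a.strides[0],) + a.strides
    windows = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
    return np.squeeze(windows)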

Here rwindows will take a 1d or 2d ndarray and build rolling "blocks" equal to the specified window size (as below). How does a .rolling object compare to the ndarray output below? Is it an iterator, with certain attributes stored for each block? Or something else entirely? I've tried playing around with tab completion on the object, with attributes/methods such as __dict__ and _get_index(), and they're not telling me much. I've also seen a _create_blocks method in pandas. Does it at all resemble the strided method?

# as_strided version

a = np.arange(5)
print(rwindows(a, 3))           # 1d input
[[0 1 2]
 [1 2 3]
 [2 3 4]]

b = np.arange(10).reshape(5,2)
print(rwindows(b, 4))           # 2d input
[[[0 1]
  [2 3]
  [4 5]
  [6 7]]

 [[2 3]
  [4 5]
  [6 7]
  [8 9]]]

Part 2, extra credit

Using the NumPy approach above (for instance, in a rolling OLS implementation) is necessitated by the fact that func within pandas.core.window.Rolling.apply must:

produce a single value from an ndarray input. *args and **kwargs are passed to the function.

So the argument can't be another rolling object. I.e.:

def prod(a, b):
    return a * b
df.rolling(3).apply(prod, args=((df + 2).rolling(3),))
-----------------------------------------------------------------------
...
TypeError: unsupported operand type(s) for *: 'float' and 'Rolling'

So this is really where my question above stems from. Why is it that the passed function must use a NumPy array and produce a single scalar value, and what does this have to do with the layout of a .rolling object?

Answered by André C. Andersen

I suggest you have a look at the source code in order to get into the nitty-gritty of what rolling does. In particular, I suggest you have a look at the rolling functions in generic.py and window.py. From there you can have a look at the Window class, which is used if you specify a window type, or the default Rolling class. The latter inherits from _Rolling_and_Expanding, and ultimately from _Rolling and _Window.

That said, I'll give my two cents: pandas' whole rolling mechanism relies on the numpy function apply_along_axis. In particular, it is used in pandas' window code, in conjunction with the windows.pyx Cython module. In goes your series, out comes the aggregated rolling window. Typical aggregation functions are handled for you efficiently, but custom ones (passed through apply()) go through roll_generic() in windows.pyx.

The rolling function in pandas operates on the columns of a data frame independently. The object it returns is not a Python iterator, and it is lazily evaluated, meaning nothing is computed until you apply an aggregation function to it. The functions that actually apply the rolling window to the data aren't called until right before an aggregation is done.
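
To see the laziness for yourself, here is a minimal sketch. The variable names are mine, and the exact class path will depend on your pandas version. Creating the rolling object computes nothing, and a dataframe only appears once an aggregation is called:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(5)})

# Creating the Rolling object only records the window settings.
rol = df.rolling(3)
print(type(rol).__name__)             # name of the rolling class
print(isinstance(rol, pd.DataFrame))  # the rolling object is not a dataframe

# The computation happens when an aggregation is requested.
result = rol.sum()
print(type(result).__name__)
print(result)
```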

A source of confusion might be that you're thinking of the rolling object as a dataframe. (You have named the rolling object df in your last code snippet.) It really isn't. It is an object that can produce dataframes by applying aggregations over the window logic it houses.

The lambda you are supplying is applied for each cell of your new dataframe. It takes a window backwards (along each column) in your old dataframe, and it aggregates it to one single cell in the new dataframe. The aggregation can be things like sum, mean, something custom you've made, etc., over some window size, say 3. Here are some examples:

a = np.arange(5)
df = pd.DataFrame(a, columns=['a'])
df.rolling(3).mean().dropna()

... which can also be done by:

df.rolling(3).apply(np.mean).dropna()

... and produces:

     a
2  1.0
3  2.0
4  3.0

(The first column is the index value and can be ignored here, and for the next examples.)
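
As a quick sanity check, and only as a sketch, the built-in mean and the apply(np.mean) spelling can be compared directly. The raw=True argument is an assumption about your pandas version: it exists in recent releases and passes each window as a plain NumPy array.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(5)})

# Built-in rolling mean over a window of 3 rows.
built_in = df.rolling(3).mean().dropna()

# The same aggregation through apply(), with NumPy's mean as the function.
via_apply = df.rolling(3).apply(np.mean, raw=True).dropna()

print(built_in)
print(np.allclose(built_in["a"], via_apply["a"]))
```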

Notice how we supplied an existing numpy aggregation function. That's the idea. We're supposed to be able to supply anything we want as long as it conforms to what aggregation functions do, i.e., take a vector of values and produce a single value from it. Here is another one where we create a custom aggregation function, in this case the L2 norm of the window:

df.rolling(3).apply(lambda x: np.sqrt(x.dot(x))).dropna()

If you're not familiar with lambda functions, this is the same as:

def euclidean_dist(x):
    return np.sqrt(x.dot(x))

df.rolling(3).apply(euclidean_dist).dropna()

... yielding:

          a
2  2.236068
3  3.741657
4  5.385165

Just to make sure, we can manually check that np.sqrt(0**2 + 1**2 + 2**2) is indeed 2.236068.
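
The same manual check can be run over every window at once. This is a sketch under my own assumptions: the slicing loop and the use of np.linalg.norm are mine, and raw=True assumes a recent pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(5)})
window = 3

# Rolling L2 norm via apply(); each window arrives as a NumPy array.
rolled = df["a"].rolling(window).apply(lambda x: np.sqrt(x.dot(x)), raw=True).dropna()

# The same numbers, computed by slicing out each window by hand.
values = df["a"].to_numpy(dtype=float)
manual = [np.linalg.norm(values[i:i + window])
          for i in range(len(values) - window + 1)]

print(np.allclose(rolled.to_numpy(), manual))
```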

[In your original edit, in the] last code snippet, your code is probably failing earlier than you expect. It fails before the invocation of df.apply(...): you are trying to add a rolling object named df to the number 2 before it is passed to df.apply(...). The rolling object isn't something you do operations on. The function you have supplied also doesn't conform to what an aggregation function does in general. Here a is an array holding the values of one window, and b would be a constant extra parameter you pass in. It can be a rolling object if you want, but that wouldn't typically be something you would like to do. To make it clearer, here is something similar to what you were doing in your original edit, but which works:

a = np.arange(8)
df = pd.DataFrame(a, columns=['a'])
n = 4
rol = df.rolling(n)

def prod(window_list, constant_rol):
    return window_list.dot(constant_rol.sum().dropna().head(n))

rol.apply(prod, args=(rol,)).dropna()

# [92.0, 140.0, 188.0, 236.0, 284.0]

It is a contrived example, but I'm showing it to make the point that you can pass in whatever you want as a constant, even the rolling object you are using itself. The dynamic part is the first argument, a in your case or window_list in my case. All defined windows, in the form of individual lists, are passed into that function one by one.
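
A less contrived sketch of the same idea passes a fixed weight vector as the constant. The weights and the function name weighted_sum are hypothetical, chosen only for illustration, and raw=True assumes a recent pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(8, dtype=float)})

# Hypothetical constant: one weight per position in the window.
weights = np.array([0.1, 0.2, 0.3, 0.4])

def weighted_sum(window, w):
    # `window` changes on every call; `w` is the same object each time.
    return float(np.dot(window, w))

out = df["a"].rolling(len(weights)).apply(weighted_sum, args=(weights,), raw=True).dropna()
print(out)
```

The first argument is filled in by pandas for each window, while everything in args is handed over unchanged on every call.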

Based on your followup comments this might be what you're looking for:

import numpy as np
import pandas as pd

n = 3
a = np.arange(5)
df = pd.DataFrame(a, columns=['a'])

def keep(window, windows):
    windows.append(window.copy())
    return window[-1]

windows = list()
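# raw=True (newer pandas) passes each window to keep() as a NumPy array
df['a'].rolling(n).apply(keep, args=(windows,), raw=True)
df = df.tail(n).copy()  # copy so the column assignment below does not warn
df['a_window'] = windows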

which adds arrays/vectors to each rolling block thus producing:

   a         a_window
2  2  [0.0, 1.0, 2.0]
3  3  [1.0, 2.0, 3.0]
4  4  [2.0, 3.0, 4.0]

Note that it only works if you do it on one column at a time. If you want to do some math on the window before you store it away in keep, that is fine too.
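
If all you need are the windows themselves, a sketch that skips apply() altogether is possible with NumPy's sliding_window_view. This assumes NumPy 1.20 or newer, and it is my suggestion rather than part of the approach above.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

n = 3
df = pd.DataFrame({"a": np.arange(5)})

# One row per full window; this is a read-only view, not a copy.
windows = sliding_window_view(df["a"].to_numpy(), n)

# Each window is labelled by the index of its last row.
labels = df.index[n - 1:]
for label, w in zip(labels, windows):
    print(label, w)
```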

That said, without more input on exactly what you are trying to achieve it is hard to construct an example which suits your needs.

If your ultimate goal is to create a dataframe of lagged variables, I'd create real columns with shift():

import numpy as np
import pandas as pd

a = np.arange(5)

df = pd.DataFrame(a, columns=['a'])
for i in range(1,3):
    df['a-%s' % i] = df['a'].shift(i)

df.dropna()

... giving:

   a  a-1  a-2
2  2  1.0  0.0
3  3  2.0  1.0
4  4  3.0  2.0

(There might be some more beautiful way of doing it, but it gets the job done.)
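
One possibly tidier variant, offered only as a sketch, builds the lagged columns in a single pd.concat call instead of a loop. The column naming mirrors the loop version, and the name lagged is mine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(5)})
lags = range(1, 3)

# Original column plus one shifted copy per lag, joined side by side.
columns = [df["a"]] + [df["a"].shift(i).rename(f"a-{i}") for i in lags]
lagged = pd.concat(columns, axis=1).dropna()

print(lagged)
```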

Regarding your variable b in your first code snippet, remember that DataFrames in pandas aren't typically handled as tensors of arbitrary dimensions/objects. You can probably stuff whatever you want into one, but ultimately strings, time objects, ints and floats are what is expected. That might be the reason the designers of pandas haven't bothered with allowing rolling aggregation to non-scalar values. It doesn't even seem like a simple string is allowed as the output of the aggregation function.

Anyway, I hope this answers some of your questions. If not, let me know, and I'll try to help you out in the comments or in an update.



Final note on the _create_blocks() function of rolling objects.

The _create_blocks() function handles the reindexing and binning when you use the freq argument of rolling.

If you use freq with, say, weeks, such that freq='W':

import numpy as np
import pandas as pd

a = np.arange(50)
df = pd.DataFrame(a, columns=['a'])
df.index = pd.to_datetime('2016-01-01') + pd.to_timedelta(df['a'], 'D')
blocks, obj, index = df.rolling(4, freq='W')._create_blocks(how=None)
for b in blocks:
    print(b)

... then we get the binned (not rolling) original data week-by-week:

               a
a               
2016-01-03   2.0
2016-01-10   9.0
2016-01-17  16.0
2016-01-24  23.0
2016-01-31  30.0
2016-02-07  37.0
2016-02-14  44.0
2016-02-21   NaN

Notice that this isn't the output of the aggregated rolling. These are simply the new blocks it works on. After this, we do an aggregation like sum and get:

                a
a                
2016-01-03    NaN
2016-01-10    NaN
2016-01-17    NaN
2016-01-24   50.0
2016-01-31   78.0
2016-02-07  106.0
2016-02-14  134.0
2016-02-21    NaN

... which checks out with a test summation: 50 = 2 + 9 + 16 + 23.
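
As far as I know, newer pandas releases no longer accept freq in rolling, and _create_blocks() is a private method. Here is a sketch of the same two steps with public calls: binning with resample(...).asfreq(), then a plain rolling sum. Treat it as my assumed equivalent, not as what pandas does internally.

```python
import numpy as np
import pandas as pd

# Same toy data as above: 50 daily values starting on 2016-01-01.
index = pd.date_range("2016-01-01", periods=50, freq="D")
df = pd.DataFrame({"a": np.arange(50)}, index=index)

# Step 1 (binning): keep the value found on each week-ending Sunday.
weekly = df.resample("W").asfreq()

# Step 2 (rolling): aggregate four consecutive weekly values.
rolled = weekly.rolling(4).sum()

print(weekly.head())
print(rolled.head())
```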

If you don't use freq as an argument, it simply returns the original data structure:

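import numpy as np
import pandas as pd
a = np.arange(5)
df = pd.DataFrame(a, columns=['a'])
# Same date index as in the previous example, so the output below matches
df.index = pd.to_datetime('2016-01-01') + pd.to_timedelta(df['a'], 'D')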
blocks, obj, index = df.rolling(3)._create_blocks(how=None)

for b in blocks:
    print(b)

... which produces ...

            a
a            
2016-01-01  0
2016-01-02  1
2016-01-03  2
2016-01-04  3
2016-01-05  4

... and is used for rolling window aggregation.
