Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/48855400/

Time: 2020-09-14 05:12:14  Source: igfitidea

pandas rolling window & datetime indexes: What does `offset` mean?

Tags: python, pandas, datetime, dataframe

Asked by ascripter

The rolling window function `pandas.DataFrame.rolling` of pandas 0.22 takes a `window` argument that is described as follows:

window: int, or offset

Size of the moving window. This is the number of observations used for calculating the statistic. Each window will be a fixed size.

If it's an offset then this will be the time period of each window. Each window will be variable-sized, based on the observations included in the time period. This is only valid for datetimelike indexes. This is new in 0.19.0

What actually is an `offset` in this context?

Answered by JohnE

In a nutshell, if you use an offset like `"2D"` (2 days), pandas will use the datetime info in the index (if available), which lets it account for missing rows or irregular frequencies. But if you use a plain `int` like `2`, then pandas will treat the index as a simple integer index [0,1,2,...] and ignore any datetime info in the index.

A simple example should make this clear:

import pandas as pd

df = pd.DataFrame({'x': range(4)},
                  index=pd.to_datetime(['1-1-2018', '1-2-2018', '1-4-2018', '1-5-2018']))

            x
2018-01-01  0
2018-01-02  1
2018-01-04  2
2018-01-05  3

Note that (1) the index is a datetime, but also (2) it is missing '2018-01-03'. So if you use a plain integer like 2, `rolling` will just look at the last two rows, regardless of the datetime value (in a sense it's behaving like `iloc[i-1:i+1]`, where `i` is the position of the current row):

df.rolling(2).count()

              x
2018-01-01  1.0
2018-01-02  2.0
2018-01-04  2.0
2018-01-05  2.0
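
To make the "last two rows by position" behaviour concrete, here is a minimal sketch that compares an integer window with a hand-written positional slice. The names `rolled` and `manual` are only illustrative, `sum` is used instead of `count` so the window contents are visible, and `min_periods=1` is passed explicitly because the default handling of incomplete windows may differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame(
    {"x": range(4)},
    index=pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-04", "2018-01-05"]),
)

# Integer window: the current row plus the one before it, by position only.
# min_periods=1 lets the first (incomplete) window produce a value too.
rolled = df["x"].rolling(2, min_periods=1).sum()

# Hand-rolled equivalent using purely positional slicing (dates are never consulted).
manual = [df["x"].iloc[max(i - 1, 0): i + 1].sum() for i in range(len(df))]

print(rolled.tolist())
print([float(v) for v in manual])
```

Both lists should match, and the gap at 2018-01-03 has no effect: the row for 2018-01-04 is still combined with the row for 2018-01-02.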

Conversely, if you use an offset of 2 days (`'2D'`), `rolling` will use the actual datetime values and account for any irregularities in the datetime index.

df.rolling('2D').count()

              x
2018-01-01  1.0
2018-01-02  2.0
2018-01-04  1.0
2018-01-05  2.0
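
As a rough check of what the `'2D'` window covers, the sketch below rebuilds each window by filtering on the timestamps directly. It assumes the default behaviour for offset windows, where each window spans the two days up to and including the current timestamp and excludes the left edge; `rolled` and `manual` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    {"x": range(4)},
    index=pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-04", "2018-01-05"]),
)

# Offset window: every row whose timestamp falls within the last 2 days.
rolled = df["x"].rolling("2D").sum()

# Hand-rolled equivalent: select rows by timestamp, not by position.
# Assumes the window is the half-open interval (t - 2 days, t].
span = pd.Timedelta("2D")
manual = [
    float(df.loc[(df.index > t - span) & (df.index <= t), "x"].sum())
    for t in df.index
]

print(rolled.tolist())
print(manual)
```

For 2018-01-04 the window reaches back to just after 2018-01-02, so only one row falls inside it, which is why the count above drops to 1.0 on that date.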

Also note, you need the index to be sorted in ascending order when using a date offset, but it doesn't matter when using a simple integer (since you're just ignoring the index anyway).
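
A small sketch of the sorting requirement, using a deliberately shuffled copy of the same data. The variable names are illustrative, and the exact error message may vary between pandas versions, so the code only checks that a `ValueError` is raised.

```python
import pandas as pd

# Same four observations as above, but with the rows out of date order.
shuffled = pd.DataFrame(
    {"x": [1, 0, 3, 2]},
    index=pd.to_datetime(["2018-01-02", "2018-01-01", "2018-01-05", "2018-01-04"]),
)

# An offset window needs an ordered datetime index, so this is expected to fail.
try:
    shuffled.rolling("2D").sum()
    offset_error = None
except ValueError as exc:
    offset_error = exc

# An integer window never looks at the index, so row order is taken as-is.
int_result = shuffled["x"].rolling(2, min_periods=1).sum()

# Sorting the index first makes the offset window usable again.
sorted_result = shuffled.sort_index()["x"].rolling("2D").sum()

print(type(offset_error).__name__)
print(int_result.tolist())
print(sorted_result.tolist())
```

Calling `sort_index()` before `rolling` is usually the simplest fix when the data arrives out of order.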
