pandas Python: rolling standard deviation with an n-day look-back

Question

Asked by Julia

I have a question about dealing with the rolling standard deviation:

The data frame looks like this:

2010-01-20 05:00:00   -0.011
2010-01-20 05:02:00   -0.032
2010-01-20 05:02:00   -0.037
2010-01-20 05:04:00    0.001
2010-01-20 05:06:00    0.023
2010-01-20 05:06:00    0.011
2010-01-20 05:08:00    0.049
2010-01-20 05:10:00    0.102
....
2010-05-20 17:00:00    0.022

This is 2-minute data from 5am to 5pm (the index is a timestamp in the format 'yyyy-mm-dd hh:mm:ss').

I want to calculate the standard deviation with an 8-day look-back. My intuition is to split the data frame into daily data sets and then calculate the rolling standard deviation, but I don't know how to deal with the index, and I guess my method may take a lot of time to compute. Thanks a lot for your help!

Finally, I would like the result to look like this:

2010-01-20   0.0
2010-01-21   0.0
2010-01-22   0.0
....
2010-01-26   0.0
2010-01-27   0.12
2010-01-28   0.02
2010-01-29   0.07
...
2010-05-20   0.10

Thank you for your help, @unutbu.

I just found a problem in the data: the data frame does not contain every 2-minute observation. For example:

2010-01-21 15:08:00    0.044
2010-01-22 05:10:00    0.102

The data ends at 15:08 on 2010-01-21 and starts again at 05:10:00 on 2010-01-22, so setting the window size to a constant may not fix this problem. Any suggestions? Thanks a lot.

Answer 1

Answered by unutbu

If the time series has a constant frequency:

You could compute the number of 2-minute intervals in 8 days:

window_size = int(pd.Timedelta('8D') / pd.Timedelta('2min'))

and then use rolling with window=window_size and take the standard deviation:

import pandas as pd
import numpy as np
np.random.seed(1)

index = pd.date_range(start='2010-01-20 5:00', end='2010-05-20 17:00', freq='2min')
N = len(index)
df = pd.DataFrame({'val': np.random.random(N)}, index=index)
# the number of 2-minute intervals in 8 days; rolling() needs an integer window
window_size = int(pd.Timedelta('8D') / pd.Timedelta('2min'))    # 5760

df['std'] = df['val'].rolling(window=window_size).std()
print(df.tail())

yields

                          val       std
2010-05-20 16:52:00  0.768918  0.291137
2010-05-20 16:54:00  0.486348  0.291098
2010-05-20 16:56:00  0.679610  0.291099
2010-05-20 16:58:00  0.951798  0.291114
2010-05-20 17:00:00  0.059935  0.291109

To resample this time series so as to get one value per day, you could use the resample method and aggregate the values by taking the mean:

df['std'].resample('D').mean()

yields

...
2010-05-16    0.289019
2010-05-17    0.289988
2010-05-18    0.289713
2010-05-19    0.289269
2010-05-20    0.288890
Freq: D, Name: std, Length: 121

Above, we computed the rolling standard deviation and then resampled to a time series with daily frequency.

If we were to resample the original data to daily frequency first and then compute the rolling standard deviation, the result would in general be different.

Note also that your data looks like it has quite a bit of variation within each day, so resampling by taking the mean might (wrongly?) hide that variation. So it is probably better to compute the std first.
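To make the difference between the two orders concrete, here is a minimal sketch on synthetic data. The series `val`, the one-month date range, and the use of normally distributed noise are assumptions made for illustration; they are not the asker's data.

```python
import numpy as np
import pandas as pd

# Synthetic 2-minute series (illustrative values only)
rng = np.random.default_rng(0)
index = pd.date_range("2010-01-20 05:00", "2010-02-20 17:00", freq="2min")
val = pd.Series(rng.normal(size=len(index)), index=index)

# Number of 2-minute rows in 8 days
window = int(pd.Timedelta("8D") / pd.Timedelta("2min"))

# Order 1: rolling std on the 2-minute data, then one value per day
std_then_daily = val.rolling(window).std().resample("D").mean()

# Order 2: daily means first, then an 8-row rolling std of those means
daily_then_std = val.resample("D").mean().rolling(8).std()

comparison = pd.DataFrame(
    {"std_then_daily": std_then_daily, "daily_then_std": daily_then_std}
).dropna()
print(comparison.tail())
```

In this sketch the second column comes out much smaller than the first, because averaging within each day removes most of the intraday variation before the standard deviation is taken.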

If the time series does not have a constant frequency:

If you have enough memory, I think the easiest way to deal with this situation is to use asfreq to expand the time series to one that has a constant frequency.

import pandas as pd
import numpy as np
np.random.seed(1)

# make an example df with roughly half of the rows removed at random
index = pd.date_range(start='2010-01-20 5:00', end='2010-05-20 17:00', freq='2min')
N = len(index)
df = pd.DataFrame({'val': np.random.random(N)}, index=index)
mask = np.random.randint(2, size=N).astype(bool)
df = df.loc[mask]

# expand the time series, filling in missing values with NaN
df = df.asfreq('2min', method=None)

# now we can use the constant-frequency solution
window_size = int(pd.Timedelta('8D') / pd.Timedelta('2min'))
# min_periods=1 lets a window that contains NaN rows still produce a value
df['std'] = df['val'].rolling(window=window_size, min_periods=1).std()

result = df['std'].resample('D').mean()
print(result.head())

yields

2010-01-20    0.301834
2010-01-21    0.292505
2010-01-22    0.293897
2010-01-23    0.291018
2010-01-24    0.290444
Freq: D, Name: std, dtype: float64
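To see what asfreq does to a gap, here is a tiny sketch. The three timestamps and values are made up for illustration, with the 15:10 observation deliberately missing.

```python
import pandas as pd

# Three observations with the 15:10 row missing (illustrative values)
s = pd.Series(
    [0.044, 0.102, 0.050],
    index=pd.to_datetime(
        ["2010-01-21 15:08", "2010-01-21 15:12", "2010-01-21 15:14"]
    ),
)

# asfreq puts the series on a regular 2-minute grid; the gap becomes NaN
expanded = s.asfreq("2min")
print(expanded)

# The rolling std skips the NaN row, so the last window uses the 3 real values
rolled = expanded.rolling(4, min_periods=1).std()
print(rolled)
```

Because the NaN rows still occupy a slot in the window, a window of a fixed number of rows again corresponds to a fixed span of time.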

The alternative to expanding the time series is to write code to compute the correct sub-Series for each 8-day window. While this is possible, the fact that you would have to compute this for each row of the time series could make this method very slow. Thus, I think the faster approach is to expand the time series.
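As a possible third option that is not part of the original answer: pandas versions that accept an offset string as the rolling window can look back over a span of time directly, without expanding the series. The sketch below assumes such a version and a sorted DatetimeIndex; the series `irregular` and its random gaps are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic 2-minute series with roughly half the rows dropped at random
rng = np.random.default_rng(1)
full_index = pd.date_range("2010-01-20 05:00", "2010-02-20 17:00", freq="2min")
val = pd.Series(rng.random(len(full_index)), index=full_index)
irregular = val[rng.random(len(val)) < 0.5]

# Offset window: each row looks back over the preceding 8 days of timestamps,
# however many rows that happens to contain
rolling_std = irregular.rolling("8D", min_periods=2).std()

# One value per day, as in the constant-frequency solution
daily = rolling_std.resample("D").mean()
print(daily.head())
```

Note that with an offset window the early rows are computed from less than 8 days of data rather than being NaN, so the first days of the result should be read with that in mind.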
