Python: look back n days rolling standard deviation (pandas)
Source: http://stackoverflow.com/questions/29108785/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by Julia
I have a question about dealing with the rolling standard deviation:
The data frame looks like this:
2010-01-20 05:00:00 -0.011
2010-01-20 05:02:00 -0.032
2010-01-20 05:02:00 -0.037
2010-01-20 05:04:00 0.001
2010-01-20 05:06:00 0.023
2010-01-20 05:06:00 0.011
2010-01-20 05:08:00 0.049
2010-01-20 05:10:00 0.102
....
2010-05-20 17:00:00 0.022
This is 2-minute data from 5am to 5pm. (The index is a timestamp in the format 'yyyy-mm-dd hh:mm:ss'.)
I want to calculate the standard deviation with an 8-day look-back. My intuition is to split the data frame into daily data sets and then calculate the rolling standard deviation, but I don't know how to deal with this index, and I guess my method may take a lot of time to compute. Thanks a lot for your help!
Finally, I would like a result like this:
2010-01-20 0.0
2010-01-21 0.0
2010-01-22 0.0
....
2010-01-26 0.0
2010-01-27 0.12
2010-01-28 0.02
2010-01-29 0.07
...
2010-05-20 0.10
Thank you for your help, @unutbu.
I just found a problem in the data: the data frame does not contain every 2-minute interval. For example:
2010-01-21 15:08:00 0.044
2010-01-22 05:10:00 0.102
The data ends at 15:08:00 on 2010-01-21 and starts again at 05:10:00 on 2010-01-22, so setting the window size to a constant may not fix this problem. Any suggestions? Thanks a lot.
Answered by unutbu
If the time series has a constant frequency:
You could compute the number of 2-minute intervals in 8 days:
window_size = pd.Timedelta('8D') // pd.Timedelta('2min')
and then compute the rolling standard deviation with window=window_size (pd.rolling_std in old pandas versions, Series.rolling(...).std() in current ones):
import pandas as pd
import numpy as np
np.random.seed(1)
# regular 2-minute index covering the whole period
index = pd.date_range(start='2010-01-20 5:00', end='2010-05-20 17:00', freq='2min')
N = len(index)
df = pd.DataFrame({'val': np.random.random(N)}, index=index)
# the number of 2-minute intervals in 8 days (integer division gives an int)
window_size = pd.Timedelta('8D') // pd.Timedelta('2min') # 5760
df['std'] = df['val'].rolling(window=window_size).std()
print(df.tail())
yields
val std
2010-05-20 16:52:00 0.768918 0.291137
2010-05-20 16:54:00 0.486348 0.291098
2010-05-20 16:56:00 0.679610 0.291099
2010-05-20 16:58:00 0.951798 0.291114
2010-05-20 17:00:00 0.059935 0.291109
To resample this time series so as to get one value per day, you could use the resample method and aggregate the values by taking the mean:
df['std'].resample('D').mean()
yields
...
2010-05-16 0.289019
2010-05-17 0.289988
2010-05-18 0.289713
2010-05-19 0.289269
2010-05-20 0.288890
Freq: D, Name: std, Length: 121
Above, we computed the rolling standard deviation and then resampled to a time series with daily frequency.
If we were to resample the original data to daily frequency first and then compute the rolling standard deviation, the result would in general be different.
Note also that your data looks like it has quite a bit of variation within each day, so resampling by taking the mean might (wrongly?) hide that variation. So it is probably better to compute the std first.
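To make the difference between the two orders concrete, here is a small sketch on synthetic data. The series, the seed, and the names std_then_daily and daily_then_std are made up for illustration and are not from the question; with uniform random values the daily means barely move, so the second order reports a much smaller spread.

```python
import numpy as np
import pandas as pd

# Synthetic 2-minute series (an assumption, not the asker's data)
rng = np.random.default_rng(0)
index = pd.date_range('2010-01-20 05:00', '2010-02-10 17:00', freq='2min')
s = pd.Series(rng.random(len(index)), index=index)

# Number of 2-minute rows in an 8-day window
window = pd.Timedelta('8D') // pd.Timedelta('2min')

# Order A: rolling std on the 2-minute data, then one mean value per day
std_then_daily = s.rolling(window).std().resample('D').mean()

# Order B: daily means first, then a rolling std over 8 daily rows
daily_then_std = s.resample('D').mean().rolling(8).std()

# Keep only the days where both orders have a value
comparison = pd.DataFrame({
    'std_then_daily': std_then_daily,
    'daily_then_std': daily_then_std,
}).dropna()
print(comparison.tail())
```

Order B measures how much the daily averages vary from day to day, not how much the 2-minute values vary, which is why it hides the intraday variation.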
If the time series does not have a constant frequency:
If you have enough memory, I think the easiest way to deal with this situation is to use asfreq to expand the time series to one that has a constant frequency.
import pandas as pd
import numpy as np
np.random.seed(1)
# make an example df with a regular 2-minute index
index = pd.date_range(start='2010-01-20 5:00', end='2010-05-20 17:00', freq='2min')
N = len(index)
df = pd.DataFrame({'val': np.random.random(N)}, index=index)
# randomly drop about half of the rows so the frequency is no longer constant
mask = np.random.randint(2, size=N).astype(bool)
df = df.loc[mask]
# expand the time series, filling in missing values with NaN
df = df.asfreq('2min', method=None)
# now we can use the constant-frequency solution
window_size = pd.Timedelta('8D') // pd.Timedelta('2min')
# min_periods=1 lets windows that contain NaN still produce a value
df['std'] = df['val'].rolling(window=window_size, min_periods=1).std()
result = df['std'].resample('D').mean()
print(result.head())
yields
2010-01-20 0.301834
2010-01-21 0.292505
2010-01-22 0.293897
2010-01-23 0.291018
2010-01-24 0.290444
Freq: D, Name: std, dtype: float64
The alternative to expanding the time series is to write code to compute the correct sub-Series for each 8-day window. While this is possible, the fact that you would have to compute this for each row of the time series could make this method very slow. Thus, I think the faster approach is to expand the time series.
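If your pandas version supports offset-based windows in rolling (an assumption about your environment), there is a third option that avoids both the expansion and a hand-written loop: pass the window as a time span such as '8D'. The sketch below uses made-up data and names (irregular, rolled, reference) and checks that it agrees with the asfreq approach on the rows that actually exist. It requires a sorted DatetimeIndex.

```python
import numpy as np
import pandas as pd

# Synthetic 2-minute series with about half of the rows removed (illustrative only)
rng = np.random.default_rng(0)
index = pd.date_range('2010-01-20 05:00', '2010-02-10 17:00', freq='2min')
s = pd.Series(rng.random(len(index)), index=index)
irregular = s[rng.random(len(index)) < 0.5]

# Time-based window: each row looks back over the previous 8 days of timestamps,
# however many observations that happens to contain
rolled = irregular.rolling('8D').std()

# One value per day, as in the examples above
daily = rolled.resample('D').mean()
print(daily.tail())

# Cross-check against the asfreq approach on the rows present in the data
n = pd.Timedelta('8D') // pd.Timedelta('2min')
expanded = irregular.asfreq('2min')
reference = expanded.rolling(n, min_periods=1).std().loc[irregular.index]
print(np.allclose(rolled.values, reference.values, equal_nan=True))
```

The two agree here because a window of 5760 two-minute rows and the default right-closed '8D' window cover the same timestamps. The time-based version never materializes the NaN rows, so it needs less memory.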

