pandas Python - Rolling Window OLS Regression Estimation

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/44759309/


Python - Rolling window OLS Regression estimation

Tags: python, pandas, numpy, scikit-learn, statsmodels

Asked by Desta Haileselassie Hagos

For my evaluation, I have a dataset, found at this link (https://drive.google.com/drive/folders/0B2Iv8dfU4fTUMVFyYTEtWXlzYkk), in the following format. The third column (Y) in my dataset is my true value - that's what I want to predict (estimate).


 time     X   Y
0.000543  0  10
0.000575  0  10
0.041324  1  10
0.041331  2  10
0.041336  3  10
0.04134   4  10
  ...
9.987735  55 239
9.987739  56 239
9.987744  57 239
9.987749  58 239
9.987938  59 239

I want to run a rolling OLS regression estimation with a window of, for example, 5, and I have tried it with the following script.


#!/usr/bin/python -tt

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('estimated_pred.csv')

# Note: pd.stats.ols.MovingOLS exists only in pandas < 0.20.0 (it was removed in 0.20.0)
model = pd.stats.ols.MovingOLS(y=df.Y, x=df[['X']],
                               window_type='rolling', window=5, intercept=True)
df['Y_hat'] = model.y_predict

print(df['Y_hat'])
print (model.summary)
df.plot.scatter(x='X', y='Y', s=0.1)

The summary of the regression analysis is shown below.


   -------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <X> + <intercept>

Number of Observations:         5
Number of Degrees of Freedom:   2

R-squared:           -inf
Adj R-squared:       -inf

Rmse:              0.0000

F-stat (1, 3):        nan, p-value:        nan

Degrees of Freedom: model 1, resid 3

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             X     0.0000     0.0000       1.97     0.1429     0.0000     0.0000
     intercept   239.0000     0.0000 14567091934632472.00     0.0000   239.0000   239.0000
---------------------------------End of Summary---------------------------------

[Figure: scatter plot of X vs Y from the dataset]

I want to do a prediction of Y at t+1 (i.e. predict the next value of Y according to the previous values, p(Y)t+1) and include the mean squared error (MSE). For example, if we look at row 5, the value of X is 2 and the value of Y is 10. Let's say the prediction value p(Y)t+1 is 6, and therefore the MSE will be (10-6)^2. How can we do this using either statsmodels or scikit-learn, since pd.stats.ols.MovingOLS was removed in pandas version 0.20.0 and I can't find any reference?

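For illustration, here is a minimal sketch of that error computation using scikit-learn's mean_squared_error; the arrays below are hypothetical observed values and one-step-ahead predictions (10 and 6 are the numbers from the example above, the rest are made up):

import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical observed Y values and one-step-ahead predictions p(Y)t+1
y_true = np.array([10, 10, 10])
y_pred = np.array([6, 9, 11])

# MSE = mean of (10-6)^2, (10-9)^2 and (10-11)^2 = (16 + 1 + 1) / 3
print(mean_squared_error(y_true, y_pred))  # 6.0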

Answered by Vlox

Here is an outline of doing rolling OLS with statsmodels; it should work for your data. Simply use df=pd.read_csv('estimated_pred.csv') instead of my randomly generated df:


import pandas as pd
import numpy as np
import statsmodels.api as sm

#random data
#df=pd.DataFrame(np.random.normal(size=(500,3)),columns=['time','X','Y'])
df=pd.read_csv('estimated_pred.csv')
df=df.dropna() #drop nans, if any, before fitting
window = 5

df['a']=None #constant
df['b1']=None #beta1
df['b2']=None #beta2
for i in range(window,len(df)):
    temp=df.iloc[i-window:i,:]
    RollOLS=sm.OLS(temp.loc[:,'Y'],sm.add_constant(temp.loc[:,['time','X']])).fit()
    df.iloc[i,df.columns.get_loc('a')]=RollOLS.params[0]
    df.iloc[i,df.columns.get_loc('b1')]=RollOLS.params[1]
    df.iloc[i,df.columns.get_loc('b2')]=RollOLS.params[2]

#The following line gives you predicted values in a row, given the PRIOR row's estimated parameters
df['predicted']=df['a'].shift(1)+df['b1'].shift(1)*df['time']+df['b2'].shift(1)*df['X']

I store the constant and betas, but there are a number of ways to approach predicting... you can use your fitted model object (mine is RollOLS) and the .predict() method, or multiply it out yourself as I did in the final line (it is easier to do it this way in this case because the number of variables is fixed and known, and you can do simple column math all in one go).


To do predictions with sm as you go, though, it would look like this:


predict_x=np.random.normal(size=(20,2))
RollOLS.predict(sm.add_constant(predict_x))

But keep in mind that if you run the above code in sequence, the predicted values will use the model of the last window only. If you want to use a different model, you can save the fitted models as you go, or predict values within the for loop. Note that you can also get fitted values with RollOLS.fittedvalues, so if you are smoothing the data, pull and save RollOLS.fittedvalues[-1] for each iteration of the loop.

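For example, here is a minimal sketch of predicting within the for loop (assuming the same df, window, and sm import as the snippet above), collecting the one-step-ahead forecasts so that an out-of-sample MSE can be computed afterwards:

#Sketch: refit on rows i-window..i-1, forecast row i, and compare against the observed Y
exog = sm.add_constant(df[['time', 'X']])  #columns: const, time, X (same order as in the loop above)
forecasts = []
for i in range(window, len(df)):
    fit = sm.OLS(df['Y'].iloc[i-window:i], exog.iloc[i-window:i]).fit()
    forecasts.append(fit.predict(exog.iloc[[i]]).iloc[0])  #one-step-ahead forecast for row i

actual = df['Y'].iloc[window:].values
mse = np.mean((actual - np.array(forecasts))**2)
print(mse)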



To help see how to use this with your own data, here is the tail of my df after the rolling regression loop is run:


      time         X           Y           a           b1          b2
495 0.662463    0.771971    0.643008    -0.0235751  0.037875    0.0907694
496 -0.127879   1.293141    0.404959    0.00314073  0.0441054   0.113387
497 -0.006581   -0.824247   0.226653    0.0105847   0.0439867   0.118228
498 1.870858    0.920964    0.571535    0.0123463   0.0428359   0.11598
499 0.724296    0.537296    -0.411965   0.00104044  0.055003    0.118953
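Tying this back to the MSE asked about in the question: once the predicted column from the last line of the loop code exists, the MSE is a short computation. This is a sketch assuming the df produced above; the cast handles the object dtype of the None-initialised columns, and rows without a prediction become NaN and are skipped by .mean().

#Mean squared error of the out-of-sample 'predicted' column against the observed Y
pred = pd.to_numeric(df['predicted'], errors='coerce')  #object dtype -> float, missing -> NaN
mse = ((df['Y'] - pred)**2).mean()  #.mean() skips the leading NaN rows
print(mse)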