Original source: http://stackoverflow.com/questions/37317727/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Deprecated rolling window option in OLS from Pandas to Statsmodels
Asked by Asher11
As the title suggests: where has the rolling-window option of the ols command in pandas migrated to in statsmodels? I can't seem to find it. pandas tells me doom is in the works:
FutureWarning: The pandas.stats.ols module is deprecated and will be removed in a future version. We refer to external packages like statsmodels, see some examples here: http://statsmodels.sourceforge.net/stable/regression.html
model = pd.ols(y=series_1, x=mmmm, window=50)
In fact, if you do something like:
import statsmodels.api as sm
model = sm.OLS(series_1, mmmm, window=50).fit()
print(model.summary())
You get results (the window argument does not stop the code from running), but you get only the parameters of the regression run on the entire period, not the series of parameters for each of the rolling periods it is supposed to work on.
Answered by Brad Solomon
I created an ols module designed to mimic pandas' deprecated MovingOLS; it is here.
It has three core classes:
OLS: static (single-window) ordinary least-squares regression. The outputs are NumPy arrays.
RollingOLS: rolling (multi-window) ordinary least-squares regression. The outputs are higher-dimension NumPy arrays.
PandasRollingOLS: wraps the results of RollingOLS in pandas Series & DataFrames. Designed to mimic the look of the deprecated pandas module.
Note that the module is part of a package (which I'm currently in the process of uploading to PyPI) and it requires one inter-package import.
The first two classes above are implemented entirely in NumPy and primarily use matrix algebra. RollingOLS also takes advantage of broadcasting extensively. Attributes largely mimic statsmodels' OLS RegressionResultsWrapper.
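To illustrate the idea of combining matrix algebra with broadcasting, here is a minimal sketch of a rolling OLS that solves the normal equations for all windows at once. It is not the module's actual code: the function name rolling_ols_coefs is hypothetical, and it assumes NumPy 1.20 or later for sliding_window_view.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_ols_coefs(y, X, window):
    """Return an (n_windows, k + 1) array: intercept first, then slopes."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    # Prepend a column of ones for the intercept
    X = np.column_stack([np.ones(len(X)), X])
    # Window axis is appended last: Xw is (n_windows, k + 1, window)
    Xw = sliding_window_view(X, window, axis=0)
    yw = sliding_window_view(y, window)            # (n_windows, window)
    # Stacked X'X and X'y, one per window, via batched matrix products
    XtX = Xw @ Xw.transpose(0, 2, 1)               # (n_windows, k + 1, k + 1)
    Xty = Xw @ yw[..., None]                       # (n_windows, k + 1, 1)
    # Solve every window's normal equations in one call
    return np.linalg.solve(XtX, Xty)[..., 0]

# Small synthetic demo (hypothetical data)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(40, 2))
y_demo = 0.5 + X_demo @ np.array([1.5, -2.0]) + rng.normal(scale=0.01, size=40)
coefs = rolling_ols_coefs(y_demo, X_demo, window=12)
print(coefs.shape)
```

Row j of the result corresponds to observations j through j + window - 1. Solving the stacked systems avoids a Python-level loop, at the cost of holding all windows' X'X matrices in memory.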
An example:
import urllib.parse
import pandas as pd
from pyfinance.ols import PandasRollingOLS
# You can also do this with pandas-datareader; here's the hard way
url = "https://fred.stlouisfed.org/graph/fredgraph.csv"
syms = {
    "TWEXBMTH": "usd",
    "T10Y2YM": "term_spread",
    "GOLDAMGBD228NLBM": "gold",
}
params = {
    "fq": "Monthly,Monthly,Monthly",
    "id": ",".join(syms.keys()),
    "cosd": "2000-01-01",
    "coed": "2019-02-01",
}
data = pd.read_csv(
    url + "?" + urllib.parse.urlencode(params, safe=","),
    na_values={"."},
    parse_dates=["DATE"],
    index_col=0
).pct_change().dropna().rename(columns=syms)
print(data.head())
# usd term_spread gold
# DATE
# 2000-02-01 0.012580 -1.409091 0.057152
# 2000-03-01 -0.000113 2.000000 -0.047034
# 2000-04-01 0.005634 0.518519 -0.023520
# 2000-05-01 0.022017 -0.097561 -0.016675
# 2000-06-01 -0.010116 0.027027 0.036599
y = data.usd
x = data.drop('usd', axis=1)
window = 12 # months
model = PandasRollingOLS(y=y, x=x, window=window)
print(model.beta.head()) # Coefficients excluding the intercept
# term_spread gold
# DATE
# 2001-01-01 0.000033 -0.054261
# 2001-02-01 0.000277 -0.188556
# 2001-03-01 0.002432 -0.294865
# 2001-04-01 0.002796 -0.334880
# 2001-05-01 0.002448 -0.241902
print(model.fstat.head())
# DATE
# 2001-01-01 0.136991
# 2001-02-01 1.233794
# 2001-03-01 3.053000
# 2001-04-01 3.997486
# 2001-05-01 3.855118
# Name: fstat, dtype: float64
print(model.rsq.head()) # R-squared
# DATE
# 2001-01-01 0.029543
# 2001-02-01 0.215179
# 2001-03-01 0.404210
# 2001-04-01 0.470432
# 2001-05-01 0.461408
# Name: rsq, dtype: float64
Answered by citynorman
Rolling beta with sklearn
import pandas as pd
from sklearn import linear_model

def rolling_beta(X, y, idx, window=255):
    # X and y are 2-D arrays of shape (n, 1); idx holds the matching dates
    assert len(X) == len(y)
    out_dates = []
    out_beta = []
    model_ols = linear_model.LinearRegression()
    for iStart in range(0, len(X) - window):
        iEnd = iStart + window
        model_ols.fit(X[iStart:iEnd], y[iStart:iEnd])
        # store output; the fit uses rows [iStart, iEnd) and is labelled with the following date
        out_dates.append(idx[iEnd])
        out_beta.append(model_ols.coef_[0][0])
    return pd.DataFrame({'beta': out_beta}, index=out_dates)

# df_rtn_stocks is assumed to be a DataFrame of returns with 'NDX' and 'CRM' columns
df_beta = rolling_beta(
    df_rtn_stocks['NDX'].values.reshape(-1, 1),
    df_rtn_stocks['CRM'].values.reshape(-1, 1),
    df_rtn_stocks.index.values,
    255,
)
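For a single regressor with an intercept, the OLS slope equals cov(x, y) / var(x), so a rolling beta can also be computed without an explicit loop. The sketch below swaps in that identity using pandas' rolling cov and var; it is an alternative technique, not the code above. The series market and stock are synthetic and hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic daily returns (hypothetical data)
rng = np.random.default_rng(42)
n = 300
idx = pd.date_range("2020-01-01", periods=n, freq="D")
market = pd.Series(rng.normal(scale=0.01, size=n), index=idx, name="market")
stock = 1.3 * market + pd.Series(rng.normal(scale=0.005, size=n), index=idx)

window = 60
# Slope of stock on market in each window: rolling covariance / rolling variance
beta = stock.rolling(window).cov(market) / market.rolling(window).var()
print(beta.dropna().tail())
```

Note that the labelling differs from rolling_beta: here each value is indexed by the last date inside its window rather than by the date that follows it.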
Answered by Pythonic
Adding for completeness a speedier numpy-only solution, which limits calculations to the regression coefficients and the final estimate.
Numpy rolling regression function
import numpy as np
import pandas as pd

def rolling_regression(y, x, window=60):
    """
    y and x must be pandas.Series
    """
    # === Clean-up ============================================================
    x = x.dropna()
    y = y.dropna()
    # === Trim acc to shortest ================================================
    # assumes the shorter index is a subset of the longer one
    if x.index.size > y.index.size:
        x = x[y.index]
    else:
        y = y[x.index]
    # === Verify enough space =================================================
    if x.index.size < window:
        return None
    # === Add a constant if needed ============================================
    X = x.to_frame()
    X['c'] = 1
    # === Loop... this can be improved ========================================
    estimate_data = []
    for i in range(window, x.index.size + 1):
        X_slice = X.values[i-window:i, :]  # always index in np as opposed to pandas, much faster
        y_slice = y.values[i-window:i]
        coeff = np.dot(np.dot(np.linalg.inv(np.dot(X_slice.T, X_slice)), X_slice.T), y_slice)
        # fitted value at the last observation of the current window
        estimate_data.append(coeff[0] * x.values[i-1] + coeff[1])
    # === Assemble ============================================================
    estimate = pd.Series(data=estimate_data, index=x.index[window-1:])
    return estimate
Notes
In some specific use cases, which only require the final estimate of the regression, x.rolling(window=60).apply(my_ols) appears to be somewhat slow.
As a reminder, the coefficients for a regression can be calculated as a matrix product, as you can read on Wikipedia's least squares page. This approach via numpy's matrix multiplication can speed up the process somewhat compared with using the ols in statsmodels. This product is expressed in the line starting with coeff = ...
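The matrix product in question is the normal-equations solution, beta = (X'X)^-1 X'y. As a quick sanity check, the sketch below computes it on a single synthetic window and compares it with np.linalg.lstsq, which solves the same least-squares problem and is generally the more numerically stable choice when X is ill-conditioned. The data and variable names are hypothetical.

```python
import numpy as np

# One synthetic window of 60 observations (hypothetical data)
rng = np.random.default_rng(7)
x = rng.normal(size=60)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=60)

# Slope column first, constant second (same column order as in the function above)
X = np.column_stack([x, np.ones_like(x)])

# Normal equations: beta = (X'X)^-1 X'y
coeff_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# Reference solution from NumPy's least-squares solver
coeff_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(coeff_normal, coeff_lstsq)
```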