Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/37317727/

Deprecated rolling window option in OLS from Pandas to Statsmodels

Tags: python, pandas, deprecated, statsmodels

Asked by Asher11

As the title suggests: where has the rolling option of the ols function in pandas migrated to in statsmodels? I can't seem to find it. Pandas tells me doom is in the works:

FutureWarning: The pandas.stats.ols module is deprecated and will be removed in a future version. We refer to external packages like statsmodels, see some examples here: http://statsmodels.sourceforge.net/stable/regression.html
  model = pd.ols(y=series_1, x=mmmm, window=50)

In fact, if you do something like:

import statsmodels.api as sm

model = sm.OLS(series_1, mmmm, window=50).fit()

print(model.summary())

You get results (the window argument does not stop the code from running), but only the parameters of a regression run over the entire period, not the series of parameters for each rolling window that it is supposed to produce.
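
To make the expected output concrete, here is a minimal sketch of what "a series of parameters for each rolling window" looks like. It is not the pandas or statsmodels implementation: the helper name rolling_ols_params and the synthetic data are assumptions for illustration, and the fit is done window by window with np.linalg.lstsq. Recent statsmodels releases also ship a RollingOLS class in statsmodels.regression.rolling, which may be worth checking for your installed version.

```python
import numpy as np
import pandas as pd

def rolling_ols_params(y, x, window):
    """Fit y = const + b*x on each trailing window; one row of coefficients per window.

    Hypothetical helper for illustration; y and x are pandas Series on the same index.
    """
    X = np.column_stack([np.ones(len(x)), x.to_numpy()])
    yv = y.to_numpy()
    rows = []
    for end in range(window, len(x) + 1):
        coef, *_ = np.linalg.lstsq(X[end - window:end], yv[end - window:end], rcond=None)
        rows.append(coef)
    # Label each window by the date of its last observation
    return pd.DataFrame(rows, index=x.index[window - 1:], columns=["const", x.name])

# Synthetic, noise-free data (an assumption): the slope switches from 1 to 3 halfway through
rng = np.random.default_rng(0)
idx = pd.date_range("2000-01-31", periods=200, freq="D")
x = pd.Series(rng.normal(size=200), index=idx, name="x")
slope = np.where(np.arange(200) < 100, 1.0, 3.0)
y = pd.Series(0.5 + slope * x.to_numpy(), index=idx, name="y")

params = rolling_ols_params(y, x, window=50)
print(params.iloc[[0, -1]])  # early windows recover slope 1, late windows slope 3
```

A single full-sample fit would return one blended slope here, which is the behaviour described in the question.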

Answered by Brad Solomon

I created an ols module designed to mimic pandas' deprecated MovingOLS; it is here.

It has three core classes:

  • OLS: static (single-window) ordinary least-squares regression. The outputs are NumPy arrays.
  • RollingOLS: rolling (multi-window) ordinary least-squares regression. The outputs are higher-dimensional NumPy arrays.
  • PandasRollingOLS: wraps the results of RollingOLS in pandas Series and DataFrames. Designed to mimic the look of the deprecated pandas module.

Note that the module is part of a package (which I'm currently in the process of uploading to PyPI) and it requires one inter-package import.

The first two classes above are implemented entirely in NumPy and primarily use matrix algebra. RollingOLS also makes extensive use of broadcasting. Attributes largely mimic statsmodels' OLS RegressionResultsWrapper.

An example:

import urllib.parse
import pandas as pd
from pyfinance.ols import PandasRollingOLS

# You can also do this with pandas-datareader; here's the hard way
url = "https://fred.stlouisfed.org/graph/fredgraph.csv"

syms = {
    "TWEXBMTH" : "usd", 
    "T10Y2YM" : "term_spread", 
    "GOLDAMGBD228NLBM" : "gold",
}

params = {
    "fq": "Monthly,Monthly,Monthly",
    "id": ",".join(syms.keys()),
    "cosd": "2000-01-01",
    "coed": "2019-02-01",
}

data = pd.read_csv(
    url + "?" + urllib.parse.urlencode(params, safe=","),
    na_values={"."},
    parse_dates=["DATE"],
    index_col=0
).pct_change().dropna().rename(columns=syms)
print(data.head())
#                  usd  term_spread      gold
# DATE                                       
# 2000-02-01  0.012580    -1.409091  0.057152
# 2000-03-01 -0.000113     2.000000 -0.047034
# 2000-04-01  0.005634     0.518519 -0.023520
# 2000-05-01  0.022017    -0.097561 -0.016675
# 2000-06-01 -0.010116     0.027027  0.036599

y = data.usd
x = data.drop('usd', axis=1)

window = 12  # months
model = PandasRollingOLS(y=y, x=x, window=window)

print(model.beta.head())  # Coefficients excluding the intercept
#             term_spread      gold
# DATE                             
# 2001-01-01     0.000033 -0.054261
# 2001-02-01     0.000277 -0.188556
# 2001-03-01     0.002432 -0.294865
# 2001-04-01     0.002796 -0.334880
# 2001-05-01     0.002448 -0.241902

print(model.fstat.head())
# DATE
# 2001-01-01    0.136991
# 2001-02-01    1.233794
# 2001-03-01    3.053000
# 2001-04-01    3.997486
# 2001-05-01    3.855118
# Name: fstat, dtype: float64

print(model.rsq.head())  # R-squared
# DATE
# 2001-01-01    0.029543
# 2001-02-01    0.215179
# 2001-03-01    0.404210
# 2001-04-01    0.470432
# 2001-05-01    0.461408
# Name: rsq, dtype: float64
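
The answer describes the NumPy classes as relying on matrix algebra and broadcasting. The following is a rough sketch of that general idea, not pyfinance's actual code: the function name rolling_ols_vectorized and the synthetic data are assumptions, and it uses numpy.lib.stride_tricks.sliding_window_view (NumPy 1.20 or later) to build all windows at once and solve the normal equations for every window in a single batched call.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # requires NumPy >= 1.20

def rolling_ols_vectorized(y, X, window):
    """Return an (n_windows, k) array of OLS coefficients, one row per window.

    y: (n,) array; X: (n, k) array that already includes a constant column if wanted.
    Hypothetical helper for illustration only.
    """
    # sliding_window_view appends the window axis last: (n_windows, k, window)
    Xw = sliding_window_view(X, window, axis=0).transpose(0, 2, 1)  # (n_windows, window, k)
    yw = sliding_window_view(y, window)                             # (n_windows, window)
    Xt = Xw.transpose(0, 2, 1)                                      # (n_windows, k, window)
    XtX = Xt @ Xw                                                   # (n_windows, k, k)
    Xty = Xt @ yw[..., None]                                        # (n_windows, k, 1)
    # Batched solve of (X'X) beta = X'y for every window
    return np.linalg.solve(XtX, Xty)[..., 0]

# Synthetic data (an assumption): constant plus two regressors
rng = np.random.default_rng(1)
n, window = 120, 12
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.1, 2.0, -1.0]) + 0.01 * rng.normal(size=n)

betas = rolling_ols_vectorized(y, X, window)
print(betas.shape)  # (109, 3): n - window + 1 windows, 3 coefficients each
```

Solving the normal equations directly is fast but can be numerically fragile when regressors are nearly collinear within a window; a per-window least-squares solver is the more cautious choice in that case.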

Answered by citynorman

Rolling beta with sklearn

import pandas as pd
from sklearn import linear_model

def rolling_beta(X, y, idx, window=255):

    assert len(X)==len(y)

    out_dates = []
    out_beta = []

    model_ols = linear_model.LinearRegression()

    for iStart in range(0, len(X)-window):        
        iEnd = iStart+window

        model_ols.fit(X[iStart:iEnd], y[iStart:iEnd])

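        # Store output. The window covers rows iStart..iEnd-1, while the
        # label idx[iEnd] is the first row after the window.
        out_dates.append(idx[iEnd])
        out_beta.append(model_ols.coef_[0][0])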

    return pd.DataFrame({'beta':out_beta}, index=out_dates)


df_beta = rolling_beta(df_rtn_stocks['NDX'].values.reshape(-1, 1), df_rtn_stocks['CRM'].values.reshape(-1, 1), df_rtn_stocks.index.values, 255)
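
For the single-regressor case, the same rolling beta can also be obtained without an explicit loop, since the OLS slope with an intercept equals cov(x, y) / var(x) inside each window. The sketch below uses pandas' rolling cov and var; the series names market and stock and the synthetic returns are assumptions standing in for the df_rtn_stocks columns, which are not defined in the answer.

```python
import numpy as np
import pandas as pd

# Synthetic daily returns (an assumption); "stock" has a true beta of 1.5 to "market"
rng = np.random.default_rng(2)
idx = pd.date_range("2020-01-01", periods=300, freq="D")
market = pd.Series(rng.normal(0, 0.01, size=300), index=idx, name="market")
stock = pd.Series(1.5 * market.to_numpy() + rng.normal(0, 0.005, size=300),
                  index=idx, name="stock")

window = 60
# Slope of stock on market within each trailing window
beta = market.rolling(window).cov(stock) / market.rolling(window).var()
beta = beta.dropna()  # the first window - 1 rows have no full window
print(beta.tail())
```

Here each value is labelled with the last date inside its window, so the index is shifted by one row relative to the loop above, which labels a window with the row that follows it.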

Answered by Pythonic

For completeness, here is a speedier NumPy-only solution that limits the calculation to the regression coefficients and the final estimate.

NumPy rolling regression function

import numpy as np
import pandas as pd

def rolling_regression(y, x, window=60):
    """
    y and x must be pandas.Series
    """
    # === Clean-up ============================================================
    x = x.dropna()
    y = y.dropna()
    # === Trim acc to shortest ================================================
    if x.index.size > y.index.size:
        x = x[y.index]
    else:
        y = y[x.index]
    # === Verify enough space =================================================
    if x.index.size < window:
        return None
    # === Add a constant if needed ============================================
    X = x.to_frame()
    X['c'] = 1
    # === Loop... this can be improved ========================================
    estimate_data = []
    for i in range(window, x.index.size + 1):
        X_slice = X.values[i-window:i, :]  # always index in np as opposed to pandas, much faster
        y_slice = y.values[i-window:i]
        coeff = np.dot(np.dot(np.linalg.inv(np.dot(X_slice.T, X_slice)), X_slice.T), y_slice)
        # Fitted value at the last observation of the current window
        estimate_data.append(coeff[0] * x.values[i-1] + coeff[1])
    # === Assemble ============================================================
    estimate = pd.Series(data=estimate_data, index=x.index[window-1:])
    return estimate


Notes

In some specific use cases that only require the final estimate of the regression, x.rolling(window=60).apply(my_ols) appears to be somewhat slow.

As a reminder, the coefficients of a regression can be calculated as a matrix product, as you can read on Wikipedia's least squares page. Doing this with NumPy's matrix multiplication can speed up the process somewhat compared with using the OLS in statsmodels. This product is expressed in the line starting with coeff = ...
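
The matrix product in question is the normal-equations solution, beta = (X'X)^(-1) X'y. As a small sanity check under assumed synthetic data, the sketch below computes it three ways for a single window: with an explicit inverse (as in the coeff = ... line), with np.linalg.solve, and with np.linalg.lstsq. All three should agree on well-conditioned data.

```python
import numpy as np

# Synthetic single window (an assumption): one regressor plus a constant column,
# in the same column order as the function above (regressor first, constant second)
rng = np.random.default_rng(3)
X = np.column_stack([rng.normal(size=60), np.ones(60)])
y = 0.8 * X[:, 0] + 0.2 + 0.05 * rng.normal(size=60)

coeff_inv = np.linalg.inv(X.T @ X) @ X.T @ y          # explicit inverse of X'X
coeff_solve = np.linalg.solve(X.T @ X, X.T @ y)       # solve (X'X) beta = X'y
coeff_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # general least-squares solver

print(coeff_inv, coeff_solve, coeff_lstsq)
```

The explicit inverse is the shortest to write, while solve and lstsq tend to behave better when the columns of X are close to collinear.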