Pandas rolling regression: alternatives to looping

Notice: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/44380068/

Time: 2020-09-14 03:44:23  Source: igfitidea

python pandas numpy linear-regression statsmodels

Asked by Brad Solomon

I got good use out of pandas' MovingOLS class (source here) within the deprecated stats/ols module. Unfortunately, it was gutted completely with pandas 0.20.

The question of how to run rolling OLS regression in an efficient manner has been asked several times (here, for instance), but phrased a little broadly and left without a great answer, in my view.

Here are my questions:

  1. How can I best mimic the basic framework of pandas' MovingOLS? The most attractive feature of this class was the ability to view multiple methods/attributes as separate time series (i.e. coefficients, r-squared, t-statistics, etc.) without needing to re-run the regression. For example, you could create something like model = pd.MovingOLS(y, x) and then call .t_stat, .rmse, .std_err, and the like. In the example below, conversely, I don't see a way around being forced to compute each statistic separately. Is there a method that doesn't involve creating sliding/rolling "blocks" (strides) and running regressions/using linear algebra to get model parameters for each?

  2. More broadly, what's going on under the hood in pandas that makes rolling.apply not able to take more complex functions?* When you create a .rolling object, in layman's terms, what's going on internally--is it fundamentally different from looping over each window and creating a higher-dimensional array as I'm doing below?

*Namely, func passed to .apply:

Must produce a single value from an ndarray input.
*args and **kwargs are passed to the function.
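
To make the quoted constraint concrete, here is a minimal sketch on made-up data. A function that reduces each window to one float works with .rolling(...).apply; a function that needs to return several values per window (such as a full coefficient vector) does not fit this interface. The series values and the helper name window_range are illustrative assumptions, not part of the original question.

```python
import numpy as np
import pandas as pd

# Made-up data, for illustration only
s = pd.Series([1.0, 4.0, 2.0, 8.0, 5.0])

def window_range(w):
    # With raw=True, w arrives as a 1-D ndarray; the function returns one scalar
    return w.max() - w.min()

# One scalar per window; the first window - 1 entries are NaN
ranges = s.rolling(window=3).apply(window_range, raw=True)
print(ranges.tolist())
```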

Here's where I'm currently at with some sample data, regressing percentage changes in the trade weighted dollar on interest rate spreads and the price of copper. (This doesn't make a ton of sense; just picked these randomly.) I've taken it out of a class-based implementation and tried to strip it down to a simpler script.

from datetime import date

import numpy as np
import pandas as pd
from pandas_datareader.data import DataReader
import statsmodels.formula.api as smf

syms = {'TWEXBMTH' : 'usd', 
        'T10Y2YM' : 'term_spread', 
        'PCOPPUSDM' : 'copper'
       }

start = date(2000, 1, 1)
data = (DataReader(syms.keys(), 'fred', start)
        .pct_change()
        .dropna())
data = data.rename(columns = syms)
data = data.assign(intercept = 1.) # required by statsmodels OLS

def sliding_windows(x, window):
    """Create rolling/sliding windows of length ~window~.

    Given an array x of shape (y, z), it will return "blocks" of shape
    (y - window + 1, window, z)."""

    return np.array([x[i:i + window] for i
                    in range(0, x.shape[0] - window + 1)])

data.head(3)
Out[33]: 
                 usd  term_spread    copper  intercept
DATE                                                  
2000-02-01  0.012573    -1.409091 -0.019972        1.0
2000-03-01 -0.000079     2.000000 -0.037202        1.0
2000-04-01  0.005642     0.518519 -0.033275        1.0

window = 36
wins = sliding_windows(data.values, window=window)
y, x = wins[:, :, 0], wins[:, :, 1:]

coefs = []

for endog, exog in zip(y, x):
    model = smf.OLS(endog, exog).fit()
    # The full set of model attributes gets lost with each loop
    coefs.append(model.params)

df = pd.DataFrame(coefs, columns=data.iloc[:, 1:].columns,
                  index=data.index[window - 1:])

df.head(3) # rolling 36m coefficients
Out[70]: 
            term_spread    copper  intercept
DATE                                        
2003-01-01    -0.000122 -0.018426   0.001937
2003-02-01     0.000391 -0.015740   0.001597
2003-03-01     0.000655 -0.016811   0.001546
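
As a side note on the sliding_windows helper used in the question, the same blocks can probably be built without a Python-level loop. The sketch below assumes NumPy 1.20 or later, which provides numpy.lib.stride_tricks.sliding_window_view; the function names and the toy array are made up for this illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def sliding_windows_loop(x, window):
    # Same approach as in the question: copy one slice per window
    return np.array([x[i:i + window] for i in range(x.shape[0] - window + 1)])

def sliding_windows_view(x, window):
    # sliding_window_view appends the window axis last, giving
    # (n - window + 1, z, window); transpose to (n - window + 1, window, z)
    return sliding_window_view(x, window, axis=0).transpose(0, 2, 1)

x = np.arange(20.0).reshape(10, 2)  # toy array: 10 rows, 2 columns
a = sliding_windows_loop(x, 4)
b = sliding_windows_view(x, 4)
print(a.shape, np.array_equal(a, b))
```

The view-based version returns a read-only view rather than a copy, so it avoids duplicating the data, but it has to be copied before being modified in place.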

Accepted answer by Brad Solomon

I created an ols module designed to mimic pandas' deprecated MovingOLS; it is here.

It has three core classes:

  • OLS: static (single-window) ordinary least-squares regression. The outputs are NumPy arrays.
  • RollingOLS: rolling (multi-window) ordinary least-squares regression. The outputs are higher-dimensional NumPy arrays.
  • PandasRollingOLS: wraps the results of RollingOLS in pandas Series & DataFrames. Designed to mimic the look of the deprecated pandas module.

Note that the module is part of a package (which I'm currently in the process of uploading to PyPI) and it requires one inter-package import.

The first two classes above are implemented entirely in NumPy and primarily use matrix algebra. RollingOLS also takes extensive advantage of broadcasting. Attributes largely mimic statsmodels' OLS RegressionResultsWrapper.

An example:

import urllib.parse
import pandas as pd
from pyfinance.ols import PandasRollingOLS

# You can also do this with pandas-datareader; here's the hard way
url = "https://fred.stlouisfed.org/graph/fredgraph.csv"

syms = {
    "TWEXBMTH" : "usd", 
    "T10Y2YM" : "term_spread", 
    "GOLDAMGBD228NLBM" : "gold",
}

params = {
    "fq": "Monthly,Monthly,Monthly",
    "id": ",".join(syms.keys()),
    "cosd": "2000-01-01",
    "coed": "2019-02-01",
}

data = pd.read_csv(
    url + "?" + urllib.parse.urlencode(params, safe=","),
    na_values={"."},
    parse_dates=["DATE"],
    index_col=0
).pct_change().dropna().rename(columns=syms)
print(data.head())
#                  usd  term_spread      gold
# DATE                                       
# 2000-02-01  0.012580    -1.409091  0.057152
# 2000-03-01 -0.000113     2.000000 -0.047034
# 2000-04-01  0.005634     0.518519 -0.023520
# 2000-05-01  0.022017    -0.097561 -0.016675
# 2000-06-01 -0.010116     0.027027  0.036599

y = data.usd
x = data.drop('usd', axis=1)

window = 12  # months
model = PandasRollingOLS(y=y, x=x, window=window)

print(model.beta.head())  # Coefficients excluding the intercept
#             term_spread      gold
# DATE                             
# 2001-01-01     0.000033 -0.054261
# 2001-02-01     0.000277 -0.188556
# 2001-03-01     0.002432 -0.294865
# 2001-04-01     0.002796 -0.334880
# 2001-05-01     0.002448 -0.241902

print(model.fstat.head())
# DATE
# 2001-01-01    0.136991
# 2001-02-01    1.233794
# 2001-03-01    3.053000
# 2001-04-01    3.997486
# 2001-05-01    3.855118
# Name: fstat, dtype: float64

print(model.rsq.head())  # R-squared
# DATE
# 2001-01-01    0.029543
# 2001-02-01    0.215179
# 2001-03-01    0.404210
# 2001-04-01    0.470432
# 2001-05-01    0.461408
# Name: rsq, dtype: float64
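
The answer describes its NumPy classes only in words. As a rough illustration of the general idea (batched matrix algebra over stacked windows), and not the pyfinance implementation itself, the sketch below solves the normal equations for every window at once and derives R-squared from the same fit. The function name rolling_ols_sketch, the synthetic data, and the use of numpy.lib.stride_tricks.sliding_window_view (NumPy 1.20+) are assumptions made for this example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_ols_sketch(y, x, window):
    """Hypothetical helper: per-window OLS coefficients and R-squared.

    y has shape (n,); x has shape (n, k) and should already contain an
    intercept column if one is wanted.
    """
    # Stack the windows: xw is (m, window, k), yw is (m, window)
    xw = sliding_window_view(x, window, axis=0).transpose(0, 2, 1)
    yw = sliding_window_view(y, window)
    xt = xw.transpose(0, 2, 1)                # (m, k, window)
    xtx = xt @ xw                             # (m, k, k)
    xty = xt @ yw[..., None]                  # (m, k, 1)
    # Batched solve of the normal equations, one system per window
    beta = np.linalg.solve(xtx, xty)[..., 0]  # (m, k)
    fitted = (xw @ beta[..., None])[..., 0]   # (m, window)
    ss_res = ((yw - fitted) ** 2).sum(axis=1)
    ss_tot = ((yw - yw.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    return beta, 1.0 - ss_res / ss_tot

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
n, window = 60, 12
x = np.column_stack([rng.normal(size=(n, 2)), np.ones(n)])
y = x @ np.array([0.5, -1.0, 0.1]) + 0.1 * rng.normal(size=n)

beta, rsq = rolling_ols_sketch(y, x, window)
print(beta.shape, rsq.shape)
```

Solving the normal equations keeps the sketch short, but it can be numerically fragile when regressors are nearly collinear; a QR- or lstsq-based solve per window is the more robust choice in that case.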

Answer by L. Astorian

Use a custom rolling apply function.

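import numpy as np

# df is an existing DataFrame; 'y' is a placeholder for the column to fit.
# .rolling() is a pandas method, so call it on the Series, not on df.values.
df['slope'] = df['y'].rolling(window=125).apply(lambda x: np.polyfit(np.arange(125), x, 1)[0], raw=True)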
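
A self-contained sketch of this rolling-slope idea on synthetic data follows. The column name "price", the window of 5, and the exactly linear series are assumptions chosen so that the result is easy to check: every full window should give a slope of about 3.

```python
import numpy as np
import pandas as pd

window = 5
# Synthetic, exactly linear series: price = 3 * t + 1
df = pd.DataFrame({"price": 3.0 * np.arange(20) + 1.0})

t = np.arange(window)  # regressor: position within each window
df["slope"] = df["price"].rolling(window=window).apply(
    lambda w: np.polyfit(t, w, 1)[0],  # [0] is the slope of the degree-1 fit
    raw=True,
)
print(df["slope"].tail(3).round(6).tolist())
```

Note that this approach returns only one statistic per pass (here the slope), which is the limitation the question raises about rolling.apply.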