Linear regression with a time series in pandas

Question

Asked by canyon289

I would like to run a regression with a time series as a predictor, and I'm trying to follow the answer given in this SO answer (OLS with pandas: datetime index as predictor), but to the best of my knowledge it no longer seems to work.

Am I missing something or is there a new way to do this?

import pandas as pd

rng = pd.date_range('1/1/2011', periods=4, freq='H')
s = pd.Series(range(4), index=rng)
z = s.reset_index()

pd.ols(x=z["index"], y=z[0])

I'm getting this error. The error is self-explanatory, but I'm wondering what I'm missing in reimplementing a solution that worked before.

TypeError: cannot astype a datetimelike from [datetime64[ns]] to [float64]

Answer 1

Accepted answer by JohnE

I'm not sure why pd.ols is so picky there (it does appear to me that you followed the example correctly). I suspect this is due to changes in how pandas handles or stores datetime indexes, but I'm too lazy to explore this further. Anyway, since your datetime variable differs only in the hour, you could just extract the hour with the dt accessor:

pd.ols(x=pd.to_datetime(z["index"]).dt.hour, y=z[0])

However, that gives you an r-squared of 1, since y is an exact linear function of x (so a model with an intercept fits all four points perfectly). You could change the range to np.random.randn and then you'd get something that looks like normal regression results.

In [6]: z = pd.Series(np.random.randn(4), index = rng).reset_index()
        pd.ols(x=pd.to_datetime(z["index"]).dt.hour, y=z[0])
Out[6]: 

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         4
Number of Degrees of Freedom:   2

R-squared:         0.7743
Adj R-squared:     0.6615

Rmse:              0.5156

F-stat (1, 2):     6.8626, p-value:     0.1200

Degrees of Freedom: model 1, resid 2

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x    -0.6040     0.2306      -2.62     0.1200    -1.0560    -0.1521
     intercept     0.2915     0.4314       0.68     0.5689    -0.5540     1.1370
---------------------------------End of Summary---------------------------------
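For readers on a recent pandas release, where pd.ols is no longer available, the same comparison can be sketched with np.polyfit. This is a stand-in for illustration, not the method used in the answer above; the helper name fit_line and the fixed random seed are assumptions of this sketch, and the noisy numbers will not match the summary shown above.

```python
import numpy as np
import pandas as pd


def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept; returns (slope, intercept, r2).

    Hypothetical helper for this sketch, not part of pandas.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    r2 = 1.0 - resid.dot(resid) / ((y - y.mean()) ** 2).sum()
    return slope, intercept, r2


# Same four hourly timestamps as in the question (freq given as a Timedelta
# to avoid depending on a particular frequency alias spelling).
rng = pd.date_range("2011-01-01", periods=4, freq=pd.Timedelta(hours=1))

# Case 1: y = range(4) is an exact linear function of the hour, so the fit is perfect.
z = pd.Series(range(4), index=rng).reset_index()
slope, intercept, r2_exact = fit_line(z["index"].dt.hour, z[0])
print(slope, intercept, r2_exact)

# Case 2: random y gives an ordinary-looking fit with r-squared below 1.
noise = np.random.default_rng(0).standard_normal(4)
z_noisy = pd.Series(noise, index=rng).reset_index()
_, _, r2_noisy = fit_line(z_noisy["index"].dt.hour, z_noisy[0])
print(r2_noisy)
```

The first call should report a slope close to 1, an intercept close to 0 and an r-squared of essentially 1, which is the degenerate result described above.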

Alternatively, you could convert the index to an integer, although I found this didn't work very well. I'm assuming that's because the integers represent nanoseconds since the epoch or something like that, and hence are very large and cause precision issues. Converting to integer and dividing by a trillion or so did work, and gave essentially the same results as using dt.hour (i.e. the same r-squared):

pd.ols(x=pd.to_datetime(z["index"]).astype(int)/1e12, y=z[0])
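To make the size of those integers concrete, here is a small sketch of what the integer conversion produces. It assumes the timestamps are stored with nanosecond resolution, and it subtracts the first timestamp before rescaling, which is a variation on the divide-by-1e12 approach above rather than the same thing. The raw values are around 1.3e18, well past 2**53, the limit up to which float64 can represent every integer.

```python
import pandas as pd

rng = pd.date_range("2011-01-01", periods=4, freq=pd.Timedelta(hours=1))
z = pd.Series(range(4), index=rng).reset_index()

# Nanoseconds since the Unix epoch as 64-bit integers
# (assumes nanosecond-resolution timestamps).
ns = z["index"].astype("int64")
print(ns.iloc[0])          # on the order of 1.3e18
print(ns.iloc[0] > 2**53)  # beyond the range where float64 holds every integer

# Variation on the answer's rescaling: measure time from the first observation
# and express it in hours, so the predictor is small and well conditioned.
NS_PER_HOUR = 3600 * 10**9
elapsed_hours = (ns - ns.iloc[0]) / NS_PER_HOUR
print(elapsed_hours.tolist())
```

Because elapsed hours is just a shifted and rescaled copy of the raw integers, a straight-line fit on it has the same r-squared as one on dt.hour for this data; only the slope and intercept change units.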

Source of the error message

FWIW, it looks like that error message is coming from something like this:

pd.to_datetime(z["index"]).astype(float)

Although a fairly obvious workaround is this:

pd.to_datetime(z["index"]).astype(int).astype(float)
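A self-contained way to check both behaviours is sketched below. The variable names are illustrative, and whether the direct cast raises, and with what message, may depend on the pandas version, so the sketch records the outcome instead of assuming it. It uses "int64" explicitly instead of int so the intermediate integer width does not depend on the platform.

```python
import pandas as pd

rng = pd.date_range("2011-01-01", periods=4, freq=pd.Timedelta(hours=1))
stamps = pd.Series(rng)

# Direct datetime -> float cast: expected to raise TypeError, as in the question,
# but the outcome is recorded here instead of being assumed.
try:
    stamps.astype(float)
    direct_cast_failed = False
except TypeError as exc:
    direct_cast_failed = True
    print("direct cast failed:", exc)

# Two-step workaround: datetime -> int64 (epoch nanoseconds) -> float64.
as_float = stamps.astype("int64").astype(float)
print(as_float.dtype)
print(as_float.iloc[1] - as_float.iloc[0])  # one hour, in nanoseconds
```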
