Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the original URL, and attribute it to the original authors (not me): Stack Overflow
Original: http://stackoverflow.com/questions/30425490/
Linear Regression from Time Series Pandas
Asked by canyon289
I would like to run a regression with a time series as a predictor. I'm trying to follow the answer given in this SO question (OLS with pandas: datetime index as predictor), but as far as I can tell it no longer works.
Am I missing something or is there a new way to do this?
import pandas as pd
rng = pd.date_range('1/1/2011', periods=4, freq='H')
s = pd.Series(range(4), index = rng)
z = s.reset_index()
pd.ols(x=z["index"], y=z[0])
I'm getting the error below. It is self-explanatory, but I'm wondering what I'm missing in reimplementing a solution that worked before.
TypeError: cannot astype a datetimelike from [datetime64[ns]] to [float64]
Accepted answer by JohnE
I'm not sure why pd.ols is so picky there (it does appear to me that you followed the example correctly). I suspect this is due to changes in how pandas handles or stores datetime indexes, but I'm too lazy to explore this further. Anyway, since your datetime variable differs only in the hour, you could just extract the hour with the dt accessor:
pd.ols(x=pd.to_datetime(z["index"]).dt.hour, y=z[0])
However, that gives you an r-squared of 1, since y is an exact linear function of x and the model (with its intercept) fits it perfectly. You could change range to np.random.randn, and then you'd get something that looks like normal regression results.
In [6]: z = pd.Series(np.random.randn(4), index = rng).reset_index()
pd.ols(x=pd.to_datetime(z["index"]).dt.hour, y=z[0])
Out[6]:
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <x> + <intercept>
Number of Observations: 4
Number of Degrees of Freedom: 2
R-squared: 0.7743
Adj R-squared: 0.6615
Rmse: 0.5156
F-stat (1, 2): 6.8626, p-value: 0.1200
Degrees of Freedom: model 1, resid 2
-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x    -0.6040     0.2306      -2.62     0.1200    -1.0560    -0.1521
     intercept     0.2915     0.4314       0.68     0.5689    -0.5540     1.1370
---------------------------------End of Summary---------------------------------
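Note that pd.ols is not available in recent pandas releases (it was deprecated and later removed), so the call above will not run on a current install. As a rough sketch of the same kind of fit, assuming only NumPy and pandas are installed, np.polyfit with degree 1 returns the slope and intercept, and r-squared can be computed by hand. The helper name fit_line and the random seed are illustrative and not part of the original answer; because the data is random, the numbers will not match the summary above.

```python
import numpy as np
import pandas as pd

def fit_line(x, y):
    """OLS fit of y = slope * x + intercept (hypothetical helper, not a pandas API)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)   # degree-1 least-squares fit
    resid = y - (slope * x + intercept)
    ss_res = np.sum(resid ** 2)               # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)      # total sum of squares
    return slope, intercept, 1.0 - ss_res / ss_tot

# Four hourly timestamps, as in the question (freq given as a Timedelta).
rng = pd.date_range('2011-01-01', periods=4, freq=pd.Timedelta(hours=1))
noise = np.random.default_rng(0).standard_normal(4)
z = pd.Series(noise, index=rng).reset_index()

# Use the hour of day as the numeric predictor, as in the answer.
slope, intercept, r2 = fit_line(z["index"].dt.hour, z[0])
print(slope, intercept, r2)
```

With y = range(4) instead of random noise, the same helper returns an r-squared of 1 (up to rounding), which matches the behaviour described above.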
Alternatively, you could convert the index to an integer, although I found this didn't work very well. I'm assuming that's because the integers represent nanoseconds since the epoch (or something like that), and hence are very large and cause precision issues. Converting to integer and dividing by a trillion or so did work, and gave essentially the same results as using dt.hour (i.e. the same r-squared):
pd.ols(x=pd.to_datetime(z["index"]).astype(int)/1e12, y=z[0])
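If the timestamps span more than one day, dt.hour wraps around at midnight, and raw epoch integers stay very large even after scaling. One possible alternative, which is not from the original answer, is to use the time elapsed since the first observation as the predictor. The sketch below assumes the input can be parsed by pd.to_datetime; elapsed_hours is a hypothetical helper name.

```python
import pandas as pd

def elapsed_hours(timestamps):
    """Hours elapsed since the first timestamp, as floats (hypothetical helper)."""
    t = pd.to_datetime(pd.Series(timestamps)).reset_index(drop=True)
    # Dividing a timedelta Series by a Timedelta yields plain floats.
    return (t - t.iloc[0]) / pd.Timedelta(hours=1)

one_hour = pd.Timedelta(hours=1)
same_day = pd.date_range('2011-01-01 00:00', periods=4, freq=one_hour)
across_midnight = pd.date_range('2011-01-01 22:00', periods=4, freq=one_hour)

x_same = elapsed_hours(same_day)         # matches dt.hour for this range
x_wrap = elapsed_hours(across_midnight)  # dt.hour would give 22, 23, 0, 1 here
print(x_same.tolist(), x_wrap.tolist())
```

Since this is only a shift and rescaling of the time axis, it should change the slope and intercept but not the r-squared of a straight-line fit.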
Source of the error message
FWIW, it looks like that error message is coming from something like this:
pd.to_datetime(z["index"]).astype(float)
Although a fairly obvious workaround is this:
pd.to_datetime(z["index"]).astype(int).astype(float)

