Pandas/Statsmodel OLS predicting future values
Original question: http://stackoverflow.com/questions/25514220/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Asked by pythonista
I've been trying to get a prediction for future values in a model I've created. I have tried both OLS in pandas and statsmodels. Here is what I have in statsmodels:
import statsmodels.api as sm
endog = pd.DataFrame(dframe['monthly_data_smoothed8'])
smresults = sm.OLS(dframe['monthly_data_smoothed8'], dframe['date_delta']).fit()
sm_pred = smresults.predict(endog)
sm_pred
The length of the array returned is equal to the number of records in my original dataframe but the values are not the same. When I do the following using pandas I get no values returned.
from pandas.stats.api import ols
res1 = ols(y=dframe['monthly_data_smoothed8'], x=dframe['date_delta'])
res1.predict
(Note that there is no .fit function for OLS in pandas.) Could somebody shed some light on how I might get future predictions from my OLS model in either pandas or statsmodels? I realize I must not be using .predict properly, and I've read about several other problems people have had, but they do not seem to apply to my case.
Edit: I believe 'endog' as defined is incorrect. I should be passing the values for which I want to predict, so I've created a date range of 12 periods past the last recorded value. I am still missing something, though, because I get the error:
matrices are not aligned
Edit: here is a snippet of the data. The last column of numbers is the date delta, which is the difference in months from the first date:
month monthly_data monthly_data_smoothed5 monthly_data_smoothed8 monthly_data_smoothed12 monthly_data_smoothed3 date_delta
0 2011-01-31 3.711838e+11 3.711838e+11 3.711838e+11 3.711838e+11 3.711838e+11 0.000000
1 2011-02-28 3.776706e+11 3.750759e+11 3.748327e+11 3.746975e+11 3.755084e+11 0.919937
2 2011-03-31 4.547079e+11 4.127964e+11 4.083554e+11 4.059256e+11 4.207653e+11 1.938438
3 2011-04-30 4.688370e+11 4.360748e+11 4.295531e+11 4.257843e+11 4.464035e+11 2.924085
Answered by chrisb
I think your issue here is that statsmodels doesn't add an intercept by default, so your model doesn't achieve much of a fit. To fix that in your code, you could do something like this:
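import pandas as pd
import statsmodels.api as sm

dframe = pd.read_clipboard()  # your sample data
dframe['intercept'] = 1  # constant column that acts as the intercept term
X = dframe[['intercept', 'date_delta']]
y = dframe['monthly_data_smoothed8']
smresults = sm.OLS(y, X).fit()
dframe['pred'] = smresults.predict()  # in-sample fitted values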
Also, for what it's worth, I think the statsmodels formula API is much nicer to work with when dealing with DataFrames, and it adds an intercept by default (add - 1 to the formula to remove it). See below; it should give the same answer.
import statsmodels.formula.api as smf
smresults = smf.ols('monthly_data_smoothed8 ~ date_delta', dframe).fit()
dframe['pred'] = smresults.predict()
Edit:
To predict future values, just pass new data to .predict(). For example, using the first model:
In [165]: smresults.predict(pd.DataFrame({'intercept': 1,
'date_delta': [0.5, 0.75, 1.0]}))
Out[165]: array([ 2.03927604e+11, 2.95182280e+11, 3.86436955e+11])
On the intercept: there's nothing encoded in the number 1. It's just based on the math of OLS (an intercept is perfectly analogous to a regressor that always equals 1), so you can pull the value right off the summary. Looking at the statsmodels docs, an alternative way to add an intercept would be:
X = sm.add_constant(X)

