Python StatsModels 的置信区间和预测区间
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/17559408/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
confidence and prediction intervals with StatsModels
提问by F.N.B
I do this linear regression
with StatsModels
:
我这样做linear regression
有StatsModels
:
import numpy as np
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std
n = 100
x = np.linspace(0, 10, n)
e = np.random.normal(size=n)
y = 1 + 0.5*x + 2*e
X = sm.add_constant(x)
re = sm.OLS(y, X).fit()
print(re.summary())
prstd, iv_l, iv_u = wls_prediction_std(re)
My questions are, iv_l
and iv_u
are the upper and lower confidence intervalsor prediction intervals?
我的问题是,iv_l
和iv_u
为上,下置信区间或预测区间?
How I get others?
我怎么得到别人?
I need the confidence and prediction intervals for all points, to do a plot.
我需要所有点的置信区间和预测区间来绘制图表。
采纳答案by Josef
updatesee the second answerwhich is more recent. Some of the models and results classes have now a get_prediction
method that provides additional information including prediction intervals and/or confidence intervals for the predicted mean.
更新请参阅更新的第二个答案。某些模型和结果类现在具有一种get_prediction
方法,可提供附加信息,包括预测均值的预测区间和/或置信区间。
old answer:
旧答案:
iv_l
and iv_u
give you the limits of the prediction interval for each point.
iv_l
并iv_u
为您提供每个点的预测区间的限制。
Prediction interval is the confidence interval for an observation and includes the estimate of the error.
预测区间是观测值的置信区间,包括误差估计值。
I think, confidence interval for the mean prediction is not yet available in statsmodels
.
(Actually, the confidence interval for the fitted values is hiding inside the summary_table of influence_outlier, but I need to verify this.)
我认为,平均预测的置信区间在statsmodels
. (实际上,拟合值的置信区间隐藏在influence_outlier的summary_table中,但我需要验证这一点。)
Proper prediction methods for statsmodels are on the TODO list.
统计模型的正确预测方法在 TODO 列表中。
Addition
添加
Confidence intervals are there for OLS but the access is a bit clumsy.
OLS 有置信区间,但访问有点笨拙。
To be included after running your script:
在运行脚本后包含:
from statsmodels.stats.outliers_influence import summary_table
st, data, ss2 = summary_table(re, alpha=0.05)
fittedvalues = data[:, 2]
predict_mean_se = data[:, 3]
predict_mean_ci_low, predict_mean_ci_upp = data[:, 4:6].T
predict_ci_low, predict_ci_upp = data[:, 6:8].T
# Check we got the right things
print np.max(np.abs(re.fittedvalues - fittedvalues))
print np.max(np.abs(iv_l - predict_ci_low))
print np.max(np.abs(iv_u - predict_ci_upp))
plt.plot(x, y, 'o')
plt.plot(x, fittedvalues, '-', lw=2)
plt.plot(x, predict_ci_low, 'r--', lw=2)
plt.plot(x, predict_ci_upp, 'r--', lw=2)
plt.plot(x, predict_mean_ci_low, 'r--', lw=2)
plt.plot(x, predict_mean_ci_upp, 'r--', lw=2)
plt.show()
This should give the same results as SAS, http://jpktd.blogspot.ca/2012/01/nice-thing-about-seeing-zeros.html
这应该给出与 SAS 相同的结果,http://jpktd.blogspot.ca/2012/01/nice-thing-about-seeing-zeros.html
回答by Shahe Jokarian
You can get the prediction intervals by using LRPI() class from the Ipython notebook in my repo (https://github.com/shahejokarian/regression-prediction-interval).
您可以在我的存储库 ( https://github.com/shahejokarian/regression-prediction-interval) 中使用 Ipython 笔记本中的 LRPI() 类来获取预测间隔。
You need to set the t value to get the desired confidence interval for the prediction values, otherwise the default is 95% conf. interval.
您需要设置 t 值以获得预测值所需的置信区间,否则默认为 95% conf。间隔。
The LRPI class uses sklearn.linear_model's LinearRegression , numpy and pandas libraries.
LRPI 类使用 sklearn.linear_model 的 LinearRegression 、 numpy 和 pandas 库。
There is an example shown in the notebook too.
笔记本中也显示了一个示例。
回答by Julius
For test data you can try to use the following.
对于测试数据,您可以尝试使用以下内容。
predictions = result.get_prediction(out_of_sample_df)
predictions.summary_frame(alpha=0.05)
I found the summary_frame() method buried hereand you can find the get_prediction() method here. You can change the significance level of the confidence interval and prediction interval by modifying the "alpha" parameter.
我发现summary_frame()方法埋在这里,你可以找到get_prediction()方法在这里。您可以通过修改“alpha”参数来更改置信区间和预测区间的显着性水平。
I am posting this here because this was the first post that comes up when looking for a solution for confidence & prediction intervals – even though this concerns itself with test data rather.
我在这里发布这个是因为这是在寻找置信和预测区间的解决方案时出现的第一篇文章——尽管这与测试数据有关。
Here's a function to take a model, new data, and an arbitrary quantile, using this approach:
这是一个使用这种方法获取模型、新数据和任意分位数的函数:
def ols_quantile(m, X, q):
# m: OLS model.
# X: X matrix.
# q: Quantile.
#
# Set alpha based on q.
a = q * 2
if q > 0.5:
a = 2 * (1 - q)
predictions = m.get_prediction(X)
frame = predictions.summary_frame(alpha=a)
if q > 0.5:
return frame.obs_ci_upper
return frame.obs_ci_lower
回答by Max Ghenis
summary_frame
and summary_table
work well when you need exact results for a single quantile, but don't vectorize well. This will provide a normal approximation of the prediction interval (not confidence interval) and works for a vector of quantiles:
summary_frame
并且summary_table
当您需要单个分位数的精确结果时效果很好,但不能很好地矢量化。这将提供预测区间(不是置信区间)的正常近似值,并适用于分位数向量:
def ols_quantile(m, X, q):
# m: Statsmodels OLS model.
# X: X matrix of data to predict.
# q: Quantile.
#
from scipy.stats import norm
mean_pred = m.predict(X)
se = np.sqrt(m.scale)
return mean_pred + norm.ppf(q) * se
回答by fabrica
You can calculate them based on results given by statsmodel and the normality assumptions.
您可以根据 statsmodel 给出的结果和正态性假设来计算它们。
Here is an example for OLS and CI for the mean value:
以下是 OLS 和 CI 平均值的示例:
import statsmodels.api as sm
import numpy as np
from scipy import stats
#Significance level:
sl = 0.05
#Evaluate mean value at a required point x0. Here, at the point (0.0,2.0) for N_model=2:
x0 = np.asarray([1.0, 0.0, 2.0])# If you have no constant in your model, remove the first 1.0. For more dimensions, add the desired values.
#Get an OLS model based on output y and the prepared vector X (as in your notation):
model = sm.OLS(endog = y, exog = X )
results = model.fit()
#Get two-tailed t-values:
(t_minus, t_plus) = stats.t.interval(alpha = (1.0 - sl), df = len(results.resid) - len(x0) )
y_value_at_x0 = np.dot(results.params, x0)
lower_bound = y_value_at_x0 + t_minus*np.sqrt(results.mse_resid*( np.dot(np.dot(x0.T,results.normalized_cov_params),x0) ))
upper_bound = y_value_at_x0 + t_plus*np.sqrt(results.mse_resid*( np.dot(np.dot(x0.T,results.normalized_cov_params),x0) ))
You can wrap a nice function around this with input results, point x0 and significance level sl.
你可以用输入结果、点 x0 和显着性水平 sl 围绕这个包裹一个很好的函数。
I am unsure now if you can use this for WLS() since there are extra things happening there.
我现在不确定您是否可以将它用于 WLS(),因为那里发生了额外的事情。
Ref: Ch3 in [D.C. Montgomery and E.A. Peck. “Introduction to Linear Regression Analysis.” 4th. Ed., Wiley, 1992].
参考:[DC Montgomery 和 EA Peck。“线性回归分析简介。” 4. 编辑,威利,1992]。
回答by Bryan Butler
With time series results, you get a much smoother plot using the get_forecast()
method. An example of time series is below:
通过时间序列结果,您可以使用该get_forecast()
方法获得更平滑的图。下面是一个时间序列的例子:
# Seasonal Arima Modeling, no exogenous variable
model = SARIMAX(train['MI'], order=(1,1,1), seasonal_order=(1,1,0,12), enforce_invertibility=True)
results = model.fit()
results.summary()
The next step is to make the predictions, this generates the confidence intervals.
下一步是进行预测,这会生成置信区间。
# make the predictions for 11 steps ahead
predictions_int = results.get_forecast(steps=11)
predictions_int.predicted_mean
These can be put in a data frame but need some cleaning up:
这些可以放在数据框中,但需要一些清理:
# get a better view
predictions_int.conf_int()
Concatenate the data frame, but clean up the headers
连接数据框,但清理标题
conf_df = pd.concat([test['MI'],predictions_int.predicted_mean, predictions_int.conf_int()], axis = 1)
conf_df.head()
Then we rename the columns.
然后我们重命名列。
conf_df = conf_df.rename(columns={0: 'Predictions', 'lower MI': 'Lower CI', 'upper MI': 'Upper CI'})
conf_df.head()
Make the plot.
制作情节。
# make a plot of model fit
# color = 'skyblue'
fig = plt.figure(figsize = (16,8))
ax1 = fig.add_subplot(111)
x = conf_df.index.values
upper = conf_df['Upper CI']
lower = conf_df['Lower CI']
conf_df['MI'].plot(color = 'blue', label = 'Actual')
conf_df['Predictions'].plot(color = 'orange',label = 'Predicted' )
upper.plot(color = 'grey', label = 'Upper CI')
lower.plot(color = 'grey', label = 'Lower CI')
# plot the legend for the first plot
plt.legend(loc = 'lower left', fontsize = 12)
# fill between the conf intervals
plt.fill_between(x, lower, upper, color='grey', alpha='0.2')
plt.ylim(1000,3500)
plt.show()