Python StatsModels 的置信区间和预测区间

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/17559408/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-19 08:32:07  来源:igfitidea点击:

confidence and prediction intervals with StatsModels

pythonstatisticsstatsmodels

提问by F.N.B

I do this linear regressionwith StatsModels:

我这样做linear regressionStatsModels

import numpy as np
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std

n = 100

x = np.linspace(0, 10, n)
e = np.random.normal(size=n)
y = 1 + 0.5*x + 2*e
X = sm.add_constant(x)

re = sm.OLS(y, X).fit()
print(re.summary())

prstd, iv_l, iv_u = wls_prediction_std(re)

My questions are, iv_land iv_uare the upper and lower confidence intervalsor prediction intervals?

我的问题是,iv_liv_u为上,下置信区间预测区间

How I get others?

我怎么得到别人?

I need the confidence and prediction intervals for all points, to do a plot.

我需要所有点的置信区间和预测区间来绘制图表。

采纳答案by Josef

updatesee the second answerwhich is more recent. Some of the models and results classes have now a get_predictionmethod that provides additional information including prediction intervals and/or confidence intervals for the predicted mean.

更新请参阅更新的第二个答案。某些模型和结果类现在具有一种get_prediction方法,可提供附加信息,包括预测均值的预测区间和/或置信区间。

old answer:

旧答案:

iv_land iv_ugive you the limits of the prediction interval for each point.

iv_liv_u为您提供每个点的预测区间的限制。

Prediction interval is the confidence interval for an observation and includes the estimate of the error.

预测区间是观测值的置信区间,包括误差估计值。

I think, confidence interval for the mean prediction is not yet available in statsmodels. (Actually, the confidence interval for the fitted values is hiding inside the summary_table of influence_outlier, but I need to verify this.)

我认为,平均预测的置信区间在statsmodels. (实际上,拟合值的置信区间隐藏在influence_outlier的summary_table中,但我需要验证这一点。)

Proper prediction methods for statsmodels are on the TODO list.

统计模型的正确预测方法在 TODO 列表中。

Addition

添加

Confidence intervals are there for OLS but the access is a bit clumsy.

OLS 有置信区间,但访问有点笨拙。

To be included after running your script:

在运行脚本后包含:

from statsmodels.stats.outliers_influence import summary_table

st, data, ss2 = summary_table(re, alpha=0.05)

fittedvalues = data[:, 2]
predict_mean_se  = data[:, 3]
predict_mean_ci_low, predict_mean_ci_upp = data[:, 4:6].T
predict_ci_low, predict_ci_upp = data[:, 6:8].T

# Check we got the right things
print np.max(np.abs(re.fittedvalues - fittedvalues))
print np.max(np.abs(iv_l - predict_ci_low))
print np.max(np.abs(iv_u - predict_ci_upp))

plt.plot(x, y, 'o')
plt.plot(x, fittedvalues, '-', lw=2)
plt.plot(x, predict_ci_low, 'r--', lw=2)
plt.plot(x, predict_ci_upp, 'r--', lw=2)
plt.plot(x, predict_mean_ci_low, 'r--', lw=2)
plt.plot(x, predict_mean_ci_upp, 'r--', lw=2)
plt.show()

enter image description here

在此处输入图片说明

This should give the same results as SAS, http://jpktd.blogspot.ca/2012/01/nice-thing-about-seeing-zeros.html

这应该给出与 SAS 相同的结果,http://jpktd.blogspot.ca/2012/01/nice-thing-about-seeing-zeros.html

回答by Shahe Jokarian

You can get the prediction intervals by using LRPI() class from the Ipython notebook in my repo (https://github.com/shahejokarian/regression-prediction-interval).

您可以在我的存储库 ( https://github.com/shahejokarian/regression-prediction-interval) 中使用 Ipython 笔记本中的 LRPI() 类来获取预测间隔。

You need to set the t value to get the desired confidence interval for the prediction values, otherwise the default is 95% conf. interval.

您需要设置 t 值以获得预测值所需的置信区间,否则默认为 95% conf。间隔。

The LRPI class uses sklearn.linear_model's LinearRegression , numpy and pandas libraries.

LRPI 类使用 sklearn.linear_model 的 LinearRegression 、 numpy 和 pandas 库。

There is an example shown in the notebook too.

笔记本中也显示了一个示例。

回答by Julius

For test data you can try to use the following.

对于测试数据,您可以尝试使用以下内容。

predictions = result.get_prediction(out_of_sample_df)
predictions.summary_frame(alpha=0.05)

I found the summary_frame() method buried hereand you can find the get_prediction() method here. You can change the significance level of the confidence interval and prediction interval by modifying the "alpha" parameter.

我发现summary_frame()方法埋在这里,你可以找到get_prediction()方法在这里。您可以通过修改“alpha”参数来更改置信区间和预测区间的显着性水平。

I am posting this here because this was the first post that comes up when looking for a solution for confidence & prediction intervals – even though this concerns itself with test data rather.

我在这里发布这个是因为这是在寻找置信和预测区间的解决方案时出现的第一篇文章——尽管这与测试数据有关。

Here's a function to take a model, new data, and an arbitrary quantile, using this approach:

这是一个使用这种方法获取模型、新数据和任意分位数的函数:

def ols_quantile(m, X, q):
  # m: OLS model.
  # X: X matrix.
  # q: Quantile.
  #
  # Set alpha based on q.
  a = q * 2
  if q > 0.5:
    a = 2 * (1 - q)
  predictions = m.get_prediction(X)
  frame = predictions.summary_frame(alpha=a)
  if q > 0.5:
    return frame.obs_ci_upper
  return frame.obs_ci_lower

回答by Max Ghenis

summary_frameand summary_tablework well when you need exact results for a single quantile, but don't vectorize well. This will provide a normal approximation of the prediction interval (not confidence interval) and works for a vector of quantiles:

summary_frame并且summary_table当您需要单个分位数的精确结果时效果很好,但不能很好地矢量化。这将提供预测区间(不是置信区间)的正常近似值,并适用于分位数向量:

def ols_quantile(m, X, q):
  # m: Statsmodels OLS model.
  # X: X matrix of data to predict.
  # q: Quantile.
  #
  from scipy.stats import norm
  mean_pred = m.predict(X)
  se = np.sqrt(m.scale)
  return mean_pred + norm.ppf(q) * se

回答by fabrica

You can calculate them based on results given by statsmodel and the normality assumptions.

您可以根据 statsmodel 给出的结果和正态性假设来计算它们。

Here is an example for OLS and CI for the mean value:

以下是 OLS 和 CI 平均值的示例:

import statsmodels.api as sm
import numpy as np
from scipy import stats

#Significance level:
sl = 0.05
#Evaluate mean value at a required point x0. Here, at the point (0.0,2.0) for N_model=2:
x0 = np.asarray([1.0, 0.0, 2.0])# If you have no constant in your model, remove the first 1.0. For more dimensions, add the desired values.

#Get an OLS model based on output y and the prepared vector X (as in your notation):
model = sm.OLS(endog = y, exog = X )
results = model.fit()
#Get two-tailed t-values:
(t_minus, t_plus) = stats.t.interval(alpha = (1.0 - sl), df =  len(results.resid) - len(x0) )
y_value_at_x0 = np.dot(results.params, x0)
lower_bound = y_value_at_x0 + t_minus*np.sqrt(results.mse_resid*( np.dot(np.dot(x0.T,results.normalized_cov_params),x0) ))
upper_bound = y_value_at_x0 +  t_plus*np.sqrt(results.mse_resid*( np.dot(np.dot(x0.T,results.normalized_cov_params),x0) ))

You can wrap a nice function around this with input results, point x0 and significance level sl.

你可以用输入结果、点 x0 和显着性水平 sl 围绕这个包裹一个很好的函数。

I am unsure now if you can use this for WLS() since there are extra things happening there.

我现在不确定您是否可以将它用于 WLS(),因为那里发生了额外的事情。

Ref: Ch3 in [D.C. Montgomery and E.A. Peck. “Introduction to Linear Regression Analysis.” 4th. Ed., Wiley, 1992].

参考:[DC Montgomery 和 EA Peck。“线性回归分析简介。” 4. 编辑,威利,1992]。

回答by Bryan Butler

With time series results, you get a much smoother plot using the get_forecast()method. An example of time series is below:

通过时间序列结果,您可以使用该get_forecast()方法获得更平滑的图。下面是一个时间序列的例子:

# Seasonal Arima Modeling, no exogenous variable
model = SARIMAX(train['MI'], order=(1,1,1), seasonal_order=(1,1,0,12), enforce_invertibility=True)

results = model.fit()

results.summary()

enter image description here

在此处输入图片说明

The next step is to make the predictions, this generates the confidence intervals.

下一步是进行预测,这会生成置信区间。

# make the predictions for 11 steps ahead
predictions_int = results.get_forecast(steps=11)
predictions_int.predicted_mean

enter image description here

在此处输入图片说明

These can be put in a data frame but need some cleaning up:

这些可以放在数据框中,但需要一些清理:

# get a better view
predictions_int.conf_int()

enter image description here

在此处输入图片说明

Concatenate the data frame, but clean up the headers

连接数据框,但清理标题

conf_df = pd.concat([test['MI'],predictions_int.predicted_mean, predictions_int.conf_int()], axis = 1)

conf_df.head()

enter image description here

在此处输入图片说明

Then we rename the columns.

然后我们重命名列。

conf_df = conf_df.rename(columns={0: 'Predictions', 'lower MI': 'Lower CI', 'upper MI': 'Upper CI'})
conf_df.head()

enter image description here

在此处输入图片说明

Make the plot.

制作情节。

# make a plot of model fit
# color = 'skyblue'

fig = plt.figure(figsize = (16,8))
ax1 = fig.add_subplot(111)


x = conf_df.index.values


upper = conf_df['Upper CI']
lower = conf_df['Lower CI']

conf_df['MI'].plot(color = 'blue', label = 'Actual')
conf_df['Predictions'].plot(color = 'orange',label = 'Predicted' )
upper.plot(color = 'grey', label = 'Upper CI')
lower.plot(color = 'grey', label = 'Lower CI')

# plot the legend for the first plot
plt.legend(loc = 'lower left', fontsize = 12)


# fill between the conf intervals
plt.fill_between(x, lower, upper, color='grey', alpha='0.2')

plt.ylim(1000,3500)

plt.show()

enter image description here

在此处输入图片说明