Stepwise Regression in Python
Original question: http://stackoverflow.com/questions/15433372/
Notice: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute the content to the original authors on Stack Overflow (not to this site).
Asked by user2174063
How can I perform stepwise regression in Python? SciPy has methods for OLS, but I am not able to do stepwise selection with them. Any help would be much appreciated. Thanks.
Edit: I am trying to build a linear regression model. I have 5 independent variables and, using forward stepwise regression, I aim to select variables such that my model has the lowest p-value. The following link explains the objective:
Thanks again.
Answer by Matti Pastell
Statsmodels has additional methods for regression: http://statsmodels.sourceforge.net/devel/examples/generated/example_ols.html. I think it will help you implement stepwise regression.
Answer by Aaron Schumacher
Trevor Smith and I wrote a little forward selection function for linear regression with statsmodels: http://planspace.org/20150423-forward_selection_with_statsmodels/ You could easily modify it to minimize a p-value, or, with just a little more work, to select based on beta p-values.
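The following is a minimal sketch of greedy forward selection with the statsmodels formula API, scored by adjusted R-squared. It is not the code from the linked post; the function name forward_select_adj_r2, the scoring choice, and the synthetic data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def forward_select_adj_r2(data, response):
    """Hypothetical helper: add one predictor at a time while adjusted R-squared improves."""
    remaining = [c for c in data.columns if c != response]
    selected = []
    best_score = -np.inf
    while remaining:
        # Score every candidate when added to the predictors chosen so far.
        scores = []
        for cand in remaining:
            formula = f"{response} ~ " + " + ".join(selected + [cand])
            scores.append((smf.ols(formula, data).fit().rsquared_adj, cand))
        score, cand = max(scores)
        if score <= best_score:
            break  # no candidate improves the score; stop
        best_score = score
        selected.append(cand)
        remaining.remove(cand)
    return selected

# Synthetic data (made up): only x1 and x2 influence y.
rng = np.random.default_rng(1)
data = pd.DataFrame(rng.normal(size=(200, 4)), columns=["x1", "x2", "x3", "x4"])
data["y"] = 3.0 * data["x1"] + 2.0 * data["x2"] + rng.normal(size=200)

selected = forward_select_adj_r2(data, "y")
print(selected)
```

To select on p-values instead, the score could be replaced by the candidate's entry in the fitted model's pvalues, keeping the candidate with the smallest value.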
Answer by Varun-08
# Import the statsmodels API (in recent statsmodels versions OLS is exposed by statsmodels.api, not statsmodels.formula.api)
import statsmodels.api as sm
# X_opt holds all the columns of the independent-variable matrix X;
# in this case we have 5 independent variables.
# Note: sm.OLS does not add an intercept, so X must already contain a column of ones if you want one.
X_opt = X[:, [0, 1, 2, 3, 4]]
# Run OLS on X_opt and store the fitted results in regressor_OLS
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
print(regressor_OLS.summary())
In the output of the summary method, the p-value of each variable appears in the column labelled 'P>|t|'. Find the variable with the highest p-value. Suppose x3 has the highest value, e.g. 0.956. Then remove this column from your array and repeat all the steps.
X_opt = X[:, [0, 1, 3, 4]]
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
print(regressor_OLS.summary())
Repeat these steps until you have removed every column whose p-value is higher than the significance level (e.g. 0.05). In the end, X_opt will contain only the variables whose p-values are below the significance level.
Answer by David Dale
You can do forward-backward selection based on the statsmodels.api.OLS model, as shown in this answer.
However, this answer describes why you should not use stepwise selection for econometric models in the first place.
Answer by Regi Mathew
You may try mlxtend, which offers various selection methods.
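from mlxtend.feature_selection import SequentialFeatureSelector as sfs
from sklearn.linear_model import LinearRegression
clf = LinearRegression()
# Build step forward feature selection
# (k_features must not exceed the number of columns in X_train)
sfs1 = sfs(clf, k_features=10, forward=True, floating=False, scoring='r2', cv=5)
# Fit the selector; with floating=False this is plain sequential forward selection (SFS), not SFFS
sfs1 = sfs1.fit(X_train, y_train)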
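As a point of comparison (this is not part of the answer above), scikit-learn 0.24 and later includes its own SequentialFeatureSelector with a similar but not identical interface. The sketch below uses that class rather than mlxtend, on made-up data with 5 columns, and assumes scikit-learn is installed.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data (made up): only columns 0 and 3 influence the response.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))
y_train = 3.0 * X_train[:, 0] - 2.0 * X_train[:, 3] + rng.normal(size=200)

# Forward selection of 2 features, scored by cross-validated R-squared.
selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,
    direction="forward",
    scoring="r2",
    cv=5,
)
selector.fit(X_train, y_train)

# get_support() returns a boolean mask over the columns of X_train.
chosen = np.flatnonzero(selector.get_support())
print(chosen)
```

Both selectors score candidates by cross-validation rather than by p-values, so they answer a slightly different question from the p-value-based approaches in the other answers.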
Answer by Jacob Helwig
Here's a method I just wrote that uses "mixed selection" as described in Introduction to Statistical Learning. As input, it takes:
lm - a fitted statsmodels OLS result, sm.OLS(Y, X).fit(), where X is an array of n ones (n is the number of data points) and Y is the response in the training data
curr_preds - a list containing only 'const', i.e. ['const']
potential_preds - a list of all potential predictors. There also needs to be a pandas DataFrame X_mix that holds all of the data, including 'const' and a column for each potential predictor
tol - optional. The maximum p-value; .05 if not specified
import statsmodels.api as sm
def mixed_selection(lm, curr_preds, potential_preds, tol=.05):
    # Note: y and X_mix are read from the enclosing (global) scope
    while len(potential_preds) > 0:
        index_best = -1  # this will record the index of the best predictor
        curr = -1  # this will record the current index
        best_r_squared = lm.rsquared_adj  # adjusted R-squared of the current model
        # loop to determine if any of the predictors can improve the adjusted R-squared
        for pred in potential_preds:
            curr += 1  # increment current
            preds = curr_preds.copy()  # grab the current predictors
            preds.append(pred)
            # fit a model with the current predictors plus one additional potential predictor
            lm_new = sm.OLS(y, X_mix[preds]).fit()
            new_r_sq = lm_new.rsquared_adj  # adjusted R-squared of the new model
            if new_r_sq > best_r_squared:
                best_r_squared = new_r_sq
                index_best = curr
        if index_best != -1:
            # a potential predictor improved the adjusted R-squared;
            # remove it from potential_preds and add it to curr_preds
            curr_preds.append(potential_preds.pop(index_best))
        else:
            # none of the remaining potential predictors improved the adjusted R-squared; exit loop
            break
        # fit a new lm using the new predictors, look at the p-values
        pvals = sm.OLS(y, X_mix[curr_preds]).fit().pvalues
        pval_too_big = []
        # make a list of all the features whose p-values are greater than the tolerance
        for feat in pvals.index:
            if pvals[feat] > tol and feat != 'const':
                pval_too_big.append(feat)
        # now remove all the features from curr_preds that have a p-value that is too large
        for feat in pval_too_big:
            pop_index = curr_preds.index(feat)
            curr_preds.pop(pop_index)
        # refit so that the next pass compares candidates against the current model
        lm = sm.OLS(y, X_mix[curr_preds]).fit()
    return curr_preds
Answer by HE Xin
I developed this repository: https://github.com/xinhe97/StepwiseSelectionOLS
My stepwise selection classes (best subset, forward stepwise, backward stepwise) are compatible with sklearn. You can use Pipeline and GridSearchCV with my classes.
The essential part of my code is as follows:
# Excerpt: the functions below are methods of the selection class and need these imports
import pandas as pd
import statsmodels.api as sm
################### Criteria ###################
def processSubset(self, X, y, feature_index):
    # Fit a model on the given feature subset and compute its scores
    regr = sm.OLS(y, X[:, feature_index]).fit()
    rsq_adj = regr.rsquared_adj
    bic = self.myBic(X.shape[0], regr.mse_resid, len(feature_index))
    rsq = regr.rsquared
    return {"model": regr, "rsq_adj": rsq_adj, "bic": bic, "rsq": rsq, "predictors_index": feature_index}
################### Forward Stepwise ###################
def forward(self, predictors_index, X, y):
    # Pull out predictors we still need to process
    remaining_predictors_index = [p for p in range(X.shape[1])
                                  if p not in predictors_index]
    results = []
    for p in remaining_predictors_index:
        new_predictors_index = predictors_index + [p]
        new_predictors_index.sort()
        results.append(self.processSubset(X, y, new_predictors_index))
    # Wrap everything up in a nice dataframe
    models = pd.DataFrame(results)
    # Choose the model with the highest rsq (the commented-out line selects by lowest BIC instead)
    # best_model = models.loc[models['bic'].idxmin()]
    best_model = models.loc[models['rsq'].idxmax()]
    # Return the best model, along with the model's other information
    return best_model
def forwardK(self, X_est, y_est, fK):
    models_fwd = pd.DataFrame(columns=["model", "rsq_adj", "bic", "rsq", "predictors_index"])
    predictors_index = []
    M = min(fK, X_est.shape[1])
    for i in range(1, M + 1):
        print(i)
        models_fwd.loc[i] = self.forward(predictors_index, X_est, y_est)
        predictors_index = models_fwd.loc[i, 'predictors_index']
    print(models_fwd)
    # Plain R-squared never decreases as predictors are added, so selecting by 'rsq'
    # favors the largest model; the commented-out lines select by lowest BIC instead
    # best_model_fwd = models_fwd.loc[models_fwd['bic'].idxmin(),'model']
    best_model_fwd = models_fwd.loc[models_fwd['rsq'].idxmax(), 'model']
    # best_predictors = models_fwd.loc[models_fwd['bic'].idxmin(),'predictors_index']
    best_predictors = models_fwd.loc[models_fwd['rsq'].idxmax(), 'predictors_index']
    return best_model_fwd, best_predictors

