Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/15433372/

Stepwise Regression in Python

Tags: python, scipy, regression

Asked by user2174063

How to perform stepwise regression in python? There are methods for OLS in SciPy, but I am not able to do stepwise regression. Any help in this regard would be greatly appreciated. Thanks.

Edit: I am trying to build a linear regression model. I have 5 independent variables, and using forward stepwise regression, I aim to select variables such that my model has the lowest p-value. The following link explains the objective:

https://www.google.co.in/url?sa=t&rct=j&q=&esrc=s&source=web&cd=4&ved=0CEAQFjAD&url=http%3A%2F%2Fbusiness.fullerton.edu%2Fisds%2Fjlawrence%2FStat-On-Line%2FExcel%2520Notes%2FExcel%2520Notes%2520-%2520STEPWISE%2520REGRESSION.doc&ei=YjKsUZzXHoPwrQfGs4GQCg&usg=AFQjCNGDaQ7qRhyBaQCmLeO4OD2RVkUhzw&bvm=bv.47244034,d.bmk

Thanks again.

Answer by Matti Pastell

Statsmodels has additional methods for regression: http://statsmodels.sourceforge.net/devel/examples/generated/example_ols.html. I think it will help you to implement stepwise regression.

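For reference, a minimal OLS fit with statsmodels might look like the following (the data here is synthetic, just to keep the snippet self-contained); the per-coefficient p-values it reports are what a stepwise procedure acts on:

import numpy as np
import statsmodels.api as sm

# Synthetic data: y depends only on the first two of five regressors
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=100)

X_const = sm.add_constant(X)      # add an intercept column
model = sm.OLS(y, X_const).fit()  # ordinary least squares fit

print(model.summary())            # full regression table
print(model.pvalues)              # per-coefficient p-values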

Answer by Aaron Schumacher

Trevor Smith and I wrote a little forward selection function for linear regression with statsmodels: http://planspace.org/20150423-forward_selection_with_statsmodels/ You could easily modify it to minimize a p-value, or select based on beta p-values, with just a little more work.

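The core of that post is roughly the following (a sketch in the same spirit, not the exact code from the link): at each step, fit a model with each remaining candidate added, keep the candidate that most improves adjusted R-squared, and stop when no candidate helps. It assumes a pandas DataFrame data whose columns include the response column named by response:

import statsmodels.formula.api as smf

def forward_selection(data, response):
    # Forward selection on adjusted R-squared using statsmodels formulas
    remaining = set(data.columns) - {response}
    selected = []
    best_score = 0.0
    while remaining:
        scores = []
        for candidate in remaining:
            formula = "{} ~ {}".format(response, " + ".join(selected + [candidate]))
            scores.append((smf.ols(formula, data).fit().rsquared_adj, candidate))
        score, candidate = max(scores)
        if score <= best_score:
            break  # no remaining candidate improves the fit
        remaining.remove(candidate)
        selected.append(candidate)
        best_score = score
    formula = "{} ~ {}".format(response, " + ".join(selected) or "1")
    return smf.ols(formula, data).fit()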

Answer by Varun-08

"""Importing the api class from statsmodels"""
import statsmodels.formula.api as sm

"""X_opt variable has all the columns of independent variables of matrix X 
in this case we have 5 independent variables"""
X_opt = X[:,[0,1,2,3,4]]

"""Running the OLS method on X_opt and storing results in regressor_OLS"""
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()

Using the summary method, you can check in your kernel the p-values of your variables, listed as 'P>|t|'. Then find the variable with the highest p-value. Suppose x3 has the highest value, e.g. 0.956. Then remove this column from your array and repeat all the steps.

X_opt = X[:, [0, 1, 3, 4]]  # the column for x3 (index 2) has been removed
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()

Repeat these steps until you have removed all the columns whose p-value is above the significance level (e.g. 0.05). In the end, your variable X_opt will contain only the optimal variables, all with p-values below the significance level.

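The delete-and-refit cycle can also be automated. A minimal sketch, assuming X is a NumPy array that already contains the column of ones for the intercept:

import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, sl=0.05):
    # Repeatedly drop the column with the highest p-value until all are below sl
    X_opt = X.copy()
    while True:
        regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
        pvalues = np.asarray(regressor_OLS.pvalues)
        worst = pvalues.argmax()
        if pvalues[worst] <= sl:
            return regressor_OLS, X_opt
        X_opt = np.delete(X_opt, worst, axis=1)  # drop the least significant column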

Answer by David Dale

You can make forward-backward selection based on the statsmodels.api.OLS model, as shown in this answer.

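In outline, that approach alternates a forward step (add the excluded variable with the smallest p-value, if below a threshold) with a backward step (drop any included variable whose p-value has risen above a second threshold). A condensed sketch of the idea, assuming a pandas DataFrame X and a Series y (the thresholds are illustrative):

import statsmodels.api as sm

def stepwise_selection(X, y, threshold_in=0.01, threshold_out=0.05):
    included = []
    while True:
        changed = False
        # forward step: add the most significant excluded variable
        excluded = [c for c in X.columns if c not in included]
        pvals = {c: sm.OLS(y, sm.add_constant(X[included + [c]])).fit().pvalues[c]
                 for c in excluded}
        if pvals:
            best = min(pvals, key=pvals.get)
            if pvals[best] < threshold_in:
                included.append(best)
                changed = True
        # backward step: drop the least significant included variable
        if included:
            pvalues = sm.OLS(y, sm.add_constant(X[included])).fit().pvalues.drop('const')
            if pvalues.max() > threshold_out:
                included.remove(pvalues.idxmax())
                changed = True
        if not changed:
            return included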

However, this answer describes why you should not use stepwise selection for econometric models in the first place.

Answer by Regi Mathew

You may try mlxtend, which offers a variety of feature selection methods.

from mlxtend.feature_selection import SequentialFeatureSelector as sfs
from sklearn.linear_model import LinearRegression

clf = LinearRegression()

# Build the step-forward feature selector
sfs1 = sfs(clf, k_features=10, forward=True, floating=False, scoring='r2', cv=5)

# Perform sequential forward selection
sfs1 = sfs1.fit(X_train, y_train)
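After fitting, the selector exposes which columns were chosen and how well they scored (attribute names per the mlxtend API):

print(sfs1.k_feature_idx_)   # indices of the selected features
print(sfs1.k_score_)         # cross-validated score of the selected subset
X_train_sfs = sfs1.transform(X_train)  # keep only the selected columns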

Answer by Jacob Helwig

Here's a method I just wrote that uses "mixed selection" as described in Introduction to Statistical Learning. As input, it takes:

  • lm, a fitted statsmodels OLS model, sm.OLS(Y, X).fit(), where X starts as an array of n ones (n being the number of data points) and Y is the response in the training data

  • curr_preds - a list containing just ['const']

  • potential_preds - a list of all potential predictors. There also needs to be a pandas DataFrame X_mix that holds all of the data, including 'const' and the columns for every potential predictor

  • tol, optional - the maximum p-value, 0.05 if not specified


def mixed_selection(lm, curr_preds, potential_preds, tol=.05):
  while len(potential_preds) > 0:
    index_best = -1 # this will record the index of the best predictor
    curr = -1 # this will record the current index
    best_r_squared = lm.rsquared_adj # record the adjusted r-squared of the current model
    # loop to determine whether any of the predictors can better the r-squared
    for pred in potential_preds:
      curr += 1 # increment current
      preds = curr_preds.copy() # grab the current predictors
      preds.append(pred)
      lm_new = sm.OLS(y, X_mix[preds]).fit() # create a model with the current predictors plus one additional potential predictor
      new_r_sq = lm_new.rsquared_adj # record the adjusted r-squared of the new model
      if new_r_sq > best_r_squared:
        best_r_squared = new_r_sq
        index_best = curr

    if index_best != -1: # a potential predictor improved the r-squared; remove it from potential_preds and add it to curr_preds
      curr_preds.append(potential_preds.pop(index_best))
    else: # none of the remaining potential predictors improved the adjusted r-squared; exit the loop
      break

    # fit a new lm using the new predictors and look at the p-values
    pvals = sm.OLS(y, X_mix[curr_preds]).fit().pvalues
    pval_too_big = []
    # make a list of all the p-values that are greater than the tolerance
    for feat in pvals.index:
      if pvals[feat] > tol and feat != 'const': # if the p-value is too large, add the feature to the list
        pval_too_big.append(feat)

    # now remove from curr_preds all the features whose p-value is too large
    for feat in pval_too_big:
      pop_index = curr_preds.index(feat)
      curr_preds.pop(pop_index)

    # refit on the surviving predictors so the next iteration's baseline
    # adjusted r-squared reflects the current model rather than a stale one
    lm = sm.OLS(y, X_mix[curr_preds]).fit()
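A hypothetical call might look like the following, assuming X_mix is a pandas DataFrame that already includes a 'const' column of ones and y is the response series (both are read as globals by the function above):

import statsmodels.api as sm

lm = sm.OLS(y, X_mix[['const']]).fit()  # start from the intercept-only model
curr_preds = ['const']
potential_preds = [c for c in X_mix.columns if c != 'const']

mixed_selection(lm, curr_preds, potential_preds, tol=0.05)
print(curr_preds)  # updated in place with the selected predictors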

Answer by HE Xin

I developed this repository https://github.com/xinhe97/StepwiseSelectionOLS

My stepwise selection classes (best subset, forward stepwise, backward stepwise) are compatible with sklearn. You can do Pipeline and GridSearchCV with my classes.

The essential part of my code is as follows:

################### Criteria ###################
def processSubset(self, X,y,feature_index):
    # Fit model on feature_set and calculate rsq_adj
    regr = sm.OLS(y,X[:,feature_index]).fit()
    rsq_adj = regr.rsquared_adj
    bic = self.myBic(X.shape[0], regr.mse_resid, len(feature_index))
    rsq = regr.rsquared
    return {"model":regr, "rsq_adj":rsq_adj, "bic":bic, "rsq":rsq, "predictors_index":feature_index}

################### Forward Stepwise ###################
def forward(self,predictors_index,X,y):
    # Pull out predictors we still need to process
    remaining_predictors_index = [p for p in range(X.shape[1])
                            if p not in predictors_index]

    results = []
    for p in remaining_predictors_index:
        new_predictors_index = predictors_index+[p]
        new_predictors_index.sort()
        results.append(self.processSubset(X,y,new_predictors_index))

    # Wrap everything up in a nice dataframe
    models = pd.DataFrame(results)
    # Choose the model with the highest rsq
    # (swap in the commented-out line to minimize bic instead)
    # best_model = models.loc[models['bic'].idxmin()]
    best_model = models.loc[models['rsq'].idxmax()]
    # Return the best model, along with the model's other information
    return best_model

def forwardK(self,X_est,y_est, fK):
    models_fwd = pd.DataFrame(columns=["model", "rsq_adj", "bic", "rsq", "predictors_index"])
    predictors_index = []

    M = min(fK,X_est.shape[1])

    for i in range(1,M+1):
        print(i)
        models_fwd.loc[i] = self.forward(predictors_index,X_est,y_est)
        predictors_index = models_fwd.loc[i,'predictors_index']

    print(models_fwd)
    # best_model_fwd = models_fwd.loc[models_fwd['bic'].idxmin(),'model']
    best_model_fwd = models_fwd.loc[models_fwd['rsq'].idxmax(),'model']
    # best_predictors = models_fwd.loc[models_fwd['bic'].idxmin(),'predictors_index']
    best_predictors = models_fwd.loc[models_fwd['rsq'].idxmax(),'predictors_index']
    return best_model_fwd, best_predictors
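Assuming these methods live on a class in that repository, a hypothetical usage might look like this (the class name below is illustrative, not necessarily the one in the repo):

selector = StepwiseSelectionOLS()  # hypothetical wrapper class holding the methods above
best_model, best_predictors = selector.forwardK(X, y, fK=5)
print(best_predictors)        # column indices of the selected features
print(best_model.summary())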