Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/15433372/

Stepwise Regression in Python

Tags: python, scipy, regression

Asked by user2174063

How to perform stepwise regression in python? There are methods for OLS in SciPy, but I am not able to do stepwise regression. Any help in this regard would be greatly appreciated. Thanks.

Edit: I am trying to build a linear regression model. I have 5 independent variables, and using forward stepwise regression, I aim to select variables such that my model has the lowest p-value. The following link explains the objective:

https://www.google.co.in/url?sa=t&rct=j&q=&esrc=s&source=web&cd=4&ved=0CEAQFjAD&url=http%3A%2F%2Fbusiness.fullerton.edu%2Fisds%2Fjlawrence%2FStat-On-Line%2FExcel%2520Notes%2FExcel%2520Notes%2520-%2520STEPWISE%2520REGRESSION.doc&ei=YjKsUZzXHoPwrQfGs4GQCg&usg=AFQjCNGDaQ7qRhyBaQCmLeO4OD2RVkUhzw&bvm=bv.47244034,d.bmk

Thanks again.

Answer by Matti Pastell

Statsmodels has additional methods for regression: http://statsmodels.sourceforge.net/devel/examples/generated/example_ols.html. I think it will help you to implement stepwise regression.

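For reference, a minimal OLS fit with statsmodels might look like the following (the data here is synthetic, just to keep the snippet self-contained); the per-coefficient p-values it reports are what a stepwise procedure acts on:

import numpy as np
import statsmodels.api as sm

# Synthetic data: y depends only on the first two of five regressors
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=100)

X_const = sm.add_constant(X)      # add an intercept column
model = sm.OLS(y, X_const).fit()  # ordinary least squares fit

print(model.summary())            # full regression table
print(model.pvalues)              # per-coefficient p-values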

Answer by Aaron Schumacher

Trevor Smith and I wrote a little forward selection function for linear regression with statsmodels: http://planspace.org/20150423-forward_selection_with_statsmodels/ You could easily modify it to minimize a p-value, or select based on beta p-values, with just a little more work.

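The core of that post is roughly the following (a sketch in the same spirit, not the exact code from the link): at each step, fit a model with each remaining candidate added, keep the candidate that most improves adjusted R-squared, and stop when no candidate helps. It assumes a pandas DataFrame data whose columns include the response column named by response:

import statsmodels.formula.api as smf

def forward_selection(data, response):
    # Forward selection on adjusted R-squared using statsmodels formulas
    remaining = set(data.columns) - {response}
    selected = []
    best_score = 0.0
    while remaining:
        scores = []
        for candidate in remaining:
            formula = "{} ~ {}".format(response, " + ".join(selected + [candidate]))
            scores.append((smf.ols(formula, data).fit().rsquared_adj, candidate))
        score, candidate = max(scores)
        if score <= best_score:
            break  # no remaining candidate improves the fit
        remaining.remove(candidate)
        selected.append(candidate)
        best_score = score
    formula = "{} ~ {}".format(response, " + ".join(selected) or "1")
    return smf.ols(formula, data).fit()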

Answer by Varun-08

"""Importing the api class from statsmodels"""
import statsmodels.formula.api as sm

"""X_opt variable has all the columns of independent variables of matrix X 
in this case we have 5 independent variables"""
X_opt = X[:,[0,1,2,3,4]]

"""Running the OLS method on X_opt and storing results in regressor_OLS"""
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()

Using the summary method, you can check in your kernel the p-values of your variables, listed as 'P>|t|'. Then find the variable with the highest p-value. Suppose x3 has the highest value, e.g. 0.956. Then remove this column from your array and repeat all the steps.

X_opt = X[:, [0, 1, 3, 4]]  # the column for x3 (index 2) has been removed
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()

Repeat these steps until you have removed all the columns whose p-value is above the significance level (e.g. 0.05). In the end, your variable X_opt will contain only the optimal variables, all with p-values below the significance level.

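The delete-and-refit cycle can also be automated. A minimal sketch, assuming X is a NumPy array that already contains the column of ones for the intercept:

import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, sl=0.05):
    # Repeatedly drop the column with the highest p-value until all are below sl
    X_opt = X.copy()
    while True:
        regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
        pvalues = np.asarray(regressor_OLS.pvalues)
        worst = pvalues.argmax()
        if pvalues[worst] <= sl:
            return regressor_OLS, X_opt
        X_opt = np.delete(X_opt, worst, axis=1)  # drop the least significant column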

Answer by David Dale

You can make forward-backward selection based on the statsmodels.api.OLS model, as shown in this answer.

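In outline, that approach alternates a forward step (add the excluded variable with the smallest p-value, if below a threshold) with a backward step (drop any included variable whose p-value has risen above a second threshold). A condensed sketch of the idea, assuming a pandas DataFrame X and a Series y (the thresholds are illustrative):

import statsmodels.api as sm

def stepwise_selection(X, y, threshold_in=0.01, threshold_out=0.05):
    included = []
    while True:
        changed = False
        # forward step: add the most significant excluded variable
        excluded = [c for c in X.columns if c not in included]
        pvals = {c: sm.OLS(y, sm.add_constant(X[included + [c]])).fit().pvalues[c]
                 for c in excluded}
        if pvals:
            best = min(pvals, key=pvals.get)
            if pvals[best] < threshold_in:
                included.append(best)
                changed = True
        # backward step: drop the least significant included variable
        if included:
            pvalues = sm.OLS(y, sm.add_constant(X[included])).fit().pvalues.drop('const')
            if pvalues.max() > threshold_out:
                included.remove(pvalues.idxmax())
                changed = True
        if not changed:
            return included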

However, this answer describes why you should not use stepwise selection for econometric models in the first place.

Answer by Regi Mathew

You may try mlxtend, which offers a variety of feature selection methods.

from mlxtend.feature_selection import SequentialFeatureSelector as sfs
from sklearn.linear_model import LinearRegression

clf = LinearRegression()

# Build the step-forward feature selector
sfs1 = sfs(clf, k_features=10, forward=True, floating=False, scoring='r2', cv=5)

# Perform sequential forward selection
sfs1 = sfs1.fit(X_train, y_train)
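After fitting, the selector exposes which columns were chosen and how well they scored (attribute names per the mlxtend API):

print(sfs1.k_feature_idx_)   # indices of the selected features
print(sfs1.k_score_)         # cross-validated score of the selected subset
X_train_sfs = sfs1.transform(X_train)  # keep only the selected columns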

Answer by Jacob Helwig

Here's a method I just wrote that uses "mixed selection" as described in Introduction to Statistical Learning. As input, it takes:

  • lm, a fitted statsmodels OLS model, sm.OLS(Y, X).fit(), where X starts as an array of n ones (n being the number of data points) and Y is the response in the training data

  • curr_preds - a list containing just ['const']

  • potential_preds - a list of all potential predictors. There also needs to be a pandas DataFrame X_mix that holds all of the data, including 'const' and the columns for every potential predictor

  • tol, optional - the maximum p-value, 0.05 if not specified


def mixed_selection(lm, curr_preds, potential_preds, tol=.05):
  while len(potential_preds) > 0:
    index_best = -1 # this will record the index of the best predictor
    curr = -1 # this will record the current index
    best_r_squared = lm.rsquared_adj # record the adjusted r-squared of the current model
    # loop to determine whether any of the predictors can better the r-squared
    for pred in potential_preds:
      curr += 1 # increment current
      preds = curr_preds.copy() # grab the current predictors
      preds.append(pred)
      lm_new = sm.OLS(y, X_mix[preds]).fit() # create a model with the current predictors plus one additional potential predictor
      new_r_sq = lm_new.rsquared_adj # record the adjusted r-squared of the new model
      if new_r_sq > best_r_squared:
        best_r_squared = new_r_sq
        index_best = curr

    if index_best != -1: # a potential predictor improved the r-squared; remove it from potential_preds and add it to curr_preds
      curr_preds.append(potential_preds.pop(index_best))
    else: # none of the remaining potential predictors improved the adjusted r-squared; exit the loop
      break

    # fit a new lm using the new predictors and look at the p-values
    pvals = sm.OLS(y, X_mix[curr_preds]).fit().pvalues
    pval_too_big = []
    # make a list of all the p-values that are greater than the tolerance
    for feat in pvals.index:
      if pvals[feat] > tol and feat != 'const': # if the p-value is too large, add the feature to the list
        pval_too_big.append(feat)

    # now remove from curr_preds all the features whose p-value is too large
    for feat in pval_too_big:
      pop_index = curr_preds.index(feat)
      curr_preds.pop(pop_index)

    # refit on the surviving predictors so the next iteration's baseline
    # adjusted r-squared reflects the current model rather than a stale one
    lm = sm.OLS(y, X_mix[curr_preds]).fit()
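A hypothetical call might look like the following, assuming X_mix is a pandas DataFrame that already includes a 'const' column of ones and y is the response series (both are read as globals by the function above):

import statsmodels.api as sm

lm = sm.OLS(y, X_mix[['const']]).fit()  # start from the intercept-only model
curr_preds = ['const']
potential_preds = [c for c in X_mix.columns if c != 'const']

mixed_selection(lm, curr_preds, potential_preds, tol=0.05)
print(curr_preds)  # updated in place with the selected predictors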

Answer by HE Xin

I developed this repository https://github.com/xinhe97/StepwiseSelectionOLS

My stepwise selection classes (best subset, forward stepwise, backward stepwise) are compatible with sklearn. You can do Pipeline and GridSearchCV with my classes.

The essential part of my code is as follows:

################### Criteria ###################
def processSubset(self, X,y,feature_index):
    # Fit model on feature_set and calculate rsq_adj
    regr = sm.OLS(y,X[:,feature_index]).fit()
    rsq_adj = regr.rsquared_adj
    bic = self.myBic(X.shape[0], regr.mse_resid, len(feature_index))
    rsq = regr.rsquared
    return {"model":regr, "rsq_adj":rsq_adj, "bic":bic, "rsq":rsq, "predictors_index":feature_index}

################### Forward Stepwise ###################
def forward(self,predictors_index,X,y):
    # Pull out predictors we still need to process
    remaining_predictors_index = [p for p in range(X.shape[1])
                            if p not in predictors_index]

    results = []
    for p in remaining_predictors_index:
        new_predictors_index = predictors_index+[p]
        new_predictors_index.sort()
        results.append(self.processSubset(X,y,new_predictors_index))

    # Wrap everything up in a nice dataframe
    models = pd.DataFrame(results)
    # Choose the model with the highest rsq
    # (swap in the commented-out line to minimize bic instead)
    # best_model = models.loc[models['bic'].idxmin()]
    best_model = models.loc[models['rsq'].idxmax()]
    # Return the best model, along with the model's other information
    return best_model

def forwardK(self,X_est,y_est, fK):
    models_fwd = pd.DataFrame(columns=["model", "rsq_adj", "bic", "rsq", "predictors_index"])
    predictors_index = []

    M = min(fK,X_est.shape[1])

    for i in range(1,M+1):
        print(i)
        models_fwd.loc[i] = self.forward(predictors_index,X_est,y_est)
        predictors_index = models_fwd.loc[i,'predictors_index']

    print(models_fwd)
    # best_model_fwd = models_fwd.loc[models_fwd['bic'].idxmin(),'model']
    best_model_fwd = models_fwd.loc[models_fwd['rsq'].idxmax(),'model']
    # best_predictors = models_fwd.loc[models_fwd['bic'].idxmin(),'predictors_index']
    best_predictors = models_fwd.loc[models_fwd['rsq'].idxmax(),'predictors_index']
    return best_model_fwd, best_predictors
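Assuming these methods live on a class in that repository, a hypothetical usage might look like this (the class name below is illustrative, not necessarily the one in the repo):

selector = StepwiseSelectionOLS()  # hypothetical wrapper class holding the methods above
best_model, best_predictors = selector.forwardK(X, y, fK=5)
print(best_predictors)        # column indices of the selected features
print(best_model.summary())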