Python中的方差膨胀因子

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/42658379/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-19 22:00:10  来源:igfitidea点击:

Variance Inflation Factor in Python

pythonrnumpystatisticsstatsmodels

提问by Nizag

I'm trying to calculate the variance inflation factor (VIF) for each column in a simple dataset in python:

我正在尝试在 python 中的一个简单数据集中计算每一列的方差膨胀因子 (VIF):

a b c d
1 2 4 4
1 2 6 3
2 3 7 4
3 2 8 5
4 1 9 4

I have already done this in R using the vif function from the usdm librarywhich gives the following results:

我已经使用usdm 库中的 vif 函数在 R 中完成了此操作,结果如下:

a <- c(1, 1, 2, 3, 4)
b <- c(2, 2, 3, 2, 1)
c <- c(4, 6, 7, 8, 9)
d <- c(4, 3, 4, 5, 4)

df <- data.frame(a, b, c, d)
vif_df <- vif(df)
print(vif_df)

Variables   VIF
   a        22.95
   b        3.00
   c        12.95
   d        3.00

However, when I do the same in python using the statsmodel vif function, my results are:

但是,当我使用statsmodel vif 函数在 python 中做同样的事情时,我的结果是:

a = [1, 1, 2, 3, 4]
b = [2, 2, 3, 2, 1]
c = [4, 6, 7, 8, 9]
d = [4, 3, 4, 5, 4]

ck = np.column_stack([a, b, c, d])

vif = [variance_inflation_factor(ck, i) for i in range(ck.shape[1])]
print(vif)

Variables   VIF
   a        47.136986301369774
   b        28.931506849315081
   c        80.31506849315096
   d        40.438356164383549

The results are vastly different, even though the inputs are the same. In general, results from the statsmodel VIF function seem to be wrong, but I'm not sure if this is because of the way I am calling it or if it is an issue with the function itself.

即使输入相同,结果也大不相同。一般来说,statsmodel VIF 函数的结果似乎是错误的,但我不确定这是因为我调用它的方式还是函数本身的问题。

I was hoping someone could help me figure out whether I was incorrectly calling the statsmodel function or explain the discrepancies in the results. If it's an issue with the function then are there any VIF alternatives in python?

我希望有人可以帮助我弄清楚我是否错误地调用了 statsmodel 函数或解释了结果中的差异。如果这是函数的问题,那么在 python 中是否有任何 VIF 替代方案?

采纳答案by Drverzal

I believe the reason for this is due to a difference in Python's OLS. OLS, which is used in the python variance inflation factor calculation, does not add an intercept by default. You definitely want an intercept in there however.

我相信这是由于 Python 的 OLS 不同造成的。用于python方差膨胀因子计算的OLS,默认不加截距。但是,您肯定希望在那里进行拦截。

What you'd want to do is add one more column to your matrix, ck, filled with ones to represent a constant. This will be the the intercept term of the equation. Once this is done, your values should match out properly.

您想要做的是在矩阵 ck 中再添加一列,用 ck 填充以表示常量。这将是方程的截距项。完成此操作后,您的值应正确匹配。

Edited: replaced zeroes with ones

编辑:用一个替换零

回答by Alexander

As mentioned by others and in this postby Josef Perktold, the function's author, variance_inflation_factorexpects the presence of a constant in the matrix of explanatory variables. One can use add_constantfrom statsmodels to add the required constant to the dataframe before passing its values to the function.

正如其他人以及该函数的作者 Josef Perktold在这篇文章中所提到的,variance_inflation_factor期望解释变量矩阵中存在一个常数。add_constant在将其值传递给函数之前,可以使用from statsmodels 将所需的常量添加到数据帧中。

from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

df = pd.DataFrame(
    {'a': [1, 1, 2, 3, 4],
     'b': [2, 2, 3, 2, 1],
     'c': [4, 6, 7, 8, 9],
     'd': [4, 3, 4, 5, 4]}
)

X = add_constant(df)
>>> pd.Series([variance_inflation_factor(X.values, i) 
               for i in range(X.shape[1])], 
              index=X.columns)
const    136.875
a         22.950
b          3.000
c         12.950
d          3.000
dtype: float64

I believe you could also add the constant to the right most column of the dataframe using assign:

我相信您还可以使用以下方法将常量添加到数据框的最右侧列assign

X = df.assign(const=1)
>>> pd.Series([variance_inflation_factor(X.values, i) 
               for i in range(X.shape[1])], 
              index=X.columns)
a         22.950
b          3.000
c         12.950
d          3.000
const    136.875
dtype: float64

The source code itself is rather concise:

源代码本身相当简洁:

def variance_inflation_factor(exog, exog_idx):
    """
    exog : ndarray, (nobs, k_vars)
        design matrix with all explanatory variables, as for example used in
        regression
    exog_idx : int
        index of the exogenous variable in the columns of exog
    """
    k_vars = exog.shape[1]
    x_i = exog[:, exog_idx]
    mask = np.arange(k_vars) != exog_idx
    x_noti = exog[:, mask]
    r_squared_i = OLS(x_i, x_noti).fit().rsquared
    vif = 1. / (1. - r_squared_i)
    return vif

It is also rather simple to modify the code to return all of the VIFs as a series:

修改代码以将所有 VIF 作为一个系列返回也很简单:

from statsmodels.regression.linear_model import OLS
from statsmodels.tools.tools import add_constant

def variance_inflation_factors(exog_df):
    '''
    Parameters
    ----------
    exog_df : dataframe, (nobs, k_vars)
        design matrix with all explanatory variables, as for example used in
        regression.

    Returns
    -------
    vif : Series
        variance inflation factors
    '''
    exog_df = add_constant(exog_df)
    vifs = pd.Series(
        [1 / (1. - OLS(exog_df[col].values, 
                       exog_df.loc[:, exog_df.columns != col].values).fit().rsquared) 
         for col in exog_df],
        index=exog_df.columns,
        name='VIF'
    )
    return vifs

>>> variance_inflation_factors(df)
const    136.875
a         22.950
b          3.000
c         12.950
Name: VIF, dtype: float64

回答by T_T

For future comers to this thread (like me):

对于这个线程的未来来者(像我一样):

import numpy as np
import scipy as sp

a = [1, 1, 2, 3, 4]
b = [2, 2, 3, 2, 1]
c = [4, 6, 7, 8, 9]
d = [4, 3, 4, 5, 4]

ck = np.column_stack([a, b, c, d])
cc = sp.corrcoef(ck, rowvar=False)
VIF = np.linalg.inv(cc)
VIF.diagonal()

This code gives

这段代码给出

array([22.95,  3.  , 12.95,  3.  ])

[EDIT]

[编辑]

In response to a comment, I tried to use DataFrameas much as possible (numpyis required to invert a matrix).

为了回应评论,我尝试DataFrame尽可能多地使用(numpy需要反转矩阵)。

import pandas as pd
import numpy as np

a = [1, 1, 2, 3, 4]
b = [2, 2, 3, 2, 1]
c = [4, 6, 7, 8, 9]
d = [4, 3, 4, 5, 4]

df = pd.DataFrame({'a':a,'b':b,'c':c,'d':d})
df_cor = df.corr()
pd.DataFrame(np.linalg.inv(df.corr().values), index = df_cor.index, columns=df_cor.columns)

The code gives

代码给出

       a            b           c           d
a   22.950000   6.453681    -16.301917  -6.453681
b   6.453681    3.000000    -4.080441   -2.000000
c   -16.301917  -4.080441   12.950000   4.080441
d   -6.453681   -2.000000   4.080441    3.000000

The diagonal elements give VIF.

对角线元素给出 VIF。

回答by steven

In case you don't wanna deal with variance_inflation_factorand add_constant. Please consider the following two functions.

如果你不想处理variance_inflation_factoradd_constant。请考虑以下两个函数。

1. Use formula in statasmodels:

1.在statasmodels中使用公式:

import pandas as pd
import statsmodels.formula.api as smf

def get_vif(exogs, data):
    '''Return VIF (variance inflation factor) DataFrame

    Args:
    exogs (list): list of exogenous/independent variables
    data (DataFrame): the df storing all variables

    Returns:
    VIF and Tolerance DataFrame for each exogenous variable

    Notes:
    Assume we have a list of exogenous variable [X1, X2, X3, X4].
    To calculate the VIF and Tolerance for each variable, we regress
    each of them against other exogenous variables. For instance, the
    regression model for X3 is defined as:
                        X3 ~ X1 + X2 + X4
    And then we extract the R-squared from the model to calculate:
                    VIF = 1 / (1 - R-squared)
                    Tolerance = 1 - R-squared
    The cutoff to detect multicollinearity:
                    VIF > 10 or Tolerance < 0.1
    '''

    # initialize dictionaries
    vif_dict, tolerance_dict = {}, {}

    # create formula for each exogenous variable
    for exog in exogs:
        not_exog = [i for i in exogs if i != exog]
        formula = f"{exog} ~ {' + '.join(not_exog)}"

        # extract r-squared from the fit
        r_squared = smf.ols(formula, data=data).fit().rsquared

        # calculate VIF
        vif = 1/(1 - r_squared)
        vif_dict[exog] = vif

        # calculate tolerance
        tolerance = 1 - r_squared
        tolerance_dict[exog] = tolerance

    # return VIF DataFrame
    df_vif = pd.DataFrame({'VIF': vif_dict, 'Tolerance': tolerance_dict})

    return df_vif



2. Use LinearRegressionin sklearn:

2.LinearRegression在sklearn中使用:

# import warnings
# warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
from sklearn.linear_model import LinearRegression

def sklearn_vif(exogs, data):

    # initialize dictionaries
    vif_dict, tolerance_dict = {}, {}

    # form input data for each exogenous variable
    for exog in exogs:
        not_exog = [i for i in exogs if i != exog]
        X, y = data[not_exog], data[exog]

        # extract r-squared from the fit
        r_squared = LinearRegression().fit(X, y).score(X, y)

        # calculate VIF
        vif = 1/(1 - r_squared)
        vif_dict[exog] = vif

        # calculate tolerance
        tolerance = 1 - r_squared
        tolerance_dict[exog] = tolerance

    # return VIF DataFrame
    df_vif = pd.DataFrame({'VIF': vif_dict, 'Tolerance': tolerance_dict})

    return df_vif



Example:

例子:

import seaborn as sns

df = sns.load_dataset('car_crashes')
exogs = ['alcohol', 'speeding', 'no_previous', 'not_distracted']

[In] %%timeit -n 100
get_vif(exogs=exogs, data=df)

[Out]
                      VIF   Tolerance
alcohol          3.436072   0.291030
no_previous      3.113984   0.321132
not_distracted   2.668456   0.374749
speeding         1.884340   0.530690

69.6 ms ± 8.96 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

[In] %%timeit -n 100
sklearn_vif(exogs=exogs, data=df)

[Out]
                      VIF   Tolerance
alcohol          3.436072   0.291030
no_previous      3.113984   0.321132
not_distracted   2.668456   0.374749
speeding         1.884340   0.530690

15.7 ms ± 1.4 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

回答by Saqib Mujtaba

Example for Boston Data:

波士顿数据示例:

VIFis calculated by auxiliary regression, so not dependent on the actual fit.

VIF是通过辅助回归计算的,因此不依赖于实际拟合。

See below:

见下文:

from patsy import dmatrices
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

# Break into left and right hand side; y and X
y, X = dmatrices(formula="medv ~ crim + zn + nox + ptratio + black + rm ", data=boston, return_type="dataframe")

# For each Xi, calculate VIF
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

# Fit X to y
result = sm.OLS(y, X).fit()

回答by Chef1075

I wrote this function based on some other posts I saw on Stack and CrossValidated. It shows the features which are over the threshold and returns a new dataframe with the features removed.

我根据我在 Stack 和 CrossValidated 上看到的其他一些帖子编写了这个函数。它显示超过阈值的特征,并返回一个删除了特征的新数据帧。

from statsmodels.stats.outliers_influence import variance_inflation_factor 
from statsmodels.tools.tools import add_constant

def calculate_vif_(df, thresh=5):
    '''
    Calculates VIF each feature in a pandas dataframe
    A constant must be added to variance_inflation_factor or the results will be incorrect

    :param df: the pandas dataframe containing only the predictor features, not the response variable
    :param thresh: the max VIF value before the feature is removed from the dataframe
    :return: dataframe with features removed
    '''
    const = add_constant(df)
    cols = const.columns
    variables = np.arange(const.shape[1])
    vif_df = pd.Series([variance_inflation_factor(const.values, i) 
               for i in range(const.shape[1])], 
              index=const.columns).to_frame()

    vif_df = vif_df.sort_values(by=0, ascending=False).rename(columns={0: 'VIF'})
    vif_df = vif_df.drop('const')
    vif_df = vif_df[vif_df['VIF'] > thresh]

    print 'Features above VIF threshold:\n'
    print vif_df[vif_df['VIF'] > thresh]

    col_to_drop = list(vif_df.index)

    for i in col_to_drop:
        print 'Dropping: {}'.format(i)
        df = df.drop(columns=i)

    return df

回答by Md Asraful Kabir

Although it is already late, I am adding some modifications from the given answer. To get the best set after removing multicollinearity if we use @Chef1075 solution then we will lose the variables which are correlated. We have to remove only one of them. To do this I came with the following solution using @steve answer:

虽然已经晚了,但我正在从给定的答案中添加一些修改。如果我们使用@Chef1075 解决方案,为了在去除多重共线性后获得最佳集合,那么我们将丢失相关的变量。我们只需要删除其中之一。为此,我使用@steve 回答提供了以下解决方案:

import pandas as pd
from sklearn.linear_model import LinearRegression

def sklearn_vif(exogs, data):
    '''
    This function calculates variance inflation function in sklearn way. 
     It is a comparatively faster process.

    '''
    # initialize dictionaries
    vif_dict, tolerance_dict = {}, {}

    # form input data for each exogenous variable
    for exog in exogs:
        not_exog = [i for i in exogs if i != exog]
        X, y = data[not_exog], data[exog]

        # extract r-squared from the fit
        r_squared = LinearRegression().fit(X, y).score(X, y)

        # calculate VIF
        vif = 1/(1 - r_squared)
        vif_dict[exog] = vif

        # calculate tolerance
        tolerance = 1 - r_squared
        tolerance_dict[exog] = tolerance

    # return VIF DataFrame
    df_vif = pd.DataFrame({'VIF': vif_dict, 'Tolerance': tolerance_dict})

    return df_vif
df = pd.DataFrame(
{'a': [1, 1, 2, 3, 4,1],
 'b': [2, 2, 3, 2, 1,3],
 'c': [4, 6, 7, 8, 9,5],
 'd': [4, 3, 4, 5, 4,6],
 'e': [8,8,14,15,17,20]}
  )

df_vif= sklearn_vif(exogs=df.columns, data=df).sort_values(by='VIF',ascending=False)
while (df_vif.VIF>5).any() ==True:
    red_df_vif= df_vif.drop(df_vif.index[0])
    df= df[red_df_vif.index]
    df_vif=sklearn_vif(exogs=df.columns,data=df).sort_values(by='VIF',ascending=False)




print(df)

   d  c  b
0  4  4  2
1  3  6  2
2  4  7  3
3  5  8  2
4  4  9  1
5  6  5  3

回答by Max Alonzo

here code using dataframe python:

这里使用数据框 python 的代码:

To create data

创建数据

import numpy as np
import scipy as sp

import numpy as np
import scipy as sp

a = [1, 1, 2, 3, 4]
b = [2, 2, 3, 2, 1]
c = [4, 6, 7, 8, 9]
d = [4, 3, 4, 5, 4]

a = [1, 1, 2, 3, 4]
b = [2, 2, 3, 2, 1]
c = [4, 6, 7, 8, 9]
d = [4, 3, 4, 5, 4]

To create dataframe

创建数据框

import pandas as pd
data = pd.DataFrame()
data["a"] = a
data["b"] = b
data["c"] = c
data["d"] = d

import pandas as pd
data = pd.DataFrame()
data["a"] = a
data["b"] = b
data["c"] = c
data["d"] = d

Calculate VIF

计算 VIF

cc = np.corrcoef(data, rowvar=False)
VIF = np.linalg.inv(cc)
VIF.diagonal()

cc = np.corrcoef(data, rowvar=False)
VIF = np.linalg.inv(cc)
VIF.diagonal()

Result

结果

array([22.95, 3. , 12.95, 3. ])

array([22.95, 3. , 12.95, 3. ])