Variance Inflation Factor in Python
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/42658379/
Asked by Nizag
I'm trying to calculate the variance inflation factor (VIF) for each column of a simple dataset in Python:
a b c d
1 2 4 4
1 2 6 3
2 3 7 4
3 2 8 5
4 1 9 4
I have already done this in R using the vif function from the usdm library, which gives the following results:
library(usdm)
a <- c(1, 1, 2, 3, 4)
b <- c(2, 2, 3, 2, 1)
c <- c(4, 6, 7, 8, 9)
d <- c(4, 3, 4, 5, 4)
df <- data.frame(a, b, c, d)
vif_df <- vif(df)
print(vif_df)
Variables VIF
a 22.95
b 3.00
c 12.95
d 3.00
However, when I do the same in Python using the statsmodels VIF function, my results are:
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
a = [1, 1, 2, 3, 4]
b = [2, 2, 3, 2, 1]
c = [4, 6, 7, 8, 9]
d = [4, 3, 4, 5, 4]
ck = np.column_stack([a, b, c, d])
vif = [variance_inflation_factor(ck, i) for i in range(ck.shape[1])]
print(vif)
Variables VIF
a 47.136986301369774
b 28.931506849315081
c 80.31506849315096
d 40.438356164383549
The results are vastly different, even though the inputs are the same. In general, results from the statsmodels VIF function seem to be wrong, but I'm not sure whether this is because of the way I am calling it or because of an issue with the function itself.
I was hoping someone could help me figure out whether I am calling the statsmodels function incorrectly, or explain the discrepancies in the results. If it is an issue with the function, are there any VIF alternatives in Python?
Accepted answer by Drverzal
I believe this is due to a difference in how OLS is set up in Python. The OLS used in the statsmodels variance inflation factor calculation does not add an intercept by default, and you definitely want an intercept in there.
What you'd want to do is add one more column to your matrix, ck, filled with ones to represent a constant. This will be the intercept term of the equation. Once this is done, your values should match properly.
Edited: replaced zeroes with ones
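To make the effect of the missing intercept concrete, here is a minimal NumPy-only sketch rather than the statsmodels implementation. It assumes, following the statsmodels documentation for OLS rsquared, that R-squared is measured against the uncentered total sum of squares when the design matrix has no constant and against the centered one otherwise. The helper name vif_by_regression is hypothetical and not part of any library.

```python
import numpy as np

def vif_by_regression(X, add_intercept):
    """Hypothetical helper: VIF of every column of X via auxiliary least squares."""
    X = np.asarray(X, dtype=float)
    n_obs, n_vars = X.shape
    vifs = []
    for i in range(n_vars):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        if add_intercept:
            # prepend a column of ones, as the answer suggests
            others = np.column_stack([np.ones(n_obs), others])
            tss = np.sum((y - y.mean()) ** 2)  # centered total sum of squares
        else:
            tss = np.sum(y ** 2)               # uncentered total sum of squares
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        ssr = np.sum((y - others @ coef) ** 2)  # residual sum of squares
        vifs.append(tss / ssr)                  # identical to 1 / (1 - R^2)
    return np.array(vifs)

a = [1, 1, 2, 3, 4]
b = [2, 2, 3, 2, 1]
c = [4, 6, 7, 8, 9]
d = [4, 3, 4, 5, 4]
ck = np.column_stack([a, b, c, d])

vif_no_const = vif_by_regression(ck, add_intercept=False)
vif_with_const = vif_by_regression(ck, add_intercept=True)
print(vif_no_const)    # compare with the statsmodels numbers in the question
print(vif_with_const)  # compare with the R numbers in the question
```

Under these assumptions the first array should line up with the values the question got from statsmodels, and the second with the values from R, which suggests the discrepancy comes from the missing constant rather than from a bug in either library.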
Answer by Alexander
As mentioned by others, and in this post by Josef Perktold, the function's author, variance_inflation_factor expects the presence of a constant in the matrix of explanatory variables. One can use add_constant from statsmodels to add the required constant to the dataframe before passing its values to the function.
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
df = pd.DataFrame(
{'a': [1, 1, 2, 3, 4],
'b': [2, 2, 3, 2, 1],
'c': [4, 6, 7, 8, 9],
'd': [4, 3, 4, 5, 4]}
)
X = add_constant(df)
>>> pd.Series([variance_inflation_factor(X.values, i)
for i in range(X.shape[1])],
index=X.columns)
const 136.875
a 22.950
b 3.000
c 12.950
d 3.000
dtype: float64
I believe you could also add the constant as the rightmost column of the dataframe using assign:
X = df.assign(const=1)
>>> pd.Series([variance_inflation_factor(X.values, i)
for i in range(X.shape[1])],
index=X.columns)
a 22.950
b 3.000
c 12.950
d 3.000
const 136.875
dtype: float64
The source code itself is rather concise:
def variance_inflation_factor(exog, exog_idx):
    """
    exog : ndarray, (nobs, k_vars)
        design matrix with all explanatory variables, as for example used in
        regression
    exog_idx : int
        index of the exogenous variable in the columns of exog
    """
    k_vars = exog.shape[1]
    x_i = exog[:, exog_idx]
    mask = np.arange(k_vars) != exog_idx
    x_noti = exog[:, mask]
    r_squared_i = OLS(x_i, x_noti).fit().rsquared
    vif = 1. / (1. - r_squared_i)
    return vif
It is also rather simple to modify the code to return all of the VIFs as a series:
import pandas as pd
from statsmodels.regression.linear_model import OLS
from statsmodels.tools.tools import add_constant
def variance_inflation_factors(exog_df):
    '''
    Parameters
    ----------
    exog_df : dataframe, (nobs, k_vars)
        design matrix with all explanatory variables, as for example used in
        regression.
    Returns
    -------
    vif : Series
        variance inflation factors
    '''
    exog_df = add_constant(exog_df)
    vifs = pd.Series(
        [1 / (1. - OLS(exog_df[col].values,
                       exog_df.loc[:, exog_df.columns != col].values).fit().rsquared)
         for col in exog_df],
        index=exog_df.columns,
        name='VIF'
    )
    return vifs
>>> variance_inflation_factors(df)
const 136.875
a 22.950
b 3.000
c 12.950
d 3.000
Name: VIF, dtype: float64
Answer by T_T
For future visitors to this thread (like me):
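import numpy as np
a = [1, 1, 2, 3, 4]
b = [2, 2, 3, 2, 1]
c = [4, 6, 7, 8, 9]
d = [4, 3, 4, 5, 4]
ck = np.column_stack([a, b, c, d])
# correlation matrix of the columns (corrcoef lives in NumPy; the scipy alias is deprecated)
cc = np.corrcoef(ck, rowvar=False)
# the diagonal of the inverse correlation matrix holds the VIFs
VIF = np.linalg.inv(cc)
VIF.diagonal()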
This code gives
array([22.95, 3. , 12.95, 3. ])
[EDIT]
In response to a comment, I tried to use DataFrame as much as possible (numpy is still required to invert the matrix).
import pandas as pd
import numpy as np
a = [1, 1, 2, 3, 4]
b = [2, 2, 3, 2, 1]
c = [4, 6, 7, 8, 9]
d = [4, 3, 4, 5, 4]
df = pd.DataFrame({'a':a,'b':b,'c':c,'d':d})
df_cor = df.corr()
pd.DataFrame(np.linalg.inv(df_cor.values), index=df_cor.index, columns=df_cor.columns)
The code gives
a b c d
a 22.950000 6.453681 -16.301917 -6.453681
b 6.453681 3.000000 -4.080441 -2.000000
c -16.301917 -4.080441 12.950000 4.080441
d -6.453681 -2.000000 4.080441 3.000000
The diagonal elements give the VIFs.
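As a sanity check on why the diagonal works, the sketch below compares it with the regression definition VIF = 1 / (1 - R^2), where each column is regressed on the remaining columns plus an intercept. It uses only NumPy and the data from the question; the variable names are illustrative.

```python
import numpy as np

data = np.array([
    [1, 2, 4, 4],
    [1, 2, 6, 3],
    [2, 3, 7, 4],
    [3, 2, 8, 5],
    [4, 1, 9, 4],
], dtype=float)

# VIFs read off the diagonal of the inverse correlation matrix
vif_from_corr = np.diag(np.linalg.inv(np.corrcoef(data, rowvar=False)))

# VIFs from auxiliary regressions that include an intercept
vif_from_ols = []
for i in range(data.shape[1]):
    y = data[:, i]
    X = np.column_stack([np.ones(len(y)), np.delete(data, i, axis=1)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    r_squared = 1 - np.sum((y - X @ coef) ** 2) / np.sum((y - y.mean()) ** 2)
    vif_from_ols.append(1 / (1 - r_squared))

print(vif_from_corr)
print(np.allclose(vif_from_corr, vif_from_ols))
```

Because correlations are computed on centered data, the intercept is handled implicitly in this approach. It does require the correlation matrix to be invertible, so it breaks down when columns are perfectly collinear.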
Answer by steven
In case you don't want to deal with variance_inflation_factor and add_constant, please consider the following two functions.
1. Use the formula API in statsmodels:
import pandas as pd
import statsmodels.formula.api as smf
def get_vif(exogs, data):
    '''Return VIF (variance inflation factor) DataFrame
    Args:
    exogs (list): list of exogenous/independent variables
    data (DataFrame): the df storing all variables
    Returns:
    VIF and Tolerance DataFrame for each exogenous variable
    Notes:
    Assume we have a list of exogenous variable [X1, X2, X3, X4].
    To calculate the VIF and Tolerance for each variable, we regress
    each of them against other exogenous variables. For instance, the
    regression model for X3 is defined as:
    X3 ~ X1 + X2 + X4
    And then we extract the R-squared from the model to calculate:
    VIF = 1 / (1 - R-squared)
    Tolerance = 1 - R-squared
    The cutoff to detect multicollinearity:
    VIF > 10 or Tolerance < 0.1
    '''
    # initialize dictionaries
    vif_dict, tolerance_dict = {}, {}
    # create formula for each exogenous variable
    for exog in exogs:
        not_exog = [i for i in exogs if i != exog]
        formula = f"{exog} ~ {' + '.join(not_exog)}"
        # extract r-squared from the fit
        r_squared = smf.ols(formula, data=data).fit().rsquared
        # calculate VIF
        vif = 1/(1 - r_squared)
        vif_dict[exog] = vif
        # calculate tolerance
        tolerance = 1 - r_squared
        tolerance_dict[exog] = tolerance
    # return VIF DataFrame
    df_vif = pd.DataFrame({'VIF': vif_dict, 'Tolerance': tolerance_dict})
    return df_vif
2. Use LinearRegression in sklearn:
# import warnings
# warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
from sklearn.linear_model import LinearRegression
def sklearn_vif(exogs, data):
    # initialize dictionaries
    vif_dict, tolerance_dict = {}, {}
    # form input data for each exogenous variable
    for exog in exogs:
        not_exog = [i for i in exogs if i != exog]
        X, y = data[not_exog], data[exog]
        # extract r-squared from the fit
        r_squared = LinearRegression().fit(X, y).score(X, y)
        # calculate VIF
        vif = 1/(1 - r_squared)
        vif_dict[exog] = vif
        # calculate tolerance
        tolerance = 1 - r_squared
        tolerance_dict[exog] = tolerance
    # return VIF DataFrame
    df_vif = pd.DataFrame({'VIF': vif_dict, 'Tolerance': tolerance_dict})
    return df_vif
Example:
import seaborn as sns
df = sns.load_dataset('car_crashes')
exogs = ['alcohol', 'speeding', 'no_previous', 'not_distracted']
[In] %%timeit -n 100
get_vif(exogs=exogs, data=df)
[Out]
VIF Tolerance
alcohol 3.436072 0.291030
no_previous 3.113984 0.321132
not_distracted 2.668456 0.374749
speeding 1.884340 0.530690
69.6 ms ± 8.96 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
[In] %%timeit -n 100
sklearn_vif(exogs=exogs, data=df)
[Out]
VIF Tolerance
alcohol 3.436072 0.291030
no_previous 3.113984 0.321132
not_distracted 2.668456 0.374749
speeding 1.884340 0.530690
15.7 ms ± 1.4 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
Answer by Saqib Mujtaba
Example for the Boston data:
VIF is calculated by auxiliary regression, so it does not depend on the actual fit.
See below:
from patsy import dmatrices
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
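# boston is assumed to be a pandas DataFrame of the Boston housing data; it is not defined in the original answer
# Break into left and right hand side; y and X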
y, X = dmatrices(formula="medv ~ crim + zn + nox + ptratio + black + rm ", data=boston, return_type="dataframe")
# For each Xi, calculate VIF
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
# Fit X to y
result = sm.OLS(y, X).fit()
Answer by Chef1075
I wrote this function based on some other posts I saw on Stack Overflow and Cross Validated. It prints the features whose VIF is above the threshold and returns a new dataframe with those features removed.
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
def calculate_vif_(df, thresh=5):
    '''
    Calculates VIF for each feature in a pandas dataframe
    A constant must be added to variance_inflation_factor or the results will be incorrect
    :param df: the pandas dataframe containing only the predictor features, not the response variable
    :param thresh: the max VIF value before the feature is removed from the dataframe
    :return: dataframe with features removed
    '''
    const = add_constant(df)
    vif_df = pd.Series([variance_inflation_factor(const.values, i)
                        for i in range(const.shape[1])],
                       index=const.columns).to_frame()
    vif_df = vif_df.sort_values(by=0, ascending=False).rename(columns={0: 'VIF'})
    # the VIF of the constant itself is not of interest
    vif_df = vif_df.drop('const')
    # keep only the features above the threshold
    vif_df = vif_df[vif_df['VIF'] > thresh]
    print('Features above VIF threshold:\n')
    print(vif_df)
    col_to_drop = list(vif_df.index)
    for i in col_to_drop:
        print('Dropping: {}'.format(i))
        df = df.drop(columns=i)
    return df
Answer by Md Asraful Kabir
Although it is late, I am adding some modifications to the given answers. If we use @Chef1075's solution to remove multicollinearity, we lose every variable that is correlated, when we only need to remove one of them at a time. To do this I came up with the following solution using @steven's answer:
import pandas as pd
from sklearn.linear_model import LinearRegression
def sklearn_vif(exogs, data):
    '''
    This function calculates the variance inflation factor the sklearn way.
    It is a comparatively faster process.
    '''
    # initialize dictionaries
    vif_dict, tolerance_dict = {}, {}
    # form input data for each exogenous variable
    for exog in exogs:
        not_exog = [i for i in exogs if i != exog]
        X, y = data[not_exog], data[exog]
        # extract r-squared from the fit
        r_squared = LinearRegression().fit(X, y).score(X, y)
        # calculate VIF
        vif = 1/(1 - r_squared)
        vif_dict[exog] = vif
        # calculate tolerance
        tolerance = 1 - r_squared
        tolerance_dict[exog] = tolerance
    # return VIF DataFrame
    df_vif = pd.DataFrame({'VIF': vif_dict, 'Tolerance': tolerance_dict})
    return df_vif
df = pd.DataFrame(
    {'a': [1, 1, 2, 3, 4, 1],
     'b': [2, 2, 3, 2, 1, 3],
     'c': [4, 6, 7, 8, 9, 5],
     'd': [4, 3, 4, 5, 4, 6],
     'e': [8, 8, 14, 15, 17, 20]}
)
df_vif = sklearn_vif(exogs=df.columns, data=df).sort_values(by='VIF', ascending=False)
# drop the feature with the highest VIF, one at a time, until no VIF exceeds 5
while (df_vif.VIF > 5).any():
    red_df_vif = df_vif.drop(df_vif.index[0])
    df = df[red_df_vif.index]
    df_vif = sklearn_vif(exogs=df.columns, data=df).sort_values(by='VIF', ascending=False)
print(df)
d c b
0 4 4 2
1 3 6 2
2 4 7 3
3 5 8 2
4 4 9 1
5 6 5 3
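The one-at-a-time elimination described in this answer can also be sketched without sklearn. The version below is an illustration only: it takes the VIFs from the diagonal of the inverse correlation matrix, uses the four-column data from the question rather than the data above, and the helper names vifs and drop_high_vif are hypothetical.

```python
import numpy as np

def vifs(X):
    """VIFs from the diagonal of the inverse correlation matrix."""
    return np.diag(np.linalg.inv(np.corrcoef(X, rowvar=False)))

def drop_high_vif(X, names, thresh=5.0):
    """Hypothetical helper: repeatedly drop the single column with the largest VIF."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        v = vifs(X)
        worst = int(np.argmax(v))
        if v[worst] <= thresh:
            break  # every remaining VIF is at or below the threshold
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return X, names

data = np.column_stack([
    [1, 1, 2, 3, 4],   # a
    [2, 2, 3, 2, 1],   # b
    [4, 6, 7, 8, 9],   # c
    [4, 3, 4, 5, 4],   # d
])
X_kept, kept = drop_high_vif(data, ['a', 'b', 'c', 'd'], thresh=5.0)
print(kept)
print(vifs(X_kept))
```

On this data both a and c start above the threshold, so dropping everything above 5 in a single pass would remove both. Recomputing the VIFs after each removal drops only a, because c is no longer collinear with the remaining columns once a is gone.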
Answer by Max Alonzo
Here is code using a pandas DataFrame:
To create the data
import numpy as np
a = [1, 1, 2, 3, 4]
b = [2, 2, 3, 2, 1]
c = [4, 6, 7, 8, 9]
d = [4, 3, 4, 5, 4]
To create the dataframe
import pandas as pd
data = pd.DataFrame()
data["a"] = a
data["b"] = b
data["c"] = c
data["d"] = d
Calculate VIF
cc = np.corrcoef(data, rowvar=False)
VIF = np.linalg.inv(cc)
VIF.diagonal()
Result
array([22.95, 3. , 12.95, 3. ])