Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the original URL, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/48223443/
How to run a multicollinearity test on a pandas dataframe?
Asked by Aakash Basu
I am comparatively new to Python, statistics, and data science libraries. I need to run a multicollinearity test on a dataset with n columns and make sure that any column/variable with VIF > 5 is dropped altogether.
I found the following code:
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calculate_vif_(X, thresh=5.0):
    variables = range(X.shape[1])
    tmp = range(X[variables].shape[1])
    print(tmp)
    dropped=True
    while dropped:
        dropped=False
        vif = [variance_inflation_factor(X[variables].values, ix) for ix in range(X[variables].shape[1])]

        maxloc = vif.index(max(vif))
        if max(vif) > thresh:
            print('dropping \'' + X[variables].columns[maxloc] + '\' at index: ' + str(maxloc))
            del variables[maxloc]
            dropped=True

    print('Remaining variables:')
    print(X.columns[variables])
    return X[variables]
However, I don't clearly understand: should I pass the entire dataset as the X argument? If so, it does not work.
Please help!
Answered by Aakash Basu
I tweaked the code and got the desired result with the following version, which adds a little exception handling:
from statsmodels.stats.outliers_influence import variance_inflation_factor

def multicollinearity_check(X, thresh=5.0):
    # Count the numeric columns; the VIF calculation needs every column to be numeric.
    int_cols = \
        X.select_dtypes(include=['int', 'int16', 'int32', 'int64', 'float', 'float16', 'float32', 'float64']).shape[1]
    total_cols = X.shape[1]
    try:
        if int_cols != total_cols:
            raise Exception('All the columns should be integer or float, for multicollinearity test.')
        else:
            variables = list(range(X.shape[1]))
            dropped = True
            print('''\n\nThe VIF calculator will now iterate through the features and calculate their respective values.
It shall continue dropping the highest VIF features until all the features have VIF less than the threshold of 5.\n\n''')
            while dropped:
                dropped = False
                vif = [variance_inflation_factor(X.iloc[:, variables].values, ix) for ix in variables]
                print('\n\nvif is: ', vif)
                maxloc = vif.index(max(vif))
                if max(vif) > thresh:
                    print('dropping \'' + X.iloc[:, variables].columns[maxloc] + '\' at index: ' + str(maxloc))
                    # Drop the column from X itself (in place), then rebuild the positional index list.
                    X.drop(X.columns[variables[maxloc]], axis=1, inplace=True)
                    variables = list(range(X.shape[1]))
                    dropped = True
            print('\n\nRemaining variables:\n')
            print(X.columns[variables])
            return X
    except Exception as e:
        # Note: on error the function prints the message and returns None.
        print('Error caught: ', e)
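For readers unsure what the number being thresholded means: the VIF of column i is 1 / (1 - R_i^2), where R_i^2 comes from regressing column i on the other columns. The sketch below computes it by hand with NumPy. The name vif_by_hand and the synthetic columns are hypothetical, chosen for illustration, and are not part of statsmodels. The sketch adds an intercept to the regression, whereas statsmodels' variance_inflation_factor, as far as I know, fits on exactly the columns it is given, so the two can disagree unless the data includes a constant column.

```python
import numpy as np

def vif_by_hand(values, idx):
    """VIF of column `idx`: 1 / (1 - R^2) from regressing it on the other columns."""
    y = values[:, idx]
    others = np.delete(values, idx, axis=1)
    # Assumption: include an intercept term in the auxiliary regression.
    design = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    # 1 / (1 - R^2) with R^2 = 1 - ss_res / ss_tot simplifies to ss_tot / ss_res.
    return float('inf') if ss_res == 0.0 else ss_tot / ss_res

# Synthetic data: `c` is almost an exact sum of `a` and `b`; `d` is unrelated.
rng = np.random.default_rng(0)
a, b, d = rng.normal(size=(3, 500))
c = a + b + 0.1 * rng.normal(size=500)
data = np.column_stack([a, b, c, d])

vifs = [vif_by_hand(data, i) for i in range(data.shape[1])]
print(vifs)  # a, b and c should come out far above 5; d should be close to 1
```

A VIF near 1 means the column is barely explained by the others, which is why the unrelated column survives a threshold of 5 while the near-duplicate information does not.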
Answered by DanSan
I also had issues running something similar. I fixed it by changing how the variables list was defined and finding another way of deleting its elements.
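One likely reason the question's code fails on Python 3 (an assumption, since the original error message is not shown) is that range(...) returns an immutable range object, so del variables[maxloc] raises a TypeError. A minimal sketch of the problem and of a list-based fix:

```python
# In Python 3, range() returns an immutable sequence, not a list.
variables = range(3)
try:
    del variables[1]          # what the question's code attempts
    deletion_failed = False
except TypeError:
    deletion_failed = True    # range objects do not support item deletion

# Converting to a list first makes the entries removable.
variables = list(range(3))
variables.pop(1)              # removes the element at position 1
print(deletion_failed, variables)
```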
The following script should work with Anaconda 5.0.1 and Python 3.6 (the latest versions as of this writing).
import numpy as np
import pandas as pd
import time
from statsmodels.stats.outliers_influence import variance_inflation_factor
from joblib import Parallel, delayed

# Defining the function that you will run later
def calculate_vif_(X, thresh=5.0):
    variables = [X.columns[i] for i in range(X.shape[1])]
    dropped=True
    while dropped:
        dropped=False
        print(len(variables))
        vif = Parallel(n_jobs=-1,verbose=5)(delayed(variance_inflation_factor)(X[variables].values, ix) for ix in range(len(variables)))

        maxloc = vif.index(max(vif))
        if max(vif) > thresh:
            print(time.ctime() + ' dropping \'' + X[variables].columns[maxloc] + '\' at index: ' + str(maxloc))
            variables.pop(maxloc)
            dropped=True

    print('Remaining variables:')
    print([variables])
    return X[[i for i in variables]]

X = df[feature_list] # Selecting your data
X2 = calculate_vif_(X,5) # Actually running the function
If you have many features, it will take a very long time to run, so I made another change to have it work in parallel in case you have multiple CPUs available.
Enjoy!
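If speed is the main concern, one alternative worth considering (a swapped-in technique, not what the answer above does) is to read all VIFs at once from the diagonal of the inverse correlation matrix, which equals the intercept-included VIF of each column. The sketch below wraps that in the same drop-the-worst loop and returns a new DataFrame instead of modifying the input. The function name drop_high_vif and the example columns are hypothetical, and the values may differ from statsmodels' output when the data has no constant column.

```python
import numpy as np
import pandas as pd

def drop_high_vif(frame, thresh=5.0):
    """Repeatedly drop the column with the largest VIF until all are <= thresh."""
    kept = list(frame.columns)
    while len(kept) > 1:
        corr = np.corrcoef(frame[kept].to_numpy(dtype=float), rowvar=False)
        # Diagonal of the inverse correlation matrix = VIF of each column.
        # Caveat: exactly collinear columns make this matrix singular, so
        # np.linalg.inv may raise LinAlgError; this sketch does not handle that.
        vifs = pd.Series(np.diag(np.linalg.inv(corr)), index=kept)
        worst = vifs.idxmax()
        if vifs[worst] <= thresh:
            break
        kept.remove(worst)
    return frame[kept].copy()

# Synthetic example: `c` is nearly the sum of `a` and `b`; `d` is unrelated.
rng = np.random.default_rng(0)
example = pd.DataFrame({
    "a": rng.normal(size=500),
    "b": rng.normal(size=500),
    "d": rng.normal(size=500),
})
example["c"] = example["a"] + example["b"] + 0.1 * rng.normal(size=500)

reduced = drop_high_vif(example, thresh=5.0)
print(list(reduced.columns))  # one of the collinear columns is expected to be gone
```

Because the function returns a copy, the original DataFrame keeps all of its columns, and each pass needs one matrix inversion rather than one regression per column.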