
Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/48223443/

Time: 2020-09-14 05:02:57  Source: igfitidea

How to run a multicollinearity test on a pandas dataframe?

pandas python-3.6 statsmodels

Asked by Aakash Basu

I am comparatively new to Python, statistics, and data science libraries. I need to run a multicollinearity test on a dataset with n columns and make sure that every column/variable with VIF > 5 is dropped.

I found the following code:

from statsmodels.stats.outliers_influence import variance_inflation_factor

def calculate_vif_(X, thresh=5.0):

    variables = range(X.shape[1])
    tmp = range(X[variables].shape[1])
    print(tmp)
    dropped=True
    while dropped:
        dropped=False
        vif = [variance_inflation_factor(X[variables].values, ix) for ix in range(X[variables].shape[1])]

        maxloc = vif.index(max(vif))
        if max(vif) > thresh:
            print('dropping \'' + X[variables].columns[maxloc] + '\' at index: ' + str(maxloc))
            del variables[maxloc]
            dropped=True

    print('Remaining variables:')
    print(X.columns[variables])
    return X[variables]

However, I do not clearly understand how to call it: should I pass the whole dataset as the X argument? If so, it does not work.

Please help!

Answered by Aakash Basu

I tweaked the code and got the desired result with the following version, which adds a little exception handling:

from statsmodels.stats.outliers_influence import variance_inflation_factor


def multicollinearity_check(X, thresh=5.0):
    # VIF is only defined for numeric data, so count the numeric columns first
    numeric_cols = \
    X.select_dtypes(include=['int', 'int16', 'int32', 'int64', 'float', 'float16', 'float32', 'float64']).shape[1]
    total_cols = X.shape[1]
    try:
        if numeric_cols != total_cols:
            raise Exception('All the columns should be integer or float, for multicollinearity test.')
        else:
            variables = list(range(X.shape[1]))
            dropped = True
            print('\n\nThe VIF calculator will now iterate through the features and calculate their respective values.\n'
                  'It shall continue dropping the highest VIF features until all the features have VIF less than '
                  'the threshold of ' + str(thresh) + '.\n\n')
            while dropped:
                dropped = False
                vif = [variance_inflation_factor(X.iloc[:, variables].values, ix) for ix in variables]
                print('\n\nvif is: ', vif)
                maxloc = vif.index(max(vif))
                if max(vif) > thresh:
                    print('dropping \'' + X.columns[maxloc] + '\' at index: ' + str(maxloc))
                    # Drop the column from X itself (in place), then rebuild the positions
                    X.drop(columns=X.columns[maxloc], inplace=True)
                    variables = list(range(X.shape[1]))
                    dropped = True

            print('\n\nRemaining variables:\n')
            print(X.columns[variables])
            return X
    except Exception as e:
        # Any failure is reported here and the function then returns None
        print('Error caught: ', e)

Answered by DanSan

I also had issues running something similar. I fixed it by changing how variables was defined and finding another way of deleting its elements.
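
To illustrate the likely underlying problem (a minimal sketch, not part of the original script): in Python 3, range() returns an immutable sequence, so a statement such as del variables[maxloc] raises a TypeError. A plain list supports deletion. The column names below are made up.

```python
# In Python 3, a range object is an immutable sequence
variables = range(4)
try:
    del variables[1]
    deletion_failed = False
except TypeError as exc:
    deletion_failed = True
    print('cannot delete from a range:', exc)

# A list supports deletion; hypothetical column names for illustration
columns = ['a', 'b', 'c', 'd']
columns.pop(1)  # removes 'b'
print(columns)
```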

The following script should work with Anaconda 5.0.1 and Python 3.6 (the latest version as of this writing).


import time

from joblib import Parallel, delayed
from statsmodels.stats.outliers_influence import variance_inflation_factor


# Defining the function that you will run later
def calculate_vif_(X, thresh=5.0):
    # Track the columns by name in a plain list, so that elements can be removed
    variables = list(X.columns)
    dropped = True
    while dropped:
        dropped = False
        print(len(variables))
        # Compute the VIF of every remaining column, one parallel job per column
        vif = Parallel(n_jobs=-1, verbose=5)(
            delayed(variance_inflation_factor)(X[variables].values, ix)
            for ix in range(len(variables)))

        maxloc = vif.index(max(vif))
        if max(vif) > thresh:
            print(time.ctime() + ' dropping \'' + variables[maxloc] + '\' at index: ' + str(maxloc))
            variables.pop(maxloc)
            dropped = True

    print('Remaining variables:')
    print(variables)
    return X[variables]


# df and feature_list stand for your own DataFrame and list of column names
X = df[feature_list]  # Selecting your data

X2 = calculate_vif_(X, 5)  # Actually running the function

If you have many features, this takes a very long time to run, so I made one more change: the VIF calculations run in parallel, in case you have multiple CPUs available.
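
If runtime is the main concern, one alternative to parallelising is to compute all VIFs in a single step. This is a different technique from the script above, offered only as a sketch: for regressions that include an intercept, the VIF of column j equals the j-th diagonal entry of the inverse of the predictors' correlation matrix. The values can differ from what variance_inflation_factor returns on raw data without a constant column, and exactly collinear or constant columns make the matrix singular or undefined. The helper names vif_from_corr and drop_high_vif and the synthetic columns are made up.

```python
import numpy as np
import pandas as pd


def vif_from_corr(X):
    """VIF of every column at once, from the inverse correlation matrix."""
    corr = np.corrcoef(X.values, rowvar=False)
    return pd.Series(np.diag(np.linalg.inv(corr)), index=X.columns)


def drop_high_vif(X, thresh=5.0):
    """Repeatedly drop the column with the largest VIF above thresh."""
    while X.shape[1] > 1:
        vif = vif_from_corr(X)
        if vif.max() <= thresh:
            break
        # drop() returns a new DataFrame, so the caller's data is untouched
        X = X.drop(columns=vif.idxmax())
    return X


# Synthetic data: x3 is almost a copy of x1, x2 is unrelated noise
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.05 * rng.normal(size=200)
demo = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})

kept = drop_high_vif(demo, thresh=5.0)
print(list(kept.columns))
```

Each pass costs one matrix inversion instead of one regression per column, which is where the saving would come from on wide datasets.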

Enjoy!
