pandas: How to run a multicollinearity test on a pandas DataFrame?

Question

Asked by Aakash Basu

I am comparatively new to Python, statistics, and data-science libraries. I need to run a multicollinearity test on a dataset with n columns and drop every column/variable whose VIF is greater than 5.

I found the following code:

from statsmodels.stats.outliers_influence import variance_inflation_factor

def calculate_vif_(X, thresh=5.0):
    variables = range(X.shape[1])
    tmp = range(X[variables].shape[1])
    print(tmp)
    dropped=True
    while dropped:
        dropped=False
        vif = [variance_inflation_factor(X[variables].values, ix) for ix in range(X[variables].shape[1])]

        maxloc = vif.index(max(vif))
        if max(vif) > thresh:
            print('dropping \'' + X[variables].columns[maxloc] + '\' at index: ' + str(maxloc))
            del variables[maxloc]
            dropped=True

    print('Remaining variables:')
    print(X.columns[variables])
    return X[variables]

But I do not clearly understand: should I pass the whole dataset as the X argument? If so, it does not work.

Please help!

Answer 1

Answered by Aakash Basu

I tweaked the code and got the desired result with the version below, which adds a little exception handling:

from statsmodels.stats.outliers_influence import variance_inflation_factor

def multicollinearity_check(X, thresh=5.0):
    # Count the integer/float columns; VIF needs an all-numeric DataFrame.
    numeric_cols = \
    X.select_dtypes(include=['int', 'int16', 'int32', 'int64', 'float', 'float16', 'float32', 'float64']).shape[1]
    total_cols = X.shape[1]
    try:
        if numeric_cols != total_cols:
            raise Exception('All the columns should be integer or float, for multicollinearity test.')
        else:
            variables = list(range(X.shape[1]))
            dropped = True
            print('\n\nThe VIF calculator will now iterate through the features and calculate their respective values.')
            print(f'It shall continue dropping the highest VIF features until all the features have VIF at or below the threshold of {thresh}.\n\n')
            while dropped:
                dropped = False
                # One VIF per remaining column, in column order.
                vif = [variance_inflation_factor(X.iloc[:, variables].values, ix) for ix in variables]
                print('\n\nvif is: ', vif)
                maxloc = vif.index(max(vif))
                if max(vif) > thresh:
                    print(f"dropping '{X.columns[variables[maxloc]]}' at index: {maxloc}")
                    # Note: this drops the column from the caller's DataFrame in place.
                    # The axis must be passed by keyword; pandas 2.x rejects a positional axis.
                    X.drop(X.columns[variables[maxloc]], axis=1, inplace=True)
                    variables = list(range(X.shape[1]))
                    dropped = True

            print('\n\nRemaining variables:\n')
            print(X.columns[variables])
            return X
    except Exception as e:
        # On any error the message is printed and the function returns None.
        print('Error caught: ', e)
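
To illustrate the same drop-the-highest-VIF loop on data where the answer is known, here is a minimal sketch. It is not the code from this answer: the helper name drop_high_vif and the synthetic columns a, b, c are assumptions made for illustration, and unlike the function above it works on a copy so the caller's DataFrame is left untouched.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor


def drop_high_vif(X, thresh=5.0):
    """Hypothetical helper: return a copy of X without the high-VIF columns."""
    X = X.copy()  # leave the caller's DataFrame unchanged
    while X.shape[1] > 1:
        # One VIF per remaining column
        vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
        worst = int(np.argmax(vif))
        if vif[worst] <= thresh:
            break  # every remaining column is at or below the threshold
        X = X.drop(columns=X.columns[worst])
    return X


# Synthetic data (an assumption for illustration): c is almost exactly a + b.
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
df["c"] = df["a"] + df["b"] + rng.normal(scale=0.01, size=200)

reduced = drop_high_vif(df)
print(list(reduced.columns))
print(df.shape)  # the original frame still has all three columns
```

So yes, the whole (all-numeric) DataFrame goes in as X; the function decides which columns to keep.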

Answer 2

Answered by DanSan

I also had issues running something similar. I fixed it by changing how the variables list was defined and finding another way of deleting its elements.
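
A small sketch of why the definition matters, assuming Python 3: there, range() returns an immutable sequence, so the del variables[maxloc] line in the question's code raises a TypeError, while the same deletion on a list works.

```python
# range() objects are immutable in Python 3, so item deletion fails.
variables = range(4)
try:
    del variables[1]
    range_deletion_failed = False
except TypeError as err:
    range_deletion_failed = True
    print("range:", err)

# A list built from the same range supports deletion.
variables = list(range(4))
del variables[1]
print("list:", variables)
```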

The following script should work with Anaconda 5.0.1 and Python 3.6 (the latest version as of this writing).

import pandas as pd
import time
from statsmodels.stats.outliers_influence import variance_inflation_factor
from joblib import Parallel, delayed

# Defining the function that you will run later
def calculate_vif_(X, thresh=5.0):
    # Track the remaining columns by name, in a list so entries can be removed
    variables = list(X.columns)
    dropped=True
    while dropped:
        dropped=False
        print(len(variables))
        # Compute the VIF of every remaining column in parallel on all available CPUs
        vif = Parallel(n_jobs=-1,verbose=5)(delayed(variance_inflation_factor)(X[variables].values, ix) for ix in range(len(variables)))

        maxloc = vif.index(max(vif))
        if max(vif) > thresh:
            print(f"{time.ctime()} dropping '{variables[maxloc]}' at index: {maxloc}")
            variables.pop(maxloc)
            dropped=True

    print('Remaining variables:')
    print(variables)
    return X[variables]

# df and feature_list stand for your own DataFrame and list of column names
X = df[feature_list] # Selecting your data

X2 = calculate_vif_(X,5) # Actually running the function

With many features this takes a very long time to run, because every pass recomputes the VIF of each remaining column. So I made another change so that it runs in parallel when multiple CPUs are available.

Enjoy!
