How to calculate the correlation between all columns and remove highly correlated ones using Python or pandas

Question

Asked by jax

I have a huge data set, and prior to machine learning modeling it is always suggested that you first remove highly correlated descriptors (columns). How can I calculate the column-wise correlation and remove columns above a threshold value, say all columns or descriptors with a correlation > 0.8? The reduced data should also retain the headers.

Example data set

 GA      PN       PC     MBP      GR     AP   
0.033   6.652   6.681   0.194   0.874   3.177    
0.034   9.039   6.224   0.194   1.137   3.4      
0.035   10.936  10.304  1.015   0.911   4.9      
0.022   10.11   9.603   1.374   0.848   4.566    
0.035   2.963   17.156  0.599   0.823   9.406    
0.033   10.872  10.244  1.015   0.574   4.871     
0.035   21.694  22.389  1.015   0.859   9.259     
0.035   10.936  10.304  1.015   0.911   4.5

Please help....

Answer 1

Answered by Jamie Bull

Firstly, I'd suggest using something like PCA as a dimensionality reduction method, but if you have to roll your own then your question is insufficiently constrained. Where two columns are correlated, which one do you want to remove? What if column A is correlated with column B, and column B is correlated with column C, but column C is not correlated with column A?
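As a rough illustration of the PCA suggestion, here is a minimal sketch that uses only NumPy's SVD. The function name pca_reduce, the comp1/comp2 column labels, and the decision to standardise each column first are assumptions made for this example and do not come from the answer. It also assumes every column is numeric and non-constant.

```python
import numpy as np
import pandas as pd


def pca_reduce(df, n_components):
    """Project the rows of df onto the first n_components principal axes."""
    X = df.to_numpy(dtype=float)
    # Standardise so that columns on very different scales contribute equally
    # (assumes no column is constant, otherwise the std is zero).
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    scores = X @ vt[:n_components].T
    labels = [f"comp{i + 1}" for i in range(n_components)]
    return pd.DataFrame(scores, index=df.index, columns=labels)
```

Note that the resulting components are mutually uncorrelated, but they are linear combinations of the original columns, so the original headers are not retained. If keeping the headers matters, a column-dropping approach is the better fit.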

You can get a pairwise matrix of correlations by calling DataFrame.corr() (see the pandas docs), which might help you develop your algorithm, but eventually you need to convert that into a list of columns to keep.
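For a concrete starting point, the sketch below loads the example data set from the question into a DataFrame and calls DataFrame.corr() on it. The variable names df and corr_matrix are only illustrative choices.

```python
import pandas as pd

# Example data set copied from the question
df = pd.DataFrame({
    "GA":  [0.033, 0.034, 0.035, 0.022, 0.035, 0.033, 0.035, 0.035],
    "PN":  [6.652, 9.039, 10.936, 10.11, 2.963, 10.872, 21.694, 10.936],
    "PC":  [6.681, 6.224, 10.304, 9.603, 17.156, 10.244, 22.389, 10.304],
    "MBP": [0.194, 0.194, 1.015, 1.374, 0.599, 1.015, 1.015, 1.015],
    "GR":  [0.874, 1.137, 0.911, 0.848, 0.823, 0.574, 0.859, 0.911],
    "AP":  [3.177, 3.4, 4.9, 4.566, 9.406, 4.871, 9.259, 4.5],
})

# Pairwise Pearson correlations; rows and columns are both the column names
corr_matrix = df.corr()
print(corr_matrix.round(2))

# A single pair can be looked up by label
print(corr_matrix.loc["PC", "AP"])
```

The matrix is symmetric with ones on the diagonal, so only the entries above (or below) the diagonal need to be inspected when deciding which columns to keep.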

Answer 2

Answered by TomDobbs

Plug your features dataframe into this function and set your correlation threshold. It drops columns automatically, but it also prints a diagnostic of the correlated columns in case you would rather drop them manually.

def corr_df(x, corr_val):
    '''
    Obj: Drops features that are strongly correlated to other features.
          This lowers model complexity, and aids in generalizing the model.
    Inputs:
          df: features df (x)
          corr_val: Columns are dropped relative to the corr_val input (e.g. 0.8)
    Output: df that only includes uncorrelated features
    '''

    # Creates Correlation Matrix and Instantiates
    corr_matrix = x.corr()
    iters = range(len(corr_matrix.columns) - 1)
    drop_cols = []

    # Iterates through Correlation Matrix Table to find correlated columns
    for i in iters:
        for j in range(i):
            item = corr_matrix.iloc[j:(j+1), (i+1):(i+2)]
            col = item.columns
            row = item.index
            val = item.values
            if val >= corr_val:
                # Prints the correlated feature set and the corr val
                print(col.values[0], "|", row.values[0], "|", round(val[0][0], 2))
                drop_cols.append(i)

    drops = sorted(set(drop_cols))[::-1]

    # Drops the correlated columns; reassigning x keeps every drop
    # (and x is returned unchanged when nothing exceeds the threshold)
    for i in drops:
        col = x.iloc[:, (i+1):(i+2)].columns.values
        x = x.drop(col, axis=1)

    return x

Answer 3

Answered by NISHA DAGA

Here is the approach I used:

def correlation(dataset, threshold):
    col_corr = set() # Set of all the names of deleted columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if (corr_matrix.iloc[i, j] >= threshold) and (corr_matrix.columns[j] not in col_corr):
                colname = corr_matrix.columns[i] # getting the name of column
                col_corr.add(colname)
                if colname in dataset.columns:
                    del dataset[colname] # deleting the column from the dataset

    print(dataset)
    return dataset  # the input frame is modified in place; it is also returned for convenience

Hope this helps!

Answer 4

Answered by Mojgan Mazouchi

You can use the following for a given data frame df:

import numpy as np

corr_matrix = df.corr().abs()
high_corr_var = np.where(corr_matrix > 0.8)
# x < y keeps each pair once and skips the diagonal
high_corr_var = [(corr_matrix.columns[x], corr_matrix.columns[y]) for x, y in zip(*high_corr_var) if x < y]
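The snippet above yields a list of column-name pairs. If it helps to see the correlation values alongside the pairs, one possible variation is sketched below. The function name high_corr_pairs and its threshold default are hypothetical, and the sketch assumes all columns are numeric.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df, threshold=0.8):
    """Return a Series of |correlation| values indexed by (column, column) pairs."""
    corr = df.corr().abs()
    # Boolean mask that is True strictly above the diagonal
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    # Entries outside the mask become NaN; stack() flattens to a pair-indexed Series
    pairs = corr.where(mask).stack()
    # NaN > threshold is False, so any remaining NaN entries are filtered out here
    pairs = pairs[pairs > threshold]
    return pairs.sort_values(ascending=False)
```

Calling high_corr_pairs(df, 0.8) gives the strongest pairs first, which can make it easier to decide by hand which column of each pair to remove.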

Answer 5

Answered by azuber

I took the liberty of modifying TomDobbs' answer. The bug reported in the comments is now fixed. The new function also filters out strong negative correlations.

def corr_df(x, corr_val):
    '''
    Obj: Drops features that are strongly correlated to other features.
          This lowers model complexity, and aids in generalizing the model.
    Inputs:
          df: features df (x)
          corr_val: Columns are dropped relative to the corr_val input (e.g. 0.8)
    Output: df that only includes uncorrelated features
    '''

    # Creates Correlation Matrix and Instantiates
    corr_matrix = x.corr()
    iters = range(len(corr_matrix.columns) - 1)
    drop_cols = []

    # Iterates through Correlation Matrix Table to find correlated columns
    for i in iters:
        for j in range(i):
            item = corr_matrix.iloc[j:(j+1), (i+1):(i+2)]
            col = item.columns
            row = item.index
            val = item.values
            if abs(val) >= corr_val:
                # Prints the correlated feature set and the corr val
                print(col.values[0], "|", row.values[0], "|", round(val[0][0], 2))
                drop_cols.append(i)

    drops = sorted(set(drop_cols))[::-1]

    # Drops the correlated columns
    for i in drops:
        col = x.iloc[:, (i+1):(i+2)].columns.values
        x = x.drop(col, axis=1)
    return x

Answer 6

Answered by Cherry Wu

The method here worked well for me, and it is only a few lines of code: https://chrisalbon.com/machine_learning/feature_selection/drop_highly_correlated_features/

import numpy as np

# Create correlation matrix
corr_matrix = df.corr().abs()

# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find features with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]

# Drop features 
df.drop(to_drop, axis=1, inplace=True)
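If you would rather not modify df in place, the same upper-triangle idea can be wrapped in a small function. This is only a sketch: the name drop_highly_correlated, the returned tuple, and the default threshold are assumptions for illustration, and it assumes all columns are numeric.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(df, threshold=0.95):
    """Return (reduced copy of df, list of dropped column names)."""
    corr = df.corr().abs()
    # Keep only the entries strictly above the diagonal; the rest become NaN
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # A column is dropped if it correlates strongly with any earlier column
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop
```

Usage would look like reduced, dropped = drop_highly_correlated(df, threshold=0.8). The original frame is left untouched, and the headers of the remaining columns are preserved.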

Answer 7

Answered by Ryan

A small revision to the solution posted by user3025698. It resolves an issue where the correlation between the first two columns was not captured, and it adds some data type checking.

import numpy as np
import pandas as pd

def filter_df_corr(inp_data, corr_val):
    '''
    Returns an array or dataframe (based on type(inp_data)) adjusted to drop
    columns with high correlation to one another. Takes second arg corr_val
    that defines the cutoff.

    ----------
    inp_data : np.array, pd.DataFrame
        Values to consider
    corr_val : float
        Value [0, 1] on which to base the correlation cutoff
    '''
    # Creates Correlation Matrix
    if isinstance(inp_data, np.ndarray):
        inp_data = pd.DataFrame(data=inp_data)
        array_flag = True
    else:
        array_flag = False
    corr_matrix = inp_data.corr()

    # Iterates through Correlation Matrix Table to find correlated columns
    drop_cols = []
    n_cols = len(corr_matrix.columns)

    for i in range(n_cols):
        for k in range(i+1, n_cols):
            val = corr_matrix.iloc[k, i]
            col = corr_matrix.columns[i]
            row = corr_matrix.index[k]
            if abs(val) >= corr_val:
                # Prints the correlated feature set and the corr val
                print(col, "|", row, "|", round(val, 2))
                drop_cols.append(col)

    # Drops the correlated columns (de-duplicated and passed as a list)
    drop_cols = list(set(drop_cols))
    inp_data = inp_data.drop(columns=drop_cols)
    # Return same type as inp
    if array_flag:
        return inp_data.values
    else:
        return inp_data

Answer 8

Answered by iPhoneDeveloper

Another effective way I found to inspect correlations is to use pandas profiling. Once you have your dataframe ready, just use:

import pandas_profiling as pp  # newer releases are published as ydata-profiling (import ydata_profiling)

your_df_report = pp.ProfileReport(your_df)
your_df_report.to_file("your_df_report.html")

The HTML report gives you a detailed exploratory data analysis (EDA) of your data frame, including the correlations between different features. It will also suggest dropping columns with high correlation.

Answer 9

Answered by Celso

This is the approach I used at my job last month. Perhaps it is not the best or quickest way, but it works fine. Here, df is my original pandas dataframe:

dropvars = []
threshold = 0.95
df_corr = df.corr().stack().reset_index().rename(columns={'level_0': 'Var 1', 'level_1': 'Var 2', 0: 'Corr'})
df_corr = df_corr[(df_corr['Corr'].abs() >= threshold) & (df_corr['Var 1'] != df_corr['Var 2'])]
while len(df_corr) > 0:
    var = df_corr['Var 1'].iloc[0]
    df_corr = df_corr[((df_corr['Var 1'] != var) & (df_corr['Var 2'] != var))]
    dropvars.append(var)
df.drop(columns=dropvars, inplace=True)

My idea is as follows: first, I create a dataframe containing the columns Var 1, Var 2 and Corr, where I keep only those pairs of variables whose correlation is higher than or equal to my threshold (in absolute value). Then, I iteratively choose the first variable (the Var 1 value) in this correlations dataframe, add it to the dropvars list, and remove all rows of the correlations dataframe where it appears, until the correlations dataframe is empty. In the end, I remove the columns in my dropvars list from my original dataframe.

Answer 10

Answered by b-shields

I had a similar question today and came across this post. This is what I ended up with.

def uncorrelated_features(df, threshold=0.7):
    """
    Returns a subset of df columns with Pearson correlations
    below threshold.
    """

    corr = df.corr().abs()
    keep = []
    for i in range(len(corr.iloc[:,0])):
        above = corr.iloc[:i,i]
        if len(keep) > 0: above = above[keep]
        if len(above[above < threshold]) == len(above):
            keep.append(corr.columns.values[i])

    return df[keep]
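To see how this kind of greedy keep rule behaves in the chained case raised in the first answer (A correlated with B, B correlated with C, but A not correlated with C), here is a small comparison on hypothetical toy data. The helper names greedy_keep and upper_triangle_drop are illustrative only. greedy_keep restates the keep rule of the function above in compact form, and upper_triangle_drop follows the upper-triangle rule from the chrisalbon.com snippet.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: B = A + C, so B correlates with both A and C
# (about 0.71 each), while A and C are uncorrelated with each other.
df = pd.DataFrame({
    "A": [1.0, 1.0, -1.0, -1.0],
    "B": [2.0, 0.0, 0.0, -2.0],
    "C": [1.0, -1.0, 1.0, -1.0],
})


def greedy_keep(df, threshold):
    """Keep a column only if it is weakly correlated with every column kept so far."""
    corr = df.corr().abs()
    keep = []
    for col in corr.columns:
        if all(corr.loc[col, kept] < threshold for kept in keep):
            keep.append(col)
    return keep


def upper_triangle_drop(df, threshold):
    """Drop a column if it is strongly correlated with any earlier column."""
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]


print("greedy keeps:", greedy_keep(df, 0.6))
print("upper-triangle drops:", upper_triangle_drop(df, 0.6))
```

With this data the greedy rule keeps A and C, because C is only compared against columns that were actually kept. The upper-triangle rule drops both B and C and leaves only A, because C is also compared against B even though B is being removed. Which behaviour is preferable depends on how aggressively you want to prune.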
