Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the original URL and authors, and attribute the content to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/17778394/

Time: 2020-08-19 09:10:58  Source: igfitidea

List Highest Correlation Pairs from a Large Correlation Matrix in Pandas?

Tags: python, pandas, correlation

Asked by Kyle Brandt

How do you find the top correlations in a correlation matrix with Pandas? There are many answers on how to do this with R ("Show correlations as an ordered list, not as a large matrix" or "Efficient way to get highly correlated pairs from large data set in Python or R"), but I am wondering how to do it with pandas. In my case the matrix is 4460x4460, so I can't do it visually.

Accepted answer by HYRY

You can use DataFrame.values to get a NumPy array of the data and then use NumPy functions such as argsort() to get the most correlated pairs.
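The answer does not show the NumPy route, so here is a minimal sketch of what it might look like. The function name `top_pairs_numpy`, the seed, and the small demo frame are assumptions made for illustration; they are not part of the original answer.

```python
import numpy as np
import pandas as pd


def top_pairs_numpy(df, n=5):
    """Return the n column pairs with the largest absolute correlation."""
    corr = df.corr().abs().to_numpy()
    # Indices of the strict upper triangle: skips the diagonal and mirrored pairs.
    rows, cols = np.triu_indices_from(corr, k=1)
    values = corr[rows, cols]
    # argsort is ascending, so reverse it and keep the first n positions.
    order = np.argsort(values)[::-1][:n]
    labels = df.columns
    return [(labels[rows[i]], labels[cols[i]], values[i]) for i in order]


# Hypothetical demo data: 20 random columns, with column 3 tied to column 7.
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 20))
data[:, 3] += data[:, 7]
print(top_pairs_numpy(pd.DataFrame(data), n=3))
```

Note that argsort places NaN last, so after the reversal any NaN correlations (for example from constant columns) would come first; filter them out beforehand if your data can produce them.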

But if you want to do this in pandas, you can unstack and sort the DataFrame:

import pandas as pd
import numpy as np

shape = (50, 4460)
data = np.random.normal(size=shape)
data[:, 1000] += data[:, 2000]  # make columns 1000 and 2000 correlated

df = pd.DataFrame(data)
c = df.corr().abs()

s = c.unstack()
so = s.sort_values(kind="quicksort")

# the last 4460 entries are the self-correlations on the diagonal, so skip them
print(so[-4470:-4460])

Here is the output:


2192  1522    0.636198
1522  2192    0.636198
3677  2027    0.641817
2027  3677    0.641817
242   130     0.646760
130   242     0.646760
1171  2733    0.670048
2733  1171    0.670048
1000  2000    0.742340
2000  1000    0.742340
dtype: float64

Answer by arun

@HYRY's answer is perfect. This builds on it by adding a bit more logic to avoid duplicate and self correlations, and to sort properly:

import pandas as pd
d = {'x1': [1, 4, 4, 5, 6], 
     'x2': [0, 0, 8, 2, 4], 
     'x3': [2, 8, 8, 10, 12], 
     'x4': [-1, -4, -4, -4, -5]}
df = pd.DataFrame(data = d)
print("Data Frame")
print(df)
print()

print("Correlation Matrix")
print(df.corr())
print()

def get_redundant_pairs(df):
    '''Get diagonal and lower triangular pairs of correlation matrix'''
    pairs_to_drop = set()
    cols = df.columns
    for i in range(0, df.shape[1]):
        for j in range(0, i+1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop

def get_top_abs_correlations(df, n=5):
    au_corr = df.corr().abs().unstack()
    labels_to_drop = get_redundant_pairs(df)
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
    return au_corr[0:n]

print("Top Absolute Correlations")
print(get_top_abs_correlations(df, 3))

That gives the following output:


Data Frame
   x1  x2  x3  x4
0   1   0   2  -1
1   4   0   8  -4
2   4   8   8  -4
3   5   2  10  -4
4   6   4  12  -5

Correlation Matrix
          x1        x2        x3        x4
x1  1.000000  0.399298  1.000000 -0.969248
x2  0.399298  1.000000  0.399298 -0.472866
x3  1.000000  0.399298  1.000000 -0.969248
x4 -0.969248 -0.472866 -0.969248  1.000000

Top Absolute Correlations
x1  x3    1.000000
x3  x4    0.969248
x1  x4    0.969248
dtype: float64

Answer by MiFi

A solution in a few lines, without redundant pairs of variables:

import numpy as np

corr_matrix = df.corr().abs()

# the matrix is symmetric, so extract the upper triangle without the diagonal (k=1)
# (np.bool was removed in NumPy 1.24; the builtin bool works in its place)
sol = (corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
                  .stack()
                  .sort_values(ascending=False))
# the first element of sol is the pair with the biggest correlation

Answer by Frederik Meinertsen

Use itertools.combinations to get all unique correlations from pandas' own correlation matrix, .corr(), build a list of lists, and feed it back into a DataFrame so that you can use .sort_values. Set ascending=True to display the lowest correlations on top.

corrank takes a DataFrame as its argument because it requires .corr().

import itertools
import pandas as pd

def corrank(X):
    corr = X.corr()  # compute the correlation matrix once, not once per pair
    pairs = itertools.combinations(corr.columns, 2)
    df = pd.DataFrame([[(i, j), corr.loc[i, j]] for i, j in pairs],
                      columns=['pairs', 'corr'])
    print(df.sort_values(by='corr', ascending=False))

corrank(X)  # prints a descending list of correlation pairs (max on top)

Answer by prashanth

Use the code below to view the correlations in descending order.

# See the correlations in descending order
# (the full matrix is unstacked, so self-correlations and mirrored pairs are included)

corr = df.corr() # df is the pandas dataframe
c1 = corr.abs().unstack()
print(c1.sort_values(ascending = False))

Answer by Addison Klinke

Combining some features of @HYRY's and @arun's answers, you can print the top correlations for dataframe df in a single line using:

df.corr().unstack().sort_values().drop_duplicates()

Note: the one downside is that if you have 1.0 correlations that are not a variable with itself, the drop_duplicates() addition would remove them.
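To make that downside concrete, here is a small sketch using a hypothetical, hand-written correlation matrix (the labels `a`, `b`, `c` and the values are invented for illustration). It compares the one-liner with a positional upper-triangle mask, which is offered as one possible alternative rather than as part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical correlation matrix: a and b are perfectly correlated.
labels = ["a", "b", "c"]
corr = pd.DataFrame(
    [[1.0, 1.0, 0.5],
     [1.0, 1.0, 0.5],
     [0.5, 0.5, 1.0]],
    index=labels, columns=labels,
)

# drop_duplicates keeps one row per distinct value, so all 1.0 entries
# (the diagonal and the a/b pair) collapse into a single row.
deduped = corr.unstack().sort_values().drop_duplicates()
print(deduped)

# Masking the strict upper triangle removes self and mirrored pairs by position,
# so pairs that happen to share a value are all kept.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)
print(pairs)
```

The same collapse affects any two unrelated pairs that happen to share exactly the same correlation value, not only values of 1.0.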

Answer by Rich Wandell

Lots of good answers here. The easiest way I found was a combination of some of the answers above.

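import numpy as np

corr = df.corr()  # assumption: df is your DataFrame; the original snippet started from an existing corr
corr = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
# unstack() returns a Series, so sort by value directly and drop the masked (NaN) entries
corr = corr.unstack().sort_values(ascending=False).dropna()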

Answer by falsarella

I didn't want to unstack or over-complicate this issue, since I just wanted to drop some highly correlated features as part of a feature selection phase.

So I ended up with the following simplified solution:

# map features to their absolute correlation values
corr = features.corr().abs()

# set equality (self correlation) as zero
corr[corr == 1] = 0

# for each feature, find the max correlation
# and sort the resulting Series in descending order
corr_cols = corr.max().sort_values(ascending=False)

# display the highly correlated features
display(corr_cols[corr_cols > 0.8])

In this case, if you want to drop correlated features, you may map through the filtered corr_cols array and remove the odd-indexed (or even-indexed) ones.
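For the dropping step itself, one possible approach is sketched below. It is a different technique from the odd/even indexing suggested above: it masks the upper triangle and drops the later column of each highly correlated pair. The function name `drop_correlated` and the default threshold are assumptions for illustration; the sample data is borrowed from @arun's answer.

```python
import numpy as np
import pandas as pd


def drop_correlated(features, threshold=0.8):
    """Drop the later column of every pair whose |correlation| exceeds threshold."""
    corr = features.corr().abs()
    # Keep only the strict upper triangle so each pair is looked at once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # A column is dropped if it is highly correlated with any earlier column.
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop), to_drop


features = pd.DataFrame({
    "x1": [1, 4, 4, 5, 6],
    "x2": [0, 0, 8, 2, 4],
    "x3": [2, 8, 8, 10, 12],
    "x4": [-1, -4, -4, -4, -5],
})
reduced, dropped = drop_correlated(features)
print(dropped)
print(reduced.columns.tolist())
```

Which column of a pair gets dropped depends only on column order here, so reorder the columns first if you would rather keep a particular feature.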

Answer by KIC

I tried some of the solutions here, but then came up with my own. I hope it is useful for the next person, so I am sharing it here:

def sort_correlation_matrix(correlation_matrix):
    cor = correlation_matrix.abs()
    top_col = cor[cor.columns[0]][1:]
    top_col = top_col.sort_values(ascending=False)
    ordered_columns = [cor.columns[0]] + top_col.index.tolist()
    return correlation_matrix[ordered_columns].reindex(ordered_columns)

Answer by Aibloy

This is an improved version of @MiFi's code. It orders the pairs by absolute value but keeps the sign of negative correlations.

import numpy as np
import pandas as pd

def top_correlation(df, n):
    corr_matrix = df.corr()
    correlation = (corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
                   .stack()
                   .sort_values(ascending=False))
    correlation = pd.DataFrame(correlation).reset_index()
    correlation.columns = ["Variable_1", "Variable_2", "Correlacion"]
    # reorder the rows by absolute correlation while keeping the signed values
    correlation = correlation.reindex(correlation.Correlacion.abs().sort_values(ascending=False).index).reset_index().drop(["index"], axis=1)
    return correlation.head(n)

top_correlation(ANYDATA, 10)