pandas: Get combination of columns where correlation is high

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/26463714/

Time: 2020-09-13 22:35:36  Source: igfitidea

python numpy pandas

Asked by Peter

I have a data set with 6 columns, from which I let pandas calculate the correlation matrix, with the following result:

               age  earnings    height     hours  siblings    weight
age       1.000000  0.026032  0.040002  0.024118  0.155894  0.048655
earnings  0.026032  1.000000  0.276373  0.224283  0.126651  0.092299
height    0.040002  0.276373  1.000000  0.235616  0.077551  0.572538
hours     0.024118  0.224283  0.235616  1.000000  0.067797  0.143160
siblings  0.155894  0.126651  0.077551  0.067797  1.000000  0.018367
weight    0.048655  0.092299  0.572538  0.143160  0.018367  1.000000

How can I get the combinations of columns where the correlation is, for example, higher than 0.5, excluding each column paired with itself? In this case, the output should be something like:

[('height', 'weight')]

I tried to do it with for loops, but I think that's not the right/most efficient way:

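correlated = []
columns = df.columns  # assumption: the labels being looped over are the dataframe's columns
for column1 in columns:
    for column2 in columns:
        if column1 != column2:
            # Pairwise correlation between the two columns
            correlation = df[column1].corr(df[column2])
            # Skip the pair if its mirror image was already recorded
            if correlation > 0.5 and (column2, column1) not in correlated:
                correlated.append((column1, column2))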

Here, df is my original dataframe. This outputs the desired result:

[(u'height', u'weight')]
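
The nested loops above compute every correlation twice and depend on the `(column2, column1) not in correlated` check to drop mirrored pairs. A minimal sketch of one alternative, using `itertools.combinations` so that each unordered pair is visited only once. The small `df` below is made-up toy data, included only so the snippet runs on its own; it is not the data set from the question.

```python
from itertools import combinations

import pandas as pd

# Made-up toy data (an assumption for illustration, not the asker's data set)
df = pd.DataFrame({
    "height": [150, 160, 170, 180, 190],
    "weight": [52, 58, 69, 77, 91],
    "hours": [40, 35, 45, 38, 42],
})

# combinations(..., 2) yields each unordered column pair exactly once,
# so no self-pairs and no mirrored duplicates can appear
correlated = [
    (column1, column2)
    for column1, column2 in combinations(df.columns, 2)
    if df[column1].corr(df[column2]) > 0.5
]
print(correlated)
```

This still calls `corr` once per pair, so it is more a tidier version of the loop than a vectorized solution.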

Answered by Michael Brennan

How about the following, using numpy, and assuming you already have your correlation matrix in df:

import numpy as np

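# Row and column positions of every entry above the threshold
indices = np.where(df > 0.5)
# x < y keeps only the upper triangle, which drops the diagonal and mirrored pairs
indices = [(df.index[x], df.columns[y]) for x, y in zip(*indices)
                                        if x < y]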

This will result in indices containing:

[('height', 'weight')]
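
For a self-contained way to try this, here is a sketch that rebuilds the correlation matrix from the question by hand and applies the same idea with an explicit upper-triangle mask from `np.triu`. The function name `high_corr_pairs` and the variable `corr` are hypothetical names chosen for this example; they do not come from the original answer.

```python
import numpy as np
import pandas as pd

# Correlation matrix copied from the question
labels = ["age", "earnings", "height", "hours", "siblings", "weight"]
values = [
    [1.000000, 0.026032, 0.040002, 0.024118, 0.155894, 0.048655],
    [0.026032, 1.000000, 0.276373, 0.224283, 0.126651, 0.092299],
    [0.040002, 0.276373, 1.000000, 0.235616, 0.077551, 0.572538],
    [0.024118, 0.224283, 0.235616, 1.000000, 0.067797, 0.143160],
    [0.155894, 0.126651, 0.077551, 0.067797, 1.000000, 0.018367],
    [0.048655, 0.092299, 0.572538, 0.143160, 0.018367, 1.000000],
]
corr = pd.DataFrame(values, index=labels, columns=labels)


def high_corr_pairs(corr, threshold=0.5):
    """Return (row, column) label pairs above the diagonal whose value exceeds threshold."""
    # True strictly above the diagonal; k=1 excludes the diagonal itself
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    # Positions that are both above the diagonal and above the threshold
    rows, cols = np.where(upper & (corr.to_numpy() > threshold))
    return [(corr.index[r], corr.columns[c]) for r, c in zip(rows, cols)]


pairs = high_corr_pairs(corr)
print(pairs)
```

Like the answer's code, this only looks at values above the threshold, so strongly negative correlations are ignored; comparing against `corr.abs()` instead would be one way to include them.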