pandas: get combinations of columns with a high correlation

Question

Asked by Peter

I have a data set with 6 columns, from which I let pandas calculate the correlation matrix, with the following result:

               age  earnings    height     hours  siblings    weight
age       1.000000  0.026032  0.040002  0.024118  0.155894  0.048655
earnings  0.026032  1.000000  0.276373  0.224283  0.126651  0.092299
height    0.040002  0.276373  1.000000  0.235616  0.077551  0.572538
hours     0.024118  0.224283  0.235616  1.000000  0.067797  0.143160
siblings  0.155894  0.126651  0.077551  0.067797  1.000000  0.018367
weight    0.048655  0.092299  0.572538  0.143160  0.018367  1.000000

How can I get the combinations of columns whose correlation is, for example, higher than 0.5, excluding a column paired with itself? So in this case, the output needs to be something like:

[('height', 'weight')]

I tried to do it with for loops, but I think that's not the right/most efficient way:

correlated = []
for column1 in columns:
    for column2 in columns:
        if column1 != column2:
            correlation = df[column1].corr(df[column2])
            if correlation > 0.5 and (column2, column1) not in correlated:
                correlated.append((column1, column2))

Here df is my original dataframe and columns is its list of column names. This outputs the desired result:

[(u'height', u'weight')]

Answer 1

Answered by Michael Brennan

How about the following, using numpy and assuming df already holds your correlation matrix (for example, the result of calling .corr() on your original dataframe):

import numpy as np

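# np.where returns the row and column positions of every cell above the threshold
rows, cols = np.where(df > 0.5)
# x < y keeps only cells above the diagonal, which skips self-pairs and mirrored duplicates
indices = [(df.index[x], df.columns[y]) for x, y in zip(rows, cols) if x < y]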

This will result in indices containing:

[('height', 'weight')]
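
For anyone who wants to try this end to end, here is a minimal sketch that rebuilds the correlation matrix from the question and wraps the same idea in a function. The function name correlated_pairs and its threshold parameter are made up for illustration and are not part of pandas or numpy. It uses np.triu with k=1 to keep only the cells strictly above the diagonal, which plays the same role as the x < y check above.

```python
import numpy as np
import pandas as pd

# Correlation matrix copied from the question
labels = ["age", "earnings", "height", "hours", "siblings", "weight"]
values = [
    [1.000000, 0.026032, 0.040002, 0.024118, 0.155894, 0.048655],
    [0.026032, 1.000000, 0.276373, 0.224283, 0.126651, 0.092299],
    [0.040002, 0.276373, 1.000000, 0.235616, 0.077551, 0.572538],
    [0.024118, 0.224283, 0.235616, 1.000000, 0.067797, 0.143160],
    [0.155894, 0.126651, 0.077551, 0.067797, 1.000000, 0.018367],
    [0.048655, 0.092299, 0.572538, 0.143160, 0.018367, 1.000000],
]
corr = pd.DataFrame(values, index=labels, columns=labels)


def correlated_pairs(corr, threshold=0.5):
    """Hypothetical helper: label pairs whose correlation exceeds threshold."""
    # Boolean matrix of cells above the threshold
    above = corr.to_numpy() > threshold
    # k=1 zeroes out the diagonal and everything below it,
    # so each pair is reported once and no column is paired with itself
    rows, cols = np.where(np.triu(above, k=1))
    return [(corr.index[r], corr.columns[c]) for r, c in zip(rows, cols)]


print(correlated_pairs(corr))  # [('height', 'weight')]
```

Note that this only looks for correlations above the threshold. If strong negative correlations should count too, one option is to pass corr.abs() instead of corr.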
