pandas: Returning groups of correlated columns in a pandas DataFrame

Question

Asked by Bryan

I've run a correlation matrix on a pandas DataFrame:

import numpy as np
import pandas as pd

df=pd.DataFrame( {'one':[0.1, .32, .2, 0.4, 0.8], 'two':[.23, .18, .56, .61, .12], 'three':[.9, .3, .6, .5, .3], 'four':[.34, .75, .91, .19, .21], 'zive': [0.1, .32, .2, 0.4, 0.8], 'six':[.9, .3, .6, .5, .3], 'drive':[.9, .3, .6, .5, .3]})

corrMatrix=df.corr()
corrMatrix
           drive  four   one   six  three   two  zive
drive       1.00 -0.04 -0.75  1.00   1.00  0.24 -0.75
four       -0.04  1.00 -0.49 -0.04  -0.04  0.16 -0.49
one        -0.75 -0.49  1.00 -0.75  -0.75 -0.35  1.00
six         1.00 -0.04 -0.75  1.00   1.00  0.24 -0.75
three       1.00 -0.04 -0.75  1.00   1.00  0.24 -0.75
two         0.24  0.16 -0.35  0.24   0.24  1.00 -0.35
zive       -0.75 -0.49  1.00 -0.75  -0.75 -0.35  1.00

Now I want to write some code that returns the columns that are perfectly correlated (i.e. correlation == 1), in groups.

Optimally, I would want this: [['zive', 'one'], ['three', 'six', 'drive']]

I've written the code below, which gives me ['drive', 'one', 'six', 'three', 'zive']. As you can see, this is just a bag of columns that each have a perfect correlation with some other column; it does not put them into distinct groups with their perfectly correlated cousin columns.

correlatedCols=[]
for col in corrMatrix:
    data=corrMatrix[col][corrMatrix[col]==1]
    if len(data)>1:
        correlatedCols.append(data.name)

correlatedCols  
['drive','one', 'six', 'three', 'zive']

EDIT: Using the advice given by @Karl D., I get this:

cor = df.corr()
cor.loc[:,:] =  np.tril(cor.values, k=-1)
cor = cor.stack()
cor[cor ==1]
six    drive   1.00
three  drive   1.00
       six     1.00
zive   one     1.00

...which is not quite what I want, since [six, drive] is not a complete grouping: it's missing 'three'.

Answer 1

Accepted answer by Akavall

Here is a naive approach:

import numpy as np
import pandas as pd

df=pd.DataFrame( {'one':[0.1, .32, .2, 0.4, 0.8], 'two':[.23, .18, .56, .61, .12], 'three':[.9, .3, .6, .5, .3], 'four':[.34, .75, .91, .19, .21], 'zive': [0.1, .32, .2, 0.4, 0.8], 'six':[.9, .3, .6, .5, .3], 'drive':[.9, .3, .6, .5, .3]})

corrMatrix=df.corr()

corrMatrix.loc[:,:] =  np.tril(corrMatrix, k=-1) # borrowed from Karl D's answer

already_in = set()
result = []
for col in corrMatrix:
    perfect_corr = corrMatrix[col][corrMatrix[col] == 1].index.tolist()
    if perfect_corr and col not in already_in:
        already_in.update(set(perfect_corr))
        perfect_corr.append(col)
        result.append(perfect_corr)

Result:

>>> result
[['six', 'three', 'drive'], ['zive', 'one']]
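
The loop above relies on exact equality with 1 and on the column order of the matrix. As a rough alternative sketch, the perfectly correlated columns can be treated as a graph and grouped by walking its connected components. The function name `correlated_groups` and the `threshold` and `atol` parameters are assumptions made for this illustration, not part of the original answer or of pandas:

```python
import numpy as np
import pandas as pd

def correlated_groups(df, threshold=1.0, atol=1e-9):
    """Group columns whose pairwise correlation is at least threshold - atol."""
    corr = df.corr()
    cols = list(corr.columns)
    # Boolean adjacency matrix: True where two distinct columns are linked.
    adj = corr.to_numpy() >= threshold - atol
    np.fill_diagonal(adj, False)

    seen = set()
    groups = []
    for i in range(len(cols)):
        if i in seen or not adj[i].any():
            continue  # already grouped, or not linked to any other column
        # Depth-first walk collects every column reachable from column i.
        stack, members = [i], []
        seen.add(i)
        while stack:
            j = stack.pop()
            members.append(cols[j])
            for k in np.flatnonzero(adj[j]):
                k = int(k)
                if k not in seen:
                    seen.add(k)
                    stack.append(k)
        groups.append(members)
    return groups

df = pd.DataFrame({'one': [0.1, .32, .2, 0.4, 0.8], 'two': [.23, .18, .56, .61, .12],
                   'three': [.9, .3, .6, .5, .3], 'four': [.34, .75, .91, .19, .21],
                   'zive': [0.1, .32, .2, 0.4, 0.8], 'six': [.9, .3, .6, .5, .3],
                   'drive': [.9, .3, .6, .5, .3]})
groups = correlated_groups(df)
print(groups)
```

The small tolerance is there because a computed correlation between identical columns may come out marginally below 1 in floating point. The order of names inside each group depends on the column order of the DataFrame, so compare groups as sets if order matters.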

Answer 2

Answer by Karl D.

You could do something like the following:

>>> cor = df.corr()
>>> cor.loc[:,:] =  np.tril(cor, k=-1)
>>> cor = cor.stack()
>>> cor[cor > 0.9999]

three  six    1
zive   one    1

To match more closely your expected output you can do something like the following:

>>> cor[cor > 0.9999].to_dict().keys()

[('zive', 'one'), ('three', 'six')]

Explanation. First, I create a lower triangular version of the correlation matrix that excludes the diagonal (using numpy's tril):

>>> cor.loc[:,:] =  np.tril(cor.values, k=-1)

           four       one       six     three       two  zive
four   0.000000 -0.000000 -0.000000 -0.000000  0.000000    -0
one   -0.489177  0.000000 -0.000000 -0.000000 -0.000000     0
six   -0.039607 -0.747365  0.000000  0.000000  0.000000    -0
three -0.039607 -0.747365  1.000000  0.000000  0.000000    -0
two    0.159583 -0.351531  0.238102  0.238102  0.000000    -0
zive  -0.489177  1.000000 -0.747365 -0.747365 -0.351531     0

And then I stack the dataframe:

>>> cor = cor.stack()

four   four     0.000000
       one     -0.000000
       six     -0.000000
       three   -0.000000
       two      0.000000
       zive    -0.000000
one    four    -0.489177
       one      0.000000
       six     -0.000000
       three   -0.000000
       two     -0.000000
       zive     0.000000
six    four    -0.039607
       one     -0.747365
       six      0.000000
       three    0.000000
       two      0.000000
       zive    -0.000000
three  four    -0.039607
       one     -0.747365
       six      1.000000
       three    0.000000
       two      0.000000
       zive    -0.000000
two    four     0.159583
       one     -0.351531
       six      0.238102
       three    0.238102
       two      0.000000
       zive    -0.000000
zive   four    -0.489177
       one      1.000000
       six     -0.747365
       three   -0.747365
       two     -0.351531
       zive     0.000000

And then I can just grab the rows that equal one.
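
Comparing floats for exact equality with one can miss values that are off by rounding error, which is presumably why the snippets in this answer use `> 0.9999`. A minimal sketch of the same tril-then-stack idea with an explicit tolerance might look like the following; masking with `DataFrame.where` and comparing with `np.isclose` at its default tolerances are choices made for this illustration rather than part of the original answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [0.1, .32, .2, 0.4, 0.8], 'two': [.23, .18, .56, .61, .12],
                   'three': [.9, .3, .6, .5, .3], 'four': [.34, .75, .91, .19, .21],
                   'zive': [0.1, .32, .2, 0.4, 0.8], 'six': [.9, .3, .6, .5, .3],
                   'drive': [.9, .3, .6, .5, .3]})

corr = df.corr()
# True strictly below the diagonal, so each pair of columns appears once.
mask = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
# Entries outside the mask become NaN, which never compares close to 1.
pairs = corr.where(mask).stack()
perfect = pairs[np.isclose(pairs.to_numpy(), 1.0)]
pair_list = list(perfect.index)  # list of (row_label, column_label) tuples
print(pair_list)
```

This yields pairs only, not groups, so a pair such as six/drive still needs to be merged with three in a later step.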

Edit: I think this will get the form you want but it's not elegant:

>>> from itertools import chain

>>> cor = df.corr()
>>> cor.loc[:,:] =  np.tril(cor, k=-1)
>>> cor = cor.stack()
>>> ones = cor[cor > 0.999].reset_index().loc[:,['level_0','level_1']]
>>> ones = ones.query('level_0 not in level_1')
>>> ones.groupby('level_0').agg(lambda x: set(chain(x.level_0,x.level_1))).values

[[set(['six', 'drive', 'three'])]
 [set(['zive', 'one'])]]
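
Once the perfectly correlated pairs are available, merging them into groups does not need pandas at all. The following is a small sketch in plain Python; `merge_pairs` is a hypothetical helper written for this illustration, and the input pairs are copied from the stacked output shown in the question:

```python
def merge_pairs(pairs):
    """Merge (a, b) pairs into disjoint sets of transitively linked names."""
    groups = []
    for a, b in pairs:
        # Every existing group that contains either end of this pair.
        touching = [g for g in groups if a in g or b in g]
        merged = {a, b}.union(*touching)
        # Replace the touched groups with their union.
        groups = [g for g in groups if g not in touching]
        groups.append(merged)
    return groups

pairs = [('six', 'drive'), ('three', 'drive'), ('three', 'six'), ('zive', 'one')]
groups = merge_pairs(pairs)
print(groups)
```

Because each new pair is merged with every group it touches, the result does not depend on the order in which the pairs arrive.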
