Returning groups of correlated columns in a pandas DataFrame
Source: http://stackoverflow.com/questions/24002820/
This page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the source address and authors, and attribute it to the original authors on StackOverflow (not to this page).
Asked by Bryan
I've computed a correlation matrix for a pandas DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [0.1, .32, .2, 0.4, 0.8], 'two': [.23, .18, .56, .61, .12], 'three': [.9, .3, .6, .5, .3], 'four': [.34, .75, .91, .19, .21], 'zive': [0.1, .32, .2, 0.4, 0.8], 'six': [.9, .3, .6, .5, .3], 'drive': [.9, .3, .6, .5, .3]})
corrMatrix = df.corr()
corrMatrix
       drive   four    one    six  three    two   zive
drive   1.00  -0.04  -0.75   1.00   1.00   0.24  -0.75
four   -0.04   1.00  -0.49  -0.04  -0.04   0.16  -0.49
one    -0.75  -0.49   1.00  -0.75  -0.75  -0.35   1.00
six     1.00  -0.04  -0.75   1.00   1.00   0.24  -0.75
three   1.00  -0.04  -0.75   1.00   1.00   0.24  -0.75
two     0.24   0.16  -0.35   0.24   0.24   1.00  -0.35
zive   -0.75  -0.49   1.00  -0.75  -0.75  -0.35   1.00
Now I want to write some code that returns the columns that are perfectly correlated (i.e., correlation == 1), arranged in groups.
Optimally, I would want this:
[['zive', 'one'], ['three', 'six', 'drive']]
I've written the code below, which gives me ['drive', 'one', 'six', 'three', 'zive']. As you can see, this is just a bag of columns that each have a perfect correlation with some other column. It does not put each column into a distinct group together with the columns it is perfectly correlated with.
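correlatedCols = []
for col in corrMatrix:
    # Entries of this column that equal 1 (always includes the diagonal)
    data = corrMatrix[col][corrMatrix[col] == 1]
    # More than one match means the column is perfectly correlated with another one
    if len(data) > 1:
        correlatedCols.append(data.name)
correlatedCols
['drive', 'one', 'six', 'three', 'zive']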
EDIT: Using the advice given by @Karl D., I get this:
cor = df.corr()
cor.loc[:,:] = np.tril(cor.values, k=-1)
cor = cor.stack()
cor[cor ==1]
six    drive    1.00
three  drive    1.00
       six      1.00
zive   one      1.00
...which is not quite what I want: [six, drive] is not a complete grouping because it is missing 'three'.
Accepted answer by Akavall
Here is a naive approach:
df = pd.DataFrame({'one': [0.1, .32, .2, 0.4, 0.8], 'two': [.23, .18, .56, .61, .12], 'three': [.9, .3, .6, .5, .3], 'four': [.34, .75, .91, .19, .21], 'zive': [0.1, .32, .2, 0.4, 0.8], 'six': [.9, .3, .6, .5, .3], 'drive': [.9, .3, .6, .5, .3]})
corrMatrix = df.corr()
corrMatrix.loc[:,:] = np.tril(corrMatrix, k=-1) # borrowed from Karl D's answer
already_in = set()  # columns already placed in some group
result = []
for col in corrMatrix:
    # Row labels whose correlation with this column equals 1 (below the diagonal only)
    perfect_corr = corrMatrix[col][corrMatrix[col] == 1].index.tolist()
    if perfect_corr and col not in already_in:
        already_in.update(set(perfect_corr))
        perfect_corr.append(col)
        result.append(perfect_corr)
Result:
>>> result
[['six', 'three', 'drive'], ['zive', 'one']]
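The loop above compares floats with == 1 and relies on the order in which columns are visited. As a hedged alternative (not part of the original answer), the sketch below treats "perfectly correlated" as a graph relation and collects connected groups of columns. The function name correlated_groups and the tolerance parameter atol are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def correlated_groups(df, target=1.0, atol=1e-9):
    """Return lists of columns whose pairwise correlation is within atol of target."""
    corr = df.corr()
    cols = list(corr.columns)
    # Boolean adjacency matrix; rtol=0 so only the absolute tolerance applies.
    linked = np.isclose(corr.to_numpy(), target, rtol=0.0, atol=atol)
    seen = set()
    groups = []
    for start in range(len(cols)):
        if start in seen:
            continue
        # Depth-first walk over linked columns collects one group.
        pending, members = [start], []
        seen.add(start)
        while pending:
            i = pending.pop()
            members.append(cols[i])
            for j in np.flatnonzero(linked[i]):
                j = int(j)
                if j not in seen:
                    seen.add(j)
                    pending.append(j)
        # A column linked only to itself is not a group.
        if len(members) > 1:
            groups.append(sorted(members))
    return groups


df = pd.DataFrame({'one': [0.1, .32, .2, 0.4, 0.8], 'two': [.23, .18, .56, .61, .12],
                   'three': [.9, .3, .6, .5, .3], 'four': [.34, .75, .91, .19, .21],
                   'zive': [0.1, .32, .2, 0.4, 0.8], 'six': [.9, .3, .6, .5, .3],
                   'drive': [.9, .3, .6, .5, .3]})
groups = correlated_groups(df)
```

For the question's data this should group 'one' with 'zive' and 'three' with 'six' and 'drive'. The order of the groups follows the column order of the DataFrame.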
Answer by Karl D.
You could do something like the following:
>>> cor = df.corr()
>>> cor.loc[:,:] = np.tril(cor, k=-1)
>>> cor = cor.stack()
>>> cor[cor > 0.9999]
three  six    1
zive   one    1
To match your expected output more closely, you can do something like the following:
>>> cor[cor > 0.9999].to_dict().keys()
[('zive', 'one'), ('three', 'six')]
Explanation. First, I create a lower triangular version of the correlation matrix that excludes the diagonal (using numpy's tril):
>>> cor.loc[:,:] = np.tril(cor.values, k=-1)
           four       one       six     three       two  zive
four   0.000000 -0.000000 -0.000000 -0.000000  0.000000    -0
one   -0.489177  0.000000 -0.000000 -0.000000 -0.000000     0
six   -0.039607 -0.747365  0.000000  0.000000  0.000000    -0
three -0.039607 -0.747365  1.000000  0.000000  0.000000    -0
two    0.159583 -0.351531  0.238102  0.238102  0.000000    -0
zive  -0.489177  1.000000 -0.747365 -0.747365 -0.351531     0
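To see what the k=-1 argument does in isolation, here is a minimal sketch on a made-up 3x3 symmetric matrix (the values are illustrative, not taken from the question). np.tril keeps the entries on and below the k-th diagonal and zeroes the rest, so k=-1 also zeroes the main diagonal and each pair of columns is left with a single entry.

```python
import numpy as np

# A small symmetric matrix standing in for a correlation matrix.
m = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.9],
              [0.2, 0.9, 1.0]])

# k=-1 keeps only the entries strictly below the main diagonal.
lower = np.tril(m, k=-1)

# Each unordered pair now appears once: n * (n - 1) / 2 entries for n columns.
kept = int(np.count_nonzero(lower))
```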
And then I stack the dataframe:
>>> cor = cor.stack()
four   four     0.000000
       one     -0.000000
       six     -0.000000
       three   -0.000000
       two      0.000000
       zive    -0.000000
one    four    -0.489177
       one      0.000000
       six     -0.000000
       three   -0.000000
       two     -0.000000
       zive     0.000000
six    four    -0.039607
       one     -0.747365
       six      0.000000
       three    0.000000
       two      0.000000
       zive    -0.000000
three  four    -0.039607
       one     -0.747365
       six      1.000000
       three    0.000000
       two      0.000000
       zive    -0.000000
two    four     0.159583
       one     -0.351531
       six      0.238102
       three    0.238102
       two      0.000000
       zive    -0.000000
zive   four    -0.489177
       one      1.000000
       six     -0.747365
       three   -0.747365
       two     -0.351531
       zive     0.000000
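The effect of stack() is easier to see on a tiny frame. The sketch below uses a made-up 2x2 frame with labels 'a' and 'b' (assumptions for illustration): stacking turns the frame into a Series indexed by (row label, column label) pairs, so a boolean filter on the values returns the matching label pairs directly.

```python
import pandas as pd

# A made-up lower-triangular frame: only the ('b', 'a') entry is non-zero.
small = pd.DataFrame([[0.0, 0.0],
                      [1.0, 0.0]],
                     index=['a', 'b'], columns=['a', 'b'])

# stack() moves the column labels into a second index level.
stacked = small.stack()

# Filtering the Series leaves only the (row, column) pairs of interest.
pairs = stacked[stacked > 0.9999].index.tolist()
```

Reading the pairs from the index with .index.tolist() gives a plain list of tuples.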
Then I can just select the rows whose value equals one (the cor > 0.9999 filter above).
Edit: I think this will get the form you want, but it's not elegant:
>>> from itertools import chain
>>> cor = df.corr()  # start again from the unstacked correlation matrix
>>> cor.loc[:,:] = np.tril(cor, k=-1)
>>> cor = cor.stack()
>>> ones = cor[cor > 0.999].reset_index().loc[:,['level_0','level_1']]
>>> ones = ones.query('level_0 not in level_1')
>>> ones.groupby('level_0').agg(lambda x: set(chain(x.level_0,x.level_1))).values
[[set(['six', 'drive', 'three'])]
[set(['zive', 'one'])]]
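If the groupby step feels heavy, the label pairs can also be merged in plain Python. The following is a rough sketch under the assumption that the pairs have already been extracted as tuples (here copied from the output shown in the question's edit); merge_pairs is a hypothetical helper name, not a pandas function.

```python
def merge_pairs(pairs):
    """Merge pairs that share a member into disjoint sets."""
    groups = []
    for a, b in pairs:
        # Every existing group that contains either member of this pair.
        touching = [g for g in groups if a in g or b in g]
        merged = {a, b}
        for g in touching:
            merged |= g
            groups.remove(g)
        groups.append(merged)
    return groups


# Pairs taken from the stacked output in the question's edit.
pairs = [('six', 'drive'), ('three', 'drive'), ('three', 'six'), ('zive', 'one')]
groups = merge_pairs(pairs)
```

Because every pair is folded into whichever groups it touches, 'three' ends up with 'six' and 'drive' regardless of the order in which the pairs arrive.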

