List Highest Correlation Pairs from a Large Correlation Matrix in Pandas?
Source: http://stackoverflow.com/questions/17778394/
This content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors.
Asked by Kyle Brandt
How do you find the top correlations in a correlation matrix with pandas? There are many answers on how to do this with R ("Show correlations as an ordered list, not as a large matrix" or "Efficient way to get highly correlated pairs from large data set in Python or R"), but I am wondering how to do it with pandas. In my case the matrix is 4460x4460, so I can't do it visually.
Accepted answer by HYRY
You can use DataFrame.values to get a NumPy array of the data and then use NumPy functions such as argsort() to get the most correlated pairs.
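The NumPy route is not spelled out in the answer, so here is a minimal sketch of one way it could look. The data size, the seed, the planted pair (columns 3 and 7) and the names iu, ju and top_pairs are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Small synthetic data set (assumed for illustration): 500 rows, 20 columns
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 20))
data[:, 3] += data[:, 7]  # plant one strongly correlated pair
df = pd.DataFrame(data)

# Absolute correlation matrix as a plain NumPy array
corr = df.corr().abs().to_numpy()

# Indices of the strict upper triangle: each pair once, no diagonal
iu, ju = np.triu_indices_from(corr, k=1)
values = corr[iu, ju]

# argsort is ascending, so reverse it and keep the five largest values
order = np.argsort(values)[::-1][:5]
top_pairs = [(df.columns[iu[k]], df.columns[ju[k]], values[k]) for k in order]

for first, second, value in top_pairs:
    print(first, second, round(value, 4))
```

Working on the upper triangle only means that self-correlations and mirrored pairs never enter the sort, which also halves the number of values to sort for a large matrix.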
But if you want to do this in pandas, you can unstack and sort the DataFrame:
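import pandas as pd
import numpy as np

shape = (50, 4460)
data = np.random.normal(size=shape)
data[:, 1000] += data[:, 2000]  # plant one strongly correlated pair
df = pd.DataFrame(data)

c = df.corr().abs()
s = c.unstack()
so = s.sort_values(kind="quicksort")
# the last 4460 entries are the self-correlations (all 1.0),
# so print the 10 entries just before them
print(so[-4470:-4460])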
Here is the output:
2192  1522    0.636198
1522  2192    0.636198
3677  2027    0.641817
2027  3677    0.641817
242   130     0.646760
130   242     0.646760
1171  2733    0.670048
2733  1171    0.670048
1000  2000    0.742340
2000  1000    0.742340
dtype: float64
Answer by arun
@HYRY's answer is perfect. This builds on it by adding a bit more logic to drop duplicate pairs and self-correlations and to sort properly:
import pandas as pd

d = {'x1': [1, 4, 4, 5, 6],
     'x2': [0, 0, 8, 2, 4],
     'x3': [2, 8, 8, 10, 12],
     'x4': [-1, -4, -4, -4, -5]}
df = pd.DataFrame(data = d)
print("Data Frame")
print(df)
print()
print("Correlation Matrix")
print(df.corr())
print()

def get_redundant_pairs(df):
    '''Get diagonal and lower triangular pairs of correlation matrix'''
    pairs_to_drop = set()
    cols = df.columns
    for i in range(0, df.shape[1]):
        for j in range(0, i+1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop

def get_top_abs_correlations(df, n=5):
    au_corr = df.corr().abs().unstack()
    labels_to_drop = get_redundant_pairs(df)
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
    return au_corr[0:n]

print("Top Absolute Correlations")
print(get_top_abs_correlations(df, 3))
That gives the following output:
Data Frame
   x1  x2  x3  x4
0   1   0   2  -1
1   4   0   8  -4
2   4   8   8  -4
3   5   2  10  -4
4   6   4  12  -5

Correlation Matrix
          x1        x2        x3        x4
x1  1.000000  0.399298  1.000000 -0.969248
x2  0.399298  1.000000  0.399298 -0.472866
x3  1.000000  0.399298  1.000000 -0.969248
x4 -0.969248 -0.472866 -0.969248  1.000000

Top Absolute Correlations
x1  x3    1.000000
x3  x4    0.969248
x1  x4    0.969248
dtype: float64
Answer by MiFi
A solution in a few lines, without redundant pairs of variables:
import numpy as np

corr_matrix = df.corr().abs()
# the matrix is symmetric, so extract the upper triangle without the diagonal (k=1)
sol = (corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
       .stack()
       .dropna()  # removes the masked entries if stack() kept them as NaN
       .sort_values(ascending=False))
# the first element of sol is the pair with the biggest correlation
Answer by Frederik Meinertsen
Use itertools.combinations to get all unique correlations from pandas' own correlation matrix, .corr(), generate a list of lists, and feed it back into a DataFrame in order to use .sort_values. Set ascending=True to display the lowest correlations on top.
corrank takes a DataFrame as its argument because it requires .corr().
import itertools
import pandas as pd

def corrank(X):
    corr = X.corr()  # compute the matrix once instead of once per pair
    df = pd.DataFrame([[(i, j), corr.loc[i, j]] for i, j in itertools.combinations(corr, 2)],
                      columns=['pairs', 'corr'])
    print(df.sort_values(by='corr', ascending=False))

corrank(X)  # prints the correlation pairs in descending order (max on top)
Answer by prashanth
Use the code below to view the correlations in descending order.
# See the correlations in descending order
corr = df.corr() # df is the pandas dataframe
c1 = corr.abs().unstack()
c1.sort_values(ascending = False)
Answer by Addison Klinke
Combining some features of @HYRY's and @arun's answers, you can print the top correlations for dataframe df in a single line using:
df.corr().unstack().sort_values().drop_duplicates()
Note: the one downside is that if you have 1.0 correlations that are not between a variable and itself, the drop_duplicates() addition would remove them.
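To make that downside concrete, here is a small sketch on a hand-made correlation matrix. The matrix and the labels a, b and c are hypothetical and chosen only so the effect is easy to see. Because drop_duplicates() compares values rather than labels, it keeps one entry per distinct value, so distinct pairs that happen to share a correlation are collapsed too.

```python
import pandas as pd

# Hypothetical correlation matrix: "a" and "b" are perfectly correlated,
# and both have the same correlation with "c"
corr = pd.DataFrame(
    [[1.0, 1.0, 0.5],
     [1.0, 1.0, 0.5],
     [0.5, 0.5, 1.0]],
    index=list("abc"), columns=list("abc"),
)

pairs = corr.unstack()

# The one-liner keeps a single entry for each distinct value
deduped = pairs.sort_values().drop_duplicates()

# Filtering on the labels instead removes the diagonal and the mirrored
# duplicates without looking at the values
unique_pairs = pairs[[a < b for a, b in pairs.index]].sort_values()

print(len(deduped), "entries after drop_duplicates()")
print(len(unique_pairs), "entries after filtering on labels")
```

The matrix holds three distinct pairs, but only two entries survive drop_duplicates(), and the surviving 1.0 entry may be a self-correlation rather than the (a, b) pair.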
Answer by Rich Wandell
Lots of good answers here. The easiest way I found was a combination of some of the answers above.
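import numpy as np

# corr is assumed to be a correlation matrix, e.g. corr = df.corr()
corr = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
# unstack() returns a Series, so sort_values() takes no `by` argument
corr = corr.unstack()\
           .sort_values(ascending=False)\
           .dropna()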
Answer by falsarella
I didn't want to unstack or over-complicate this issue, since I just wanted to drop some highly correlated features as part of a feature selection phase.
So I ended up with the following simplified solution:
# map features to their absolute correlation values
corr = features.corr().abs()
# set self-correlations to zero
# (this also zeroes any other pair whose correlation is exactly 1)
corr[corr == 1] = 0
# for each feature, find its max correlation
# and sort the resulting Series in descending order
corr_cols = corr.max().sort_values(ascending=False)
# display the highly correlated features (display() is available in IPython/Jupyter)
display(corr_cols[corr_cols > 0.8])
In this case, if you want to drop correlated features, you can map through the filtered corr_cols Series and remove the odd-indexed (or even-indexed) ones.
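The dropping step is only described in words above. As a rough sketch of one alternative that does not rely on index positions (this is a swapped-in technique, not falsarella's method), you could scan the upper triangle of the matrix and drop every column that is highly correlated with an earlier one. The toy features table and the 0.8 threshold are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical feature table (the toy data used in an earlier answer)
features = pd.DataFrame({
    'x1': [1, 4, 4, 5, 6],
    'x2': [0, 0, 8, 2, 4],
    'x3': [2, 8, 8, 10, 12],
    'x4': [-1, -4, -4, -4, -5],
})
threshold = 0.8  # assumed cut-off for "highly correlated"

corr = features.corr().abs()

# Keep only the strict upper triangle; everything else becomes NaN
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# A column is dropped if it is highly correlated with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = features.drop(columns=to_drop)

print(to_drop)
print(list(reduced.columns))
```

With this approach the first feature of each correlated group is kept and the later ones are removed, so the result depends on column order.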
Answer by KIC
I tried some of the solutions here, but then came up with my own. I hope it is useful for the next person, so I am sharing it here:
def sort_correlation_matrix(correlation_matrix):
    cor = correlation_matrix.abs()
    top_col = cor[cor.columns[0]][1:]
    top_col = top_col.sort_values(ascending=False)
    ordered_columns = [cor.columns[0]] + top_col.index.tolist()
    return correlation_matrix[ordered_columns].reindex(ordered_columns)
Answer by Aibloy
This is an improved version of @MiFi's code. It orders the pairs by absolute value but keeps the sign of negative correlations.
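import numpy as np
import pandas as pd

def top_correlation(df, n):
    corr_matrix = df.corr()
    # keep the upper triangle without the diagonal, then flatten to (row, column) pairs
    correlation = (corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
                   .stack()
                   .dropna()  # removes the masked entries if stack() kept them as NaN
                   .sort_values(ascending=False))
    correlation = pd.DataFrame(correlation).reset_index()
    correlation.columns = ["Variable_1", "Variable_2", "Correlacion"]
    # reorder by absolute value while keeping the signed correlations
    correlation = correlation.reindex(correlation.Correlacion.abs().sort_values(ascending=False).index).reset_index().drop(["index"], axis=1)
    return correlation.head(n)

top_correlation(ANYDATA, 10)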