Original source: http://stackoverflow.com/questions/44889508/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverflow
If correlation is greater than 0.75, remove the column from a pandas dataframe
Asked by Avanish Mishra
I have a dataframe named data, for which I computed the correlation matrix using
corr = data.corr()
If the correlation between two columns is greater than 0.75, I want to remove one of them from the dataframe data. I tried the following:
raw = corr[(corr.abs() > 0.75) & (corr.abs() < 1.0)]
but it did not help. I need the column numbers from raw for which the value is nonzero. Basically, I am looking for a Python replacement for the following R commands:
{hc=findCorrelation(corr,cutoff = 0.75)
hc = sort(hc)
data <- data[,-c(hc)]}
If anyone can help me find a pandas equivalent of the R commands above, that would be very helpful.
Answered by piRSquared
Use np.eye to ignore the diagonal values and find all columns that have some value whose absolute value is greater than the threshold. Use the logical negation as a mask for the index and columns.
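To make the steps in that description concrete, here is a minimal sketch on a tiny correlation matrix. The 3x3 values and the names off_diag, too_correlated and keep are invented for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x3 correlation matrix, invented for illustration only.
corr = pd.DataFrame(
    [[1.0, 0.9, 0.1],
     [0.9, 1.0, 0.2],
     [0.1, 0.2, 1.0]],
    index=list("XYZ"), columns=list("XYZ"))

# np.eye is True on the diagonal; mask() replaces those cells with NaN,
# so the self-correlation of 1.0 never trips the threshold.
off_diag = corr.mask(np.eye(len(corr), dtype=bool))

# True for each column with at least one off-diagonal |value| above 0.75.
too_correlated = (off_diag.abs() > 0.75).any()

# Logical negation: True for the columns to keep.
keep = ~too_correlated
print(keep)
```

Here X and Y correlate at 0.9, so both are flagged, and only Z is kept.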
Your example
m = ~(corr.mask(np.eye(len(corr), dtype=bool)).abs() > 0.75).any()
raw = corr.loc[m, m]
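The snippet above filters corr itself. Since the question asks about dropping columns from data, the same mask could plausibly be applied to the columns of data, as in the sketch below. It assumes every column of data is numeric, so that the columns of corr match the columns of data. The sample values and the name reduced are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration: b is an exact multiple of a.
data = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
corr = data.corr()

# Same mask as in the answer: True for columns with no strong correlation.
m = ~(corr.mask(np.eye(len(corr), dtype=bool)).abs() > 0.75).any()

# Use the boolean mask to select columns of the original dataframe.
reduced = data.loc[:, m]
print(reduced.columns.tolist())
```

Note that this removes both a and b, because each one is strongly correlated with the other.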
Working example
import numpy as np
import pandas as pd

np.random.seed([3,1415])
data = pd.DataFrame(
    np.random.randint(10, size=(10, 10)),
    columns=list('ABCDEFGHIJ'))
data
   A  B  C  D  E  F  G  H  I  J
0  0  2  7  3  8  7  0  6  8  6
1  0  2  0  4  9  7  3  2  4  3
2  3  6  7  7  4  5  3  7  5  9
3  8  7  6  4  7  6  2  6  6  5
4  2  8  7  5  8  4  7  6  1  5
5  2  8  2  4  7  6  9  4  2  4
6  6  3  8  3  9  8  0  4  3  0
7  4  1  5  8  6  0  8  7  4  6
8  3  5  8  5  1  5  1  4  3  9
9  5  5  7  0  3  2  5  8  8  9
corr = data.corr()
corr
      A     B     C     D     E     F     G     H     I     J
A  1.00  0.22  0.42 -0.12 -0.17 -0.16 -0.11  0.35  0.13 -0.06
B  0.22  1.00  0.10 -0.08 -0.18  0.07  0.33  0.12 -0.34  0.17
C  0.42  0.10  1.00 -0.08 -0.41 -0.12 -0.42  0.55  0.20  0.34
D -0.12 -0.08 -0.08  1.00 -0.05 -0.29  0.27  0.02 -0.45  0.11
E -0.17 -0.18 -0.41 -0.05  1.00  0.47  0.00 -0.38 -0.19 -0.86
F -0.16  0.07 -0.12 -0.29  0.47  1.00 -0.62 -0.67 -0.08 -0.54
G -0.11  0.33 -0.42  0.27  0.00 -0.62  1.00  0.22 -0.40  0.07
H  0.35  0.12  0.55  0.02 -0.38 -0.67  0.22  1.00  0.50  0.59
I  0.13 -0.34  0.20 -0.45 -0.19 -0.08 -0.40  0.50  1.00  0.40
J -0.06  0.17  0.34  0.11 -0.86 -0.54  0.07  0.59  0.40  1.00
m = ~(corr.mask(np.eye(len(corr), dtype=bool)).abs() > 0.5).any()
m
A     True
B     True
C    False
D     True
E    False
F    False
G    False
H    False
I     True
J    False
dtype: bool
raw = corr.loc[m, m]
raw
      A     B     D     I
A  1.00  0.22 -0.12  0.13
B  0.22  1.00 -0.08 -0.34
D -0.12 -0.08  1.00 -0.45
I  0.13 -0.34 -0.45  1.00
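In the working example, the mask drops every column that takes part in a strongly correlated pair (both E and J are gone, for instance), whereas the question asks to remove only one column of each pair. A different technique, not the one used in this answer, is to look only at the strict upper triangle of the correlation matrix, so that each pair is examined once and only the later column is dropped. The sketch below assumes all columns are numeric. The helper name drop_one_of_each_pair and the sample data are hypothetical. It is not an exact port of R's findCorrelation, which, as far as I know, picks the column to remove based on mean absolute correlation rather than column order.

```python
import numpy as np
import pandas as pd

def drop_one_of_each_pair(data, threshold=0.75):
    """Hypothetical helper: drop the later column of each highly correlated pair."""
    corr = data.corr().abs()
    # Keep only the strict upper triangle (k=1 excludes the diagonal);
    # everything else becomes NaN, so each pair is examined exactly once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # A column is dropped if it correlates strongly with any earlier column.
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return data.drop(columns=to_drop)

# Hypothetical data, invented for illustration: b is an exact multiple of a.
data = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
reduced = drop_one_of_each_pair(data)
print(reduced.columns.tolist())
```

With this data, b is dropped because it duplicates a, while a and c are kept. Which member of a pair survives depends purely on column order here, so reorder the columns first if one of them should be preferred.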