If correlation is greater than 0.75, remove the column from a pandas DataFrame

Notice: this page reproduces a popular Stack Overflow question and its answer under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/44889508/



Tags: python, r, pandas, machine-learning, scikit-learn

Asked by Avanish Mishra

I have a dataframe named data, for which I plotted a correlation matrix using

corr = data.corr()

If the correlation between two columns is greater than 0.75, I want to remove one of them from the dataframe data. I tried this:

raw = corr[(corr.abs() > 0.75) & (corr.abs() < 1.0)]

but it did not help. I need the column numbers from raw for which the value is nonzero. Basically, I am looking for a Python replacement for the following R command:

library(caret)  # provides findCorrelation
hc = findCorrelation(corr, cutoff = 0.75)
hc = sort(hc)
data <- data[, -c(hc)]

If anyone can help me find a pandas equivalent of the R command above, that would be helpful.

Answered by piRSquared

Use np.eye to ignore the diagonal values, then find all columns that contain at least one value whose absolute value is greater than the threshold. Use the logical negation of that result as a mask for both the index and the columns.



Your example


m = ~(corr.mask(np.eye(len(corr), dtype=bool)).abs() > 0.75).any()

raw = corr.loc[m, m]
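
The mask above filters the correlation matrix itself, while the question asks for columns to be removed from data. A minimal sketch of that last step follows, on a tiny made-up frame (the column names x, y, z and their values are assumptions for illustration only). Since m is indexed by column name, it can be passed to data.loc as a column selector. Note that this approach drops every column that takes part in a highly correlated pair, not just one member of the pair.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: y is an exact linear function of x, z is unrelated.
data = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [3, 5, 7, 9, 11, 13],   # y = 2*x + 1, so corr(x, y) is 1
    "z": [1, -1, 1, -1, 1, -1],  # only weakly correlated with x and y
})

corr = data.corr()

# Hide the diagonal (each column's correlation with itself) behind NaN.
off_diag = corr.mask(np.eye(len(corr), dtype=bool))

# True for columns with no off-diagonal |correlation| above the threshold.
m = ~(off_diag.abs() > 0.75).any()

# Apply the same mask to the columns of the original data.
reduced = data.loc[:, m]
print(list(reduced.columns))  # x and y are both dropped, only z remains
```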

Working example (with the threshold lowered to 0.5)

import numpy as np
import pandas as pd

np.random.seed([3, 1415])
data = pd.DataFrame(
    np.random.randint(10, size=(10, 10)),
    columns=list('ABCDEFGHIJ'))
data

   A  B  C  D  E  F  G  H  I  J
0  0  2  7  3  8  7  0  6  8  6
1  0  2  0  4  9  7  3  2  4  3
2  3  6  7  7  4  5  3  7  5  9
3  8  7  6  4  7  6  2  6  6  5
4  2  8  7  5  8  4  7  6  1  5
5  2  8  2  4  7  6  9  4  2  4
6  6  3  8  3  9  8  0  4  3  0
7  4  1  5  8  6  0  8  7  4  6
8  3  5  8  5  1  5  1  4  3  9
9  5  5  7  0  3  2  5  8  8  9


corr = data.corr()
corr

      A     B     C     D     E     F     G     H     I     J
A  1.00  0.22  0.42 -0.12 -0.17 -0.16 -0.11  0.35  0.13 -0.06
B  0.22  1.00  0.10 -0.08 -0.18  0.07  0.33  0.12 -0.34  0.17
C  0.42  0.10  1.00 -0.08 -0.41 -0.12 -0.42  0.55  0.20  0.34
D -0.12 -0.08 -0.08  1.00 -0.05 -0.29  0.27  0.02 -0.45  0.11
E -0.17 -0.18 -0.41 -0.05  1.00  0.47  0.00 -0.38 -0.19 -0.86
F -0.16  0.07 -0.12 -0.29  0.47  1.00 -0.62 -0.67 -0.08 -0.54
G -0.11  0.33 -0.42  0.27  0.00 -0.62  1.00  0.22 -0.40  0.07
H  0.35  0.12  0.55  0.02 -0.38 -0.67  0.22  1.00  0.50  0.59
I  0.13 -0.34  0.20 -0.45 -0.19 -0.08 -0.40  0.50  1.00  0.40
J -0.06  0.17  0.34  0.11 -0.86 -0.54  0.07  0.59  0.40  1.00


m = ~(corr.mask(np.eye(len(corr), dtype=bool)).abs() > 0.5).any()
m

A     True
B     True
C    False
D     True
E    False
F    False
G    False
H    False
I     True
J    False
dtype: bool


raw = corr.loc[m, m]
raw

      A     B     D     I
A  1.00  0.22 -0.12  0.13
B  0.22  1.00 -0.08 -0.34
D -0.12 -0.08  1.00 -0.45
I  0.13 -0.34 -0.45  1.00
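
If the goal is to keep one column from each highly correlated pair, as the question describes, a commonly used alternative is to look only at the upper triangle of the absolute correlation matrix. The sketch below is not the method from the answer above, and it is not an exact port of caret's findCorrelation (which decides which member of a pair to drop based on mean absolute correlation). The function name drop_one_of_each_pair and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def drop_one_of_each_pair(data, threshold=0.75):
    """Drop the later column of every pair whose |correlation| exceeds threshold.

    Hypothetical helper, not part of pandas or of the original answer.
    """
    corr = data.corr().abs()
    # Keep only entries strictly above the diagonal so each pair appears once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # A column is dropped if it is highly correlated with any earlier column.
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return data.drop(columns=to_drop), to_drop


# Hypothetical toy data: y duplicates the information in x, z does not.
data = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [3, 5, 7, 9, 11, 13],   # y = 2*x + 1
    "z": [1, -1, 1, -1, 1, -1],
})

reduced, dropped = drop_one_of_each_pair(data)
print(dropped)                # y is dropped
print(list(reduced.columns))  # x and z are kept
```

Which member of a pair survives here depends only on column order, so reordering the columns can change the result.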