Python numpy corrcoef - compute correlation matrix while ignoring missing data
Original URL: http://stackoverflow.com/questions/31619578/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
numpy corrcoef - compute correlation matrix while ignoring missing data
Asked by Selah
I am trying to compute a correlation matrix of several values. These values include some 'nan' values. I'm using numpy.corrcoef. For element(i,j) of the output correlation matrix I'd like to have the correlation calculated using all values that exist for both variable i and variable j.
This is what I have now:
In[20]: df_counties = pd.read_sql("SELECT Median_Age, Rpercent_2008, overall_LS, population_density FROM countyVotingSM2", db_eng)
In[21]: np.corrcoef(df_counties, rowvar = False)
Out[21]:
array([[ 1.        ,         nan,         nan, -0.10998411],
       [        nan,         nan,         nan,         nan],
       [        nan,         nan,         nan,         nan],
       [-0.10998411,         nan,         nan,  1.        ]])
Too many nan's :(
Accepted answer by Jianxun Li
One of the main features of pandas is being NaN friendly. To calculate the correlation matrix, simply call df_counties.corr(). Below is an example demonstrating that df.corr() is NaN tolerant whereas np.corrcoef is not.
import pandas as pd
import numpy as np
# data
# ==============================
np.random.seed(0)
df = pd.DataFrame(np.random.randn(100,5), columns=list('ABCDE'))
df[df < 0] = np.nan
df
         A       B       C       D       E
0   1.7641  0.4002  0.9787  2.2409  1.8676
1      NaN  0.9501     NaN     NaN  0.4106
2   0.1440  1.4543  0.7610  0.1217  0.4439
3   0.3337  1.4941     NaN  0.3131     NaN
4      NaN  0.6536  0.8644     NaN  2.2698
5      NaN  0.0458     NaN  1.5328  1.4694
6   0.1549  0.3782     NaN     NaN     NaN
7   0.1563  1.2303  1.2024     NaN     NaN
8      NaN     NaN     NaN  1.9508     NaN
9      NaN     NaN  0.7775     NaN     NaN
..     ...     ...     ...     ...     ...
90     NaN  0.8202  0.4631  0.2791  0.3389
91  2.0210     NaN     NaN  0.1993     NaN
92     NaN     NaN     NaN  0.1813     NaN
93  2.4125     NaN     NaN     NaN  0.2515
94     NaN     NaN     NaN     NaN  1.7389
95  0.9944  1.3191     NaN  1.1286  0.4960
96  0.7714  1.0294     NaN     NaN  0.8626
97     NaN  1.5133  0.5531     NaN  0.2205
98     NaN     NaN  1.1003  1.2980  2.6962
99     NaN     NaN     NaN     NaN     NaN
[100 rows x 5 columns]
# calculations
# ================================
df.corr()
        A       B       C       D       E
A  1.0000  0.2718  0.2678  0.2822  0.1016
B  0.2718  1.0000 -0.0692  0.1736 -0.1432
C  0.2678 -0.0692  1.0000 -0.3392  0.0012
D  0.2822  0.1736 -0.3392  1.0000  0.1562
E  0.1016 -0.1432  0.0012  0.1562  1.0000
np.corrcoef(df, rowvar=False)
array([[ nan,  nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan]])
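To make the pairwise behaviour concrete, here is a minimal sketch (not part of the original answer) that rebuilds one entry of df.corr() by hand. It assumes the same synthetic df as in the example; the names both, manual and pandas_value are illustrative only. For each pair of columns, pandas uses just the rows where both values are present, which is what the question asks for.

```python
import numpy as np
import pandas as pd

# Same synthetic data as in the example above
np.random.seed(0)
df = pd.DataFrame(np.random.randn(100, 5), columns=list("ABCDE"))
df[df < 0] = np.nan

# Rows where both A and B are present (pairwise-complete observations)
both = df["A"].notna() & df["B"].notna()

# Plain np.corrcoef on the NaN-free subset of the two columns
manual = np.corrcoef(df.loc[both, "A"], df.loc[both, "B"])[0, 1]

# The corresponding entry of the pandas correlation matrix
pandas_value = df.corr().loc["A", "B"]

print(manual, pandas_value)  # the two values should agree
```

Because each entry may be based on a different number of rows, it can be worth checking how many observations a pair shares; DataFrame.corr also accepts a min_periods argument to require a minimum number of shared observations per pair.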
Answer by bers
This will work, using NumPy's masked array module (numpy.ma):
import numpy as np
import numpy.ma as ma
A = [1, 2, 3, 4, 5, np.nan]
B = [2, 3, 4, 5.25, np.nan, 100]
print(ma.corrcoef(ma.masked_invalid(A), ma.masked_invalid(B)))
It outputs:
[[1.0 0.99838143945703]
[0.99838143945703 1.0]]
Read more here: https://docs.scipy.org/doc/numpy/reference/maskedarray.generic.html
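As a small illustration of why this works (a sketch, not part of the original answer): ma.masked_invalid marks NaN and inf entries as masked, and masked-array reductions skip those entries, whereas plain NumPy reductions propagate the NaN. The variable names are illustrative.

```python
import numpy as np
import numpy.ma as ma

A = [1, 2, 3, 4, 5, np.nan]

# Wrap the data in a masked array; invalid entries (NaN, inf) get masked
a = ma.masked_invalid(A)

# The mask is True only at the position of the NaN (the last entry here)
print(a.mask)

# Masked reductions ignore the masked entry ...
print(a.mean())    # 3.0

# ... while the plain NumPy mean is poisoned by the NaN
print(np.mean(A))  # nan
```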
Answer by Marcin Kawka
In case you expect a different number of nans in each array, you may consider taking a logical AND of non-nan masks.
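import numpy as np
import numpy.ma as ma
# A and B as defined in the previous answer
A = [1, 2, 3, 4, 5, np.nan]
B = [2, 3, 4, 5.25, np.nan, 100]
a = ma.masked_invalid(A)
b = ma.masked_invalid(B)
# keep only the positions where both a and b are valid
msk = ~a.mask & ~b.mask
print(ma.corrcoef(a[msk], b[msk]))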
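The same logical-AND idea can be extended from two arrays to a whole matrix, which is what the question asks for. The following is only a sketch under stated assumptions: pairwise_corr is a hypothetical helper (not a NumPy function), it uses plain boolean masks with np.isnan instead of numpy.ma, and the small array X is made-up sample data whose first two columns reuse A and B from the earlier answer.

```python
import numpy as np

def pairwise_corr(X):
    """Correlation matrix of the columns of X, using for each pair
    only the rows where both columns are non-NaN (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[1]
    out = np.full((n, n), np.nan)
    for i in range(n):
        for j in range(i, n):
            # Logical AND of the two non-NaN masks, as in the answer above
            msk = ~np.isnan(X[:, i]) & ~np.isnan(X[:, j])
            if msk.sum() < 2:
                continue  # not enough shared observations; leave NaN
            r = np.corrcoef(X[msk, i], X[msk, j])[0, 1]
            out[i, j] = out[j, i] = r
    return out

# Made-up sample data: columns 0 and 1 are A and B from the earlier answer
X = np.array([
    [1.0, 2.0, np.nan],
    [2.0, 3.0, 1.0],
    [3.0, 4.0, 3.0],
    [4.0, 5.25, 2.0],
    [5.0, np.nan, 5.0],
    [np.nan, 100.0, 4.0],
])
R = pairwise_corr(X)
print(R)
```

The double loop is quadratic in the number of columns, so for wide data the pandas df.corr() route from the accepted answer is likely the more practical choice.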