Python numpy corrcoef - computing a correlation matrix while ignoring missing data

Question

Asked by Selah

I am trying to compute a correlation matrix of several variables. The data include some NaN values. I'm using numpy.corrcoef. For element (i, j) of the output correlation matrix, I'd like the correlation to be calculated using all rows where both variable i and variable j have a value.

This is what I have now:

In[20]: df_counties = pd.read_sql("SELECT Median_Age, Rpercent_2008, overall_LS, population_density FROM countyVotingSM2", db_eng)
In[21]: np.corrcoef(df_counties, rowvar = False)
Out[21]: 
array([[ 1.        ,         nan,         nan, -0.10998411],
       [        nan,         nan,         nan,         nan],
       [        nan,         nan,         nan,         nan],
       [-0.10998411,         nan,         nan,  1.        ]])

Too many nan's :(

Answer 1

Accepted answer by Jianxun Li

One of the main features of pandas is being NaN-friendly. To calculate the correlation matrix, simply call df_counties.corr(), which excludes missing values pair by pair. Below is an example demonstrating that df.corr() is NaN-tolerant whereas np.corrcoef is not.

import pandas as pd
import numpy as np

# data
# ==============================
np.random.seed(0)
df = pd.DataFrame(np.random.randn(100,5), columns=list('ABCDE'))
df[df < 0] = np.nan
df

         A       B       C       D       E
0   1.7641  0.4002  0.9787  2.2409  1.8676
1      NaN  0.9501     NaN     NaN  0.4106
2   0.1440  1.4543  0.7610  0.1217  0.4439
3   0.3337  1.4941     NaN  0.3131     NaN
4      NaN  0.6536  0.8644     NaN  2.2698
5      NaN  0.0458     NaN  1.5328  1.4694
6   0.1549  0.3782     NaN     NaN     NaN
7   0.1563  1.2303  1.2024     NaN     NaN
8      NaN     NaN     NaN  1.9508     NaN
9      NaN     NaN  0.7775     NaN     NaN
..     ...     ...     ...     ...     ...
90     NaN  0.8202  0.4631  0.2791  0.3389
91  2.0210     NaN     NaN  0.1993     NaN
92     NaN     NaN     NaN  0.1813     NaN
93  2.4125     NaN     NaN     NaN  0.2515
94     NaN     NaN     NaN     NaN  1.7389
95  0.9944  1.3191     NaN  1.1286  0.4960
96  0.7714  1.0294     NaN     NaN  0.8626
97     NaN  1.5133  0.5531     NaN  0.2205
98     NaN     NaN  1.1003  1.2980  2.6962
99     NaN     NaN     NaN     NaN     NaN

[100 rows x 5 columns]

# calculations
# ================================
df.corr()

        A       B       C       D       E
A  1.0000  0.2718  0.2678  0.2822  0.1016
B  0.2718  1.0000 -0.0692  0.1736 -0.1432
C  0.2678 -0.0692  1.0000 -0.3392  0.0012
D  0.2822  0.1736 -0.3392  1.0000  0.1562
E  0.1016 -0.1432  0.0012  0.1562  1.0000


np.corrcoef(df, rowvar=False)

array([[ nan,  nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan]])
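
To make the "pair by pair" behaviour concrete, here is a minimal sketch (not part of the original answer) that checks one entry of df.corr() against a manual calculation on the rows where both columns are present. The column names and the min_periods threshold are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((100, 3)), columns=list("ABC"))
df[df < 0] = np.nan  # knock out roughly half of the values

# pandas: each entry uses only the rows where both columns are present
pandas_ab = df.corr().loc["A", "B"]

# manual check: drop rows where A or B is missing, then use np.corrcoef
pair = df[["A", "B"]].dropna()
manual_ab = np.corrcoef(pair["A"], pair["B"])[0, 1]

print(pandas_ab, manual_ab)  # the two values should agree

# min_periods sets the minimum number of overlapping rows required per pair;
# with a threshold larger than the data, the off-diagonal entries become NaN
strict = df.corr(min_periods=1000)
print(strict.loc["A", "B"])
```

If some pairs of columns overlap in only a handful of rows, passing a sensible min_periods is one way to avoid reporting correlations based on very little data.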

Answer 2

Answer by bers

This works, using NumPy's masked array module (numpy.ma):

import numpy as np
import numpy.ma as ma

# np.nan is used instead of the np.NaN alias, which was removed in NumPy 2.0
A = [1, 2, 3, 4, 5, np.nan]
B = [2, 3, 4, 5.25, np.nan, 100]

print(ma.corrcoef(ma.masked_invalid(A), ma.masked_invalid(B)))

It outputs:

[[1.0 0.99838143945703]
 [0.99838143945703 1.0]]

Read more here: https://docs.scipy.org/doc/numpy/reference/maskedarray.generic.html

Answer 3

Answer by Marcin Kawka

If you expect a different number of NaNs in each array, consider taking the logical AND of the non-NaN masks, so that only positions valid in both arrays are kept.

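import numpy as np
import numpy.ma as ma

# A and B as defined in the previous answer
A = [1, 2, 3, 4, 5, np.nan]
B = [2, 3, 4, 5.25, np.nan, 100]

a = ma.masked_invalid(A)
b = ma.masked_invalid(B)

# keep only the positions where both arrays hold valid values
msk = ~a.mask & ~b.mask

print(ma.corrcoef(a[msk], b[msk]))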
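
The same logical-AND idea can be extended from two arrays to a full matrix, which is what the question asks for. The sketch below is not from the original answers: pairwise_corrcoef is a hypothetical helper name, and it assumes the variables are stored in columns (as with rowvar=False). It loops over column pairs, so it is meant for a modest number of variables.

```python
import numpy as np

def pairwise_corrcoef(x):
    """Correlation matrix of the columns of x, where entry (i, j) uses
    only the rows in which both column i and column j are non-NaN."""
    x = np.asarray(x, dtype=float)
    n_vars = x.shape[1]
    valid = ~np.isnan(x)
    out = np.full((n_vars, n_vars), np.nan)
    for i in range(n_vars):
        for j in range(i, n_vars):
            both = valid[:, i] & valid[:, j]  # logical AND of non-NaN masks
            if both.sum() < 2:
                continue  # too few overlapping rows; leave the entry as NaN
            r = np.corrcoef(x[both, i], x[both, j])[0, 1]
            out[i, j] = out[j, i] = r
    return out

# Same values as A and B in the answers above, stored as two columns
data = np.array([
    [1.0, 2.0],
    [2.0, 3.0],
    [3.0, 4.0],
    [4.0, 5.25],
    [5.0, np.nan],
    [np.nan, 100.0],
])
result = pairwise_corrcoef(data)
print(result)  # off-diagonal entry is about 0.99838, as in the masked-array answer
```

Pairs with fewer than two overlapping rows are left as NaN rather than guessed. For a DataFrame, df.corr() from the accepted answer gives the same pairwise-complete result with less code.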
