Python 计算两个多维数组之间的相关系数

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/30143417/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-19 08:00:22  来源:igfitidea点击:

Computing the correlation coefficient between two multi-dimensional arrays

pythonarraysnumpyscipycorrelation

提问by dbliss

I have two arrays that have the shapes N X Tand M X T. I'd like to compute the correlation coefficient across Tbetween every possible pair of rows nand m(from Nand M, respectively).

我有两个具有形状N X TM X T. 我想计算T每对可能的行nm(分别来自NM)之间的相关系数。

What's the fastest, most pythonic way to do this? (Looping over Nand Mwould seem to me to be neither fast nor pythonic.) I'm expecting the answer to involve numpyand/or scipy. Right now my arrays are numpyarrays, but I'm open to converting them to a different type.

什么是最快,最pythonic的方法来做到这一点?(循环NM在我看来既不快速也不pythonic。)我期待答案涉及numpy和/或scipy. 现在我的数组是numpyarrays,但我愿意将它们转换为不同的类型。

I'm expecting my output to be an array with the shape N X M.

我希望我的输出是一个形状为 的数组N X M

N.B. When I say "correlation coefficient," I mean the Pearson product-moment correlation coefficient.

注意,当我说“相关系数”时,我指的是Pearson 积矩相关系数

Here are some things to note:

这里有一些注意事项:

  • The numpyfunction correlaterequires input arrays to be one-dimensional.
  • The numpyfunction corrcoefaccepts two-dimensional arrays, but they must have the same shape.
  • The scipy.statsfunction pearsonrrequires input arrays to be one-dimensional.
  • numpy函数correlate要求输入数组是一维的。
  • numpy函数corrcoef接受二维数组,但它们必须具有相同的形状。
  • scipy.stats函数pearsonr要求输入数组是一维的。

采纳答案by Divakar

Correlation (default 'valid' case) between two 2D arrays:

两个二维数组之间的相关性(默认为“有效”情况):

You can simply use matrix-multiplication np.dotlike so -

您可以np.dot像这样简单地使用矩阵乘法-

out = np.dot(arr_one,arr_two.T)

Correlation with the default "valid"case between each pairwise row combinations (row1,row2) of the two input arrays would correspond to multiplication result at each (row1,row2) position.

"valid"两个输入数组的每个成对行组合 (row1,row2) 之间的默认情况的相关性将对应于每个 (row1,row2) 位置的乘法结果。



Row-wise Correlation Coefficient calculation for two 2D arrays:

两个二维数组的行相关系数计算:

def corr2_coeff(A, B):
    # Rowwise mean of input arrays & subtract from input arrays themeselves
    A_mA = A - A.mean(1)[:, None]
    B_mB = B - B.mean(1)[:, None]

    # Sum of squares across rows
    ssA = (A_mA**2).sum(1)
    ssB = (B_mB**2).sum(1)

    # Finally get corr coeff
    return np.dot(A_mA, B_mB.T) / np.sqrt(np.dot(ssA[:, None],ssB[None]))

This is based upon this solution to How to apply corr2 functions in Multidimentional arrays in MATLAB

这是基于此解决方案 How to apply corr2 functions in Multidimentional arrays in MATLAB

Benchmarking

基准测试

This section compares runtime performance with the proposed approach against generate_correlation_map& loopy pearsonrbased approach listed in the other answer.(taken from the function test_generate_correlation_map()without the value correctness verification code at the end of it). Please note the timings for the proposed approach also include a check at the start to check for equal number of columns in the two input arrays, as also done in that other answer. The runtimes are listed next.

本节将运行时性能与建议的方法与另一个答案中列出的基于generate_correlation_map& 循环pearsonr的方法进行比较(取自函数test_generate_correlation_map()末尾没有值正确性验证码)。请注意,所提议方法的时间安排还包括在开始时检查两个输入数组中是否有相同数量的列,正如其他答案中所做的那样。接下来列出了运行时。

Case #1:

情况1:

In [106]: A = np.random.rand(1000, 100)

In [107]: B = np.random.rand(1000, 100)

In [108]: %timeit corr2_coeff(A, B)
100 loops, best of 3: 15 ms per loop

In [109]: %timeit generate_correlation_map(A, B)
100 loops, best of 3: 19.6 ms per loop

Case #2:

案例#2:

In [110]: A = np.random.rand(5000, 100)

In [111]: B = np.random.rand(5000, 100)

In [112]: %timeit corr2_coeff(A, B)
1 loops, best of 3: 368 ms per loop

In [113]: %timeit generate_correlation_map(A, B)
1 loops, best of 3: 493 ms per loop

Case #3:

案例#3:

In [114]: A = np.random.rand(10000, 10)

In [115]: B = np.random.rand(10000, 10)

In [116]: %timeit corr2_coeff(A, B)
1 loops, best of 3: 1.29 s per loop

In [117]: %timeit generate_correlation_map(A, B)
1 loops, best of 3: 1.83 s per loop

The other loopy pearsonr basedapproach seemed too slow, but here are the runtimes for one small datasize -

另一种循环pearsonr based方法似乎太慢了,但这里是一个小数据大小的运行时 -

In [118]: A = np.random.rand(1000, 100)

In [119]: B = np.random.rand(1000, 100)

In [120]: %timeit corr2_coeff(A, B)
100 loops, best of 3: 15.3 ms per loop

In [121]: %timeit generate_correlation_map(A, B)
100 loops, best of 3: 19.7 ms per loop

In [122]: %timeit pearsonr_based(A, B)
1 loops, best of 3: 33 s per loop

回答by dbliss

@Divakar provides a great option for computing the unscaled correlation, which is what I originally asked for.

@Divakar 为计算未缩放的相关性提供了一个很好的选择,这正是我最初要求的。

In order to calculate the correlation coefficient, a bit more is required:

为了计算相关系数,需要多一点:

import numpy as np

def generate_correlation_map(x, y):
    """Correlate each n with each m.

    Parameters
    ----------
    x : np.array
      Shape N X T.

    y : np.array
      Shape M X T.

    Returns
    -------
    np.array
      N X M array in which each element is a correlation coefficient.

    """
    mu_x = x.mean(1)
    mu_y = y.mean(1)
    n = x.shape[1]
    if n != y.shape[1]:
        raise ValueError('x and y must ' +
                         'have the same number of timepoints.')
    s_x = x.std(1, ddof=n - 1)
    s_y = y.std(1, ddof=n - 1)
    cov = np.dot(x,
                 y.T) - n * np.dot(mu_x[:, np.newaxis],
                                  mu_y[np.newaxis, :])
    return cov / np.dot(s_x[:, np.newaxis], s_y[np.newaxis, :])

Here's a test of this function, which passes:

这是对这个函数的测试,它通过了:

from scipy.stats import pearsonr

def test_generate_correlation_map():
    x = np.random.rand(10, 10)
    y = np.random.rand(20, 10)
    desired = np.empty((10, 20))
    for n in range(x.shape[0]):
        for m in range(y.shape[0]):
            desired[n, m] = pearsonr(x[n, :], y[m, :])[0]
    actual = generate_correlation_map(x, y)
    np.testing.assert_array_almost_equal(actual, desired)