Python 计算两个多维数组之间的相关系数

Question

提问by dbliss

I have two arrays that have the shapes N X Tand M X T. I'd like to compute the correlation coefficient across Tbetween every possible pair of rows nand m(from Nand M, respectively).

我有两个具有形状N X T和M X T. 我想计算T每对可能的行n和m（分别来自N和M）之间的相关系数。

What's the fastest, most pythonic way to do this? (Looping over Nand Mwould seem to me to be neither fast nor pythonic.) I'm expecting the answer to involve numpyand/or scipy. Right now my arrays are numpyarrays, but I'm open to converting them to a different type.

什么是最快，最pythonic的方法来做到这一点？（循环N并M在我看来既不快速也不pythonic。）我期待答案涉及numpy和/或scipy. 现在我的数组是numpyarrays，但我愿意将它们转换为不同的类型。

I'm expecting my output to be an array with the shape N X M.

我希望我的输出是一个形状为的数组N X M。

N.B. When I say "correlation coefficient," I mean the Pearson product-moment correlation coefficient.

注意，当我说“相关系数”时，我指的是Pearson 积矩相关系数。

Here are some things to note:

这里有一些注意事项：

The numpyfunction correlaterequires input arrays to be one-dimensional.
The numpyfunction corrcoefaccepts two-dimensional arrays, but they must have the same shape.
The scipy.statsfunction pearsonrrequires input arrays to be one-dimensional.

该numpy函数correlate要求输入数组是一维的。
该numpy函数corrcoef接受二维数组，但它们必须具有相同的形状。
该scipy.stats函数pearsonr要求输入数组是一维的。

Answer 1

采纳答案by Divakar

Correlation (default 'valid' case) between two 2D arrays:

两个二维数组之间的相关性（默认为“有效”情况）：

You can simply use matrix-multiplication np.dotlike so -

您可以np.dot像这样简单地使用矩阵乘法-

out = np.dot(arr_one,arr_two.T)

Correlation with the default "valid"case between each pairwise row combinations (row1,row2) of the two input arrays would correspond to multiplication result at each (row1,row2) position.

与"valid"两个输入数组的每个成对行组合 (row1,row2) 之间的默认情况的相关性将对应于每个 (row1,row2) 位置的乘法结果。

Row-wise Correlation Coefficient calculation for two 2D arrays:

两个二维数组的行相关系数计算：

def corr2_coeff(A, B):
    # Rowwise mean of input arrays & subtract from input arrays themeselves
    A_mA = A - A.mean(1)[:, None]
    B_mB = B - B.mean(1)[:, None]

    # Sum of squares across rows
    ssA = (A_mA**2).sum(1)
    ssB = (B_mB**2).sum(1)

    # Finally get corr coeff
    return np.dot(A_mA, B_mB.T) / np.sqrt(np.dot(ssA[:, None],ssB[None]))

This is based upon this solution to How to apply corr2 functions in Multidimentional arrays in MATLAB

这是基于此解决方案 How to apply corr2 functions in Multidimentional arrays in MATLAB

Benchmarking

基准测试

This section compares runtime performance with the proposed approach against generate_correlation_map& loopy pearsonrbased approach listed in the other answer.(taken from the function test_generate_correlation_map()without the value correctness verification code at the end of it). Please note the timings for the proposed approach also include a check at the start to check for equal number of columns in the two input arrays, as also done in that other answer. The runtimes are listed next.

本节将运行时性能与建议的方法与另一个答案中列出的基于generate_correlation_map& 循环pearsonr的方法进行比较。（取自函数test_generate_correlation_map()末尾没有值正确性验证码）。请注意，所提议方法的时间安排还包括在开始时检查两个输入数组中是否有相同数量的列，正如其他答案中所做的那样。接下来列出了运行时。

Case #1:

情况1：

In [106]: A = np.random.rand(1000, 100)

In [107]: B = np.random.rand(1000, 100)

In [108]: %timeit corr2_coeff(A, B)
100 loops, best of 3: 15 ms per loop

In [109]: %timeit generate_correlation_map(A, B)
100 loops, best of 3: 19.6 ms per loop

Case #2:

案例#2：

In [110]: A = np.random.rand(5000, 100)

In [111]: B = np.random.rand(5000, 100)

In [112]: %timeit corr2_coeff(A, B)
1 loops, best of 3: 368 ms per loop

In [113]: %timeit generate_correlation_map(A, B)
1 loops, best of 3: 493 ms per loop

Case #3:

案例#3：

In [114]: A = np.random.rand(10000, 10)

In [115]: B = np.random.rand(10000, 10)

In [116]: %timeit corr2_coeff(A, B)
1 loops, best of 3: 1.29 s per loop

In [117]: %timeit generate_correlation_map(A, B)
1 loops, best of 3: 1.83 s per loop

The other loopy pearsonr basedapproach seemed too slow, but here are the runtimes for one small datasize -

另一种循环pearsonr based方法似乎太慢了，但这里是一个小数据大小的运行时 -

In [118]: A = np.random.rand(1000, 100)

In [119]: B = np.random.rand(1000, 100)

In [120]: %timeit corr2_coeff(A, B)
100 loops, best of 3: 15.3 ms per loop

In [121]: %timeit generate_correlation_map(A, B)
100 loops, best of 3: 19.7 ms per loop

In [122]: %timeit pearsonr_based(A, B)
1 loops, best of 3: 33 s per loop

Answer 2

回答by dbliss

@Divakar provides a great option for computing the unscaled correlation, which is what I originally asked for.

@Divakar 为计算未缩放的相关性提供了一个很好的选择，这正是我最初要求的。

In order to calculate the correlation coefficient, a bit more is required:

为了计算相关系数，需要多一点：

import numpy as np

def generate_correlation_map(x, y):
    """Correlate each n with each m.

    Parameters
    ----------
    x : np.array
      Shape N X T.

    y : np.array
      Shape M X T.

    Returns
    -------
    np.array
      N X M array in which each element is a correlation coefficient.

    """
    mu_x = x.mean(1)
    mu_y = y.mean(1)
    n = x.shape[1]
    if n != y.shape[1]:
        raise ValueError('x and y must ' +
                         'have the same number of timepoints.')
    s_x = x.std(1, ddof=n - 1)
    s_y = y.std(1, ddof=n - 1)
    cov = np.dot(x,
                 y.T) - n * np.dot(mu_x[:, np.newaxis],
                                  mu_y[np.newaxis, :])
    return cov / np.dot(s_x[:, np.newaxis], s_y[np.newaxis, :])

Here's a test of this function, which passes:

这是对这个函数的测试，它通过了：

from scipy.stats import pearsonr

def test_generate_correlation_map():
    x = np.random.rand(10, 10)
    y = np.random.rand(20, 10)
    desired = np.empty((10, 20))
    for n in range(x.shape[0]):
        for m in range(y.shape[0]):
            desired[n, m] = pearsonr(x[n, :], y[m, :])[0]
    actual = generate_correlation_map(x, y)
    np.testing.assert_array_almost_equal(actual, desired)

Python 计算两个多维数组之间的相关系数

提问by dbliss

采纳答案by Divakar

回答by dbliss

相关推荐

最近更新

标签

Python 计算两个多维数组之间的相关系数

提问by dbliss

采纳答案by Divakar

回答by dbliss

相关推荐

Python NameError: 名称 'true' 未定义

Python 的 type() 函数及其“if”相关问题

Python 如何仅打印出熊猫数据框的索引？

Python 如何在使用 Pandas 读取 Excel 文件时指定列名？

相关推荐

最近更新

标签