Python 计算两个多维数组之间的相关系数
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/30143417/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Computing the correlation coefficient between two multi-dimensional arrays
提问by dbliss
I have two arrays that have the shapes N X T
and M X T
. I'd like to compute the correlation coefficient across T
between every possible pair of rows n
and m
(from N
and M
, respectively).
我有两个具有形状N X T
和M X T
. 我想计算T
每对可能的行n
和m
(分别来自N
和M
)之间的相关系数。
What's the fastest, most pythonic way to do this? (Looping over N
and M
would seem to me to be neither fast nor pythonic.) I'm expecting the answer to involve numpy
and/or scipy
. Right now my arrays are numpy
array
s, but I'm open to converting them to a different type.
什么是最快,最pythonic的方法来做到这一点?(循环N
并M
在我看来既不快速也不pythonic。)我期待答案涉及numpy
和/或scipy
. 现在我的数组是numpy
array
s,但我愿意将它们转换为不同的类型。
I'm expecting my output to be an array with the shape N X M
.
我希望我的输出是一个形状为 的数组N X M
。
N.B. When I say "correlation coefficient," I mean the Pearson product-moment correlation coefficient.
注意,当我说“相关系数”时,我指的是Pearson 积矩相关系数。
Here are some things to note:
这里有一些注意事项:
- The
numpy
functioncorrelate
requires input arrays to be one-dimensional. - The
numpy
functioncorrcoef
accepts two-dimensional arrays, but they must have the same shape. - The
scipy.stats
functionpearsonr
requires input arrays to be one-dimensional.
- 该
numpy
函数correlate
要求输入数组是一维的。 - 该
numpy
函数corrcoef
接受二维数组,但它们必须具有相同的形状。 - 该
scipy.stats
函数pearsonr
要求输入数组是一维的。
采纳答案by Divakar
Correlation (default 'valid' case) between two 2D arrays:
两个二维数组之间的相关性(默认为“有效”情况):
You can simply use matrix-multiplication np.dot
like so -
您可以np.dot
像这样简单地使用矩阵乘法-
out = np.dot(arr_one,arr_two.T)
Correlation with the default "valid"
case between each pairwise row combinations (row1,row2) of the two input arrays would correspond to multiplication result at each (row1,row2) position.
与"valid"
两个输入数组的每个成对行组合 (row1,row2) 之间的默认情况的相关性将对应于每个 (row1,row2) 位置的乘法结果。
Row-wise Correlation Coefficient calculation for two 2D arrays:
两个二维数组的行相关系数计算:
def corr2_coeff(A, B):
# Rowwise mean of input arrays & subtract from input arrays themeselves
A_mA = A - A.mean(1)[:, None]
B_mB = B - B.mean(1)[:, None]
# Sum of squares across rows
ssA = (A_mA**2).sum(1)
ssB = (B_mB**2).sum(1)
# Finally get corr coeff
return np.dot(A_mA, B_mB.T) / np.sqrt(np.dot(ssA[:, None],ssB[None]))
This is based upon this solution to How to apply corr2 functions in Multidimentional arrays in MATLAB
这是基于此解决方案 How to apply corr2 functions in Multidimentional arrays in MATLAB
Benchmarking
基准测试
This section compares runtime performance with the proposed approach against generate_correlation_map
& loopy pearsonr
based approach listed in the other answer.(taken from the function test_generate_correlation_map()
without the value correctness verification code at the end of it). Please note the timings for the proposed approach also include a check at the start to check for equal number of columns in the two input arrays, as also done in that other answer. The runtimes are listed next.
本节将运行时性能与建议的方法与另一个答案中列出的基于generate_correlation_map
& 循环pearsonr
的方法进行比较。(取自函数test_generate_correlation_map()
末尾没有值正确性验证码)。请注意,所提议方法的时间安排还包括在开始时检查两个输入数组中是否有相同数量的列,正如其他答案中所做的那样。接下来列出了运行时。
Case #1:
情况1:
In [106]: A = np.random.rand(1000, 100)
In [107]: B = np.random.rand(1000, 100)
In [108]: %timeit corr2_coeff(A, B)
100 loops, best of 3: 15 ms per loop
In [109]: %timeit generate_correlation_map(A, B)
100 loops, best of 3: 19.6 ms per loop
Case #2:
案例#2:
In [110]: A = np.random.rand(5000, 100)
In [111]: B = np.random.rand(5000, 100)
In [112]: %timeit corr2_coeff(A, B)
1 loops, best of 3: 368 ms per loop
In [113]: %timeit generate_correlation_map(A, B)
1 loops, best of 3: 493 ms per loop
Case #3:
案例#3:
In [114]: A = np.random.rand(10000, 10)
In [115]: B = np.random.rand(10000, 10)
In [116]: %timeit corr2_coeff(A, B)
1 loops, best of 3: 1.29 s per loop
In [117]: %timeit generate_correlation_map(A, B)
1 loops, best of 3: 1.83 s per loop
The other loopy pearsonr based
approach seemed too slow, but here are the runtimes for one small datasize -
另一种循环pearsonr based
方法似乎太慢了,但这里是一个小数据大小的运行时 -
In [118]: A = np.random.rand(1000, 100)
In [119]: B = np.random.rand(1000, 100)
In [120]: %timeit corr2_coeff(A, B)
100 loops, best of 3: 15.3 ms per loop
In [121]: %timeit generate_correlation_map(A, B)
100 loops, best of 3: 19.7 ms per loop
In [122]: %timeit pearsonr_based(A, B)
1 loops, best of 3: 33 s per loop
回答by dbliss
@Divakar provides a great option for computing the unscaled correlation, which is what I originally asked for.
@Divakar 为计算未缩放的相关性提供了一个很好的选择,这正是我最初要求的。
In order to calculate the correlation coefficient, a bit more is required:
为了计算相关系数,需要多一点:
import numpy as np
def generate_correlation_map(x, y):
"""Correlate each n with each m.
Parameters
----------
x : np.array
Shape N X T.
y : np.array
Shape M X T.
Returns
-------
np.array
N X M array in which each element is a correlation coefficient.
"""
mu_x = x.mean(1)
mu_y = y.mean(1)
n = x.shape[1]
if n != y.shape[1]:
raise ValueError('x and y must ' +
'have the same number of timepoints.')
s_x = x.std(1, ddof=n - 1)
s_y = y.std(1, ddof=n - 1)
cov = np.dot(x,
y.T) - n * np.dot(mu_x[:, np.newaxis],
mu_y[np.newaxis, :])
return cov / np.dot(s_x[:, np.newaxis], s_y[np.newaxis, :])
Here's a test of this function, which passes:
这是对这个函数的测试,它通过了:
from scipy.stats import pearsonr
def test_generate_correlation_map():
x = np.random.rand(10, 10)
y = np.random.rand(20, 10)
desired = np.empty((10, 20))
for n in range(x.shape[0]):
for m in range(y.shape[0]):
desired[n, m] = pearsonr(x[n, :], y[m, :])[0]
actual = generate_correlation_map(x, y)
np.testing.assert_array_almost_equal(actual, desired)