Optimal way to compute pairwise mutual information using numpy in Python

Note: this content is from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not the translator). Original question: http://stackoverflow.com/questions/20491028/

Time: 2020-08-18 20:36:10 · Source: igfitidea

Optimal way to compute pairwise mutual information using numpy

Tags: python, performance, numpy, scipy, information-theory

Asked by nahsivar

For an m x n matrix, what's the optimal (fastest) way to compute the mutual information for all pairs of columns (n x n)?

By mutual information, I mean:

I(X, Y) = H(X) + H(Y) - H(X,Y)

where H(X) refers to the Shannon entropy of X.

Currently I'm using np.histogram2d and np.histogram to calculate the joint (X, Y) and individual (X or Y) counts. For a given matrix A (e.g. a 250000 x 1000 matrix of floats), I am doing a nested for loop:

    n = A.shape[1]
    for ix in np.arange(n):
        for jx in np.arange(ix + 1, n):
            matMI[ix, jx] = calc_MI(A[:, ix], A[:, jx])

Surely there must be better/faster ways to do this?

As an aside, I've also looked for mapping functions on columns (column-wise or row-wise operations) on arrays, but haven't found a good general answer yet.
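On the column-mapping aside: numpy does ship np.apply_along_axis, which maps a 1-D function over the rows or columns of an array. Note it is implemented as a Python-level loop, so it helps readability more than speed, and it only handles one-array-at-a-time functions, not pairwise ones. A tiny illustration:

```python
import numpy as np

A = np.arange(12.0).reshape(3, 4)

# Apply np.mean down each column (axis 0) and across each row (axis 1)
col_means = np.apply_along_axis(np.mean, 0, A)  # one value per column
row_means = np.apply_along_axis(np.mean, 1, A)  # one value per row

print(col_means)  # [4. 5. 6. 7.]
print(row_means)  # [1.5 5.5 9.5]
```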

Here is my full implementation, following the conventions in the Wikipedia page:

import numpy as np

def calc_MI(X, Y, bins):
    # Joint and marginal bin counts
    c_XY = np.histogram2d(X, Y, bins)[0]
    c_X = np.histogram(X, bins)[0]
    c_Y = np.histogram(Y, bins)[0]

    H_X = shan_entropy(c_X)
    H_Y = shan_entropy(c_Y)
    H_XY = shan_entropy(c_XY)

    MI = H_X + H_Y - H_XY
    return MI

def shan_entropy(c):
    # Shannon entropy (in bits) of an array of counts
    c_normalized = c / float(np.sum(c))
    c_normalized = c_normalized[np.nonzero(c_normalized)]
    H = -np.sum(c_normalized * np.log2(c_normalized))
    return H

A = np.array([[ 2.0,  140.0,  128.23, -150.5, -5.4  ],
              [ 2.4,  153.11, 130.34, -130.1, -9.5  ],
              [ 1.2,  156.9,  120.11, -110.45,-1.12 ]])

bins = 5 # ?
n = A.shape[1]
matMI = np.zeros((n, n))

for ix in np.arange(n):
    for jx in np.arange(ix+1,n):
        matMI[ix,jx] = calc_MI(A[:,ix], A[:,jx], bins)
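One small refinement to the loop above: mutual information is symmetric (I(X, Y) = I(Y, X)), so the loop correctly computes each pair only once, and the full matrix can be recovered by mirroring the upper triangle rather than looping over both orders. A self-contained sketch (repeating the definitions above):

```python
import numpy as np

def shan_entropy(c):
    # Shannon entropy (in bits) of an array of counts
    c_normalized = c / float(np.sum(c))
    c_normalized = c_normalized[np.nonzero(c_normalized)]
    return -np.sum(c_normalized * np.log2(c_normalized))

def calc_MI(X, Y, bins):
    c_XY = np.histogram2d(X, Y, bins)[0]
    c_X = np.histogram(X, bins)[0]
    c_Y = np.histogram(Y, bins)[0]
    return shan_entropy(c_X) + shan_entropy(c_Y) - shan_entropy(c_XY)

A = np.array([[2.0, 140.0,  128.23, -150.5,  -5.4],
              [2.4, 153.11, 130.34, -130.1,  -9.5],
              [1.2, 156.9,  120.11, -110.45, -1.12]])

bins = 5
n = A.shape[1]
matMI = np.zeros((n, n))
for ix in range(n):
    for jx in range(ix + 1, n):
        matMI[ix, jx] = calc_MI(A[:, ix], A[:, jx], bins)

# Mirror the upper triangle into the lower one (the diagonal stays zero,
# since I(X, X) = H(X) was never computed here)
matMI = matMI + matMI.T
```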

Although my working version with nested for loops does it at reasonable speed, I'd like to know if there is a more optimal way to apply calc_MI to all the columns of A (to calculate their pairwise mutual information)?

I'd also like to know:

  1. Whether there are efficient ways to map functions to operate on columns (or rows) of np.arrays (maybe like np.vectorize, which looks more like a decorator)?

  2. Whether there are other optimal implementations for this specific calculation (mutual information)?


Answered by Warren Weckesser

I can't suggest a faster calculation for the outer loop over the n*(n-1)/2 vectors, but your implementation of calc_MI(x, y, bins) can be simplified if you can use scipy version 0.13 or scikit-learn.

In scipy 0.13, the lambda_ argument was added to scipy.stats.chi2_contingency. This argument controls the statistic that is computed by the function. If you use lambda_="log-likelihood" (or lambda_=0), the log-likelihood ratio is returned. This is also often called the G or G² statistic. Other than a factor of 2*n (where n is the total number of samples in the contingency table), this is the mutual information. So you could implement calc_MI as:

import numpy as np
from scipy.stats import chi2_contingency

def calc_MI(x, y, bins):
    c_xy = np.histogram2d(x, y, bins)[0]
    g, p, dof, expected = chi2_contingency(c_xy, lambda_="log-likelihood")
    mi = 0.5 * g / c_xy.sum()
    return mi

The only difference between this and your implementation is that this implementation uses the natural logarithm instead of the base-2 logarithm (so it is expressing the information in "nats" instead of "bits"). If you really prefer bits, just divide mi by log(2).
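The identity behind this trick, G = 2·N·I (with I in nats), can be checked with a quick numpy-only sketch; the random data and bin count below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = x + rng.normal(size=1000)          # correlated, so MI > 0

c = np.histogram2d(x, y, bins=10)[0]   # contingency table of counts
N = c.sum()
expected = np.outer(c.sum(axis=1), c.sum(axis=0)) / N

nz = c > 0                             # 0 * log(0) terms contribute nothing
G = 2.0 * np.sum(c[nz] * np.log(c[nz] / expected[nz]))

# Mutual information (in nats) of the same binned distribution
p = c / N
px = p.sum(axis=1, keepdims=True)
py = p.sum(axis=0, keepdims=True)
mi_nats = np.sum(p[nz] * np.log(p[nz] / (px * py)[nz]))

assert np.isclose(G / (2 * N), mi_nats)  # identical up to floating point
```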

If you have (or can install) sklearn (i.e. scikit-learn), you can use sklearn.metrics.mutual_info_score, and implement calc_MI as:

import numpy as np
from sklearn.metrics import mutual_info_score

def calc_MI(x, y, bins):
    c_xy = np.histogram2d(x, y, bins)[0]
    mi = mutual_info_score(None, None, contingency=c_xy)
    return mi
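Either drop-in calc_MI slots straight into the question's nested loop. A self-contained sketch of the whole pairwise computation using the scipy-based version; the pairwise_MI wrapper and the random test matrix are just for illustration, and the result is in nats. (Note that chi2_contingency raises an error if any row or column of the contingency table is all zeros, so there should be enough samples that no marginal bin is empty.)

```python
import numpy as np
from scipy.stats import chi2_contingency

def calc_MI(x, y, bins):
    # MI in nats via the G statistic: MI = G / (2 * N)
    c_xy = np.histogram2d(x, y, bins)[0]
    g, p, dof, expected = chi2_contingency(c_xy, lambda_="log-likelihood")
    return 0.5 * g / c_xy.sum()

def pairwise_MI(A, bins=5):
    # Fill the upper triangle, then mirror it (MI is symmetric)
    n = A.shape[1]
    matMI = np.zeros((n, n))
    for ix in range(n):
        for jx in range(ix + 1, n):
            matMI[ix, jx] = calc_MI(A[:, ix], A[:, jx], bins)
    return matMI + matMI.T

rng = np.random.default_rng(0)
A = rng.random((500, 4))   # 500 samples, 4 columns
M = pairwise_MI(A)
```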