Optimal way to compute pairwise mutual information using numpy in Python

Note: this content is from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not the translator). Original question: http://stackoverflow.com/questions/20491028/

Time: 2020-08-18 20:36:10 · Source: igfitidea

Optimal way to compute pairwise mutual information using numpy

Tags: python, performance, numpy, scipy, information-theory

Asked by nahsivar

For an m x n matrix, what's the optimal (fastest) way to compute the mutual information for all pairs of columns (n x n)?

By mutual information, I mean:

I(X, Y) = H(X) + H(Y) - H(X,Y)

where H(X) refers to the Shannon entropy of X.

Currently I'm using np.histogram2d and np.histogram to calculate the joint (X, Y) and individual (X or Y) counts. For a given matrix A (e.g. a 250000 x 1000 matrix of floats), I am doing a nested for loop:

    n = A.shape[1]
    for ix in np.arange(n):
        for jx in np.arange(ix + 1, n):
            matMI[ix, jx] = calc_MI(A[:, ix], A[:, jx])

Surely there must be better/faster ways to do this?

As an aside, I've also looked for mapping functions on columns (column-wise or row-wise operations) on arrays, but haven't found a good general answer yet.
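On the column-mapping aside: numpy does ship np.apply_along_axis, which maps a 1-D function over the rows or columns of an array. Note it is implemented as a Python-level loop, so it helps readability more than speed, and it only handles one-array-at-a-time functions, not pairwise ones. A tiny illustration:

```python
import numpy as np

A = np.arange(12.0).reshape(3, 4)

# Apply np.mean down each column (axis 0) and across each row (axis 1)
col_means = np.apply_along_axis(np.mean, 0, A)  # one value per column
row_means = np.apply_along_axis(np.mean, 1, A)  # one value per row

print(col_means)  # [4. 5. 6. 7.]
print(row_means)  # [1.5 5.5 9.5]
```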

Here is my full implementation, following the conventions in the Wikipedia page:

import numpy as np

def calc_MI(X, Y, bins):
    # Joint and marginal bin counts
    c_XY = np.histogram2d(X, Y, bins)[0]
    c_X = np.histogram(X, bins)[0]
    c_Y = np.histogram(Y, bins)[0]

    H_X = shan_entropy(c_X)
    H_Y = shan_entropy(c_Y)
    H_XY = shan_entropy(c_XY)

    MI = H_X + H_Y - H_XY
    return MI

def shan_entropy(c):
    # Shannon entropy (in bits) of an array of counts
    c_normalized = c / float(np.sum(c))
    c_normalized = c_normalized[np.nonzero(c_normalized)]
    H = -np.sum(c_normalized * np.log2(c_normalized))
    return H

A = np.array([[ 2.0,  140.0,  128.23, -150.5, -5.4  ],
              [ 2.4,  153.11, 130.34, -130.1, -9.5  ],
              [ 1.2,  156.9,  120.11, -110.45,-1.12 ]])

bins = 5 # ?
n = A.shape[1]
matMI = np.zeros((n, n))

for ix in np.arange(n):
    for jx in np.arange(ix+1,n):
        matMI[ix,jx] = calc_MI(A[:,ix], A[:,jx], bins)
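One small refinement to the loop above: mutual information is symmetric (I(X, Y) = I(Y, X)), so the loop correctly computes each pair only once, and the full matrix can be recovered by mirroring the upper triangle rather than looping over both orders. A self-contained sketch (repeating the definitions above):

```python
import numpy as np

def shan_entropy(c):
    # Shannon entropy (in bits) of an array of counts
    c_normalized = c / float(np.sum(c))
    c_normalized = c_normalized[np.nonzero(c_normalized)]
    return -np.sum(c_normalized * np.log2(c_normalized))

def calc_MI(X, Y, bins):
    c_XY = np.histogram2d(X, Y, bins)[0]
    c_X = np.histogram(X, bins)[0]
    c_Y = np.histogram(Y, bins)[0]
    return shan_entropy(c_X) + shan_entropy(c_Y) - shan_entropy(c_XY)

A = np.array([[2.0, 140.0,  128.23, -150.5,  -5.4],
              [2.4, 153.11, 130.34, -130.1,  -9.5],
              [1.2, 156.9,  120.11, -110.45, -1.12]])

bins = 5
n = A.shape[1]
matMI = np.zeros((n, n))
for ix in range(n):
    for jx in range(ix + 1, n):
        matMI[ix, jx] = calc_MI(A[:, ix], A[:, jx], bins)

# Mirror the upper triangle into the lower one (the diagonal stays zero,
# since I(X, X) = H(X) was never computed here)
matMI = matMI + matMI.T
```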

Although my working version with nested for loops does it at reasonable speed, I'd like to know if there is a more optimal way to apply calc_MI to all the columns of A (to calculate their pairwise mutual information)?

I'd also like to know:

  1. Whether there are efficient ways to map functions to operate on columns (or rows) of np.arrays (maybe like np.vectorize, which looks more like a decorator)?

  2. Whether there are other optimal implementations for this specific calculation (mutual information)?


Answered by Warren Weckesser

I can't suggest a faster calculation for the outer loop over the n*(n-1)/2 vectors, but your implementation of calc_MI(x, y, bins) can be simplified if you can use scipy version 0.13 or scikit-learn.

In scipy 0.13, the lambda_ argument was added to scipy.stats.chi2_contingency. This argument controls the statistic that is computed by the function. If you use lambda_="log-likelihood" (or lambda_=0), the log-likelihood ratio is returned. This is also often called the G or G² statistic. Other than a factor of 2*n (where n is the total number of samples in the contingency table), this is the mutual information. So you could implement calc_MI as:

import numpy as np
from scipy.stats import chi2_contingency

def calc_MI(x, y, bins):
    c_xy = np.histogram2d(x, y, bins)[0]
    g, p, dof, expected = chi2_contingency(c_xy, lambda_="log-likelihood")
    mi = 0.5 * g / c_xy.sum()
    return mi

The only difference between this and your implementation is that this implementation uses the natural logarithm instead of the base-2 logarithm (so it is expressing the information in "nats" instead of "bits"). If you really prefer bits, just divide mi by log(2).
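The identity behind this trick, G = 2·N·I (with I in nats), can be checked with a quick numpy-only sketch; the random data and bin count below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = x + rng.normal(size=1000)          # correlated, so MI > 0

c = np.histogram2d(x, y, bins=10)[0]   # contingency table of counts
N = c.sum()
expected = np.outer(c.sum(axis=1), c.sum(axis=0)) / N

nz = c > 0                             # 0 * log(0) terms contribute nothing
G = 2.0 * np.sum(c[nz] * np.log(c[nz] / expected[nz]))

# Mutual information (in nats) of the same binned distribution
p = c / N
px = p.sum(axis=1, keepdims=True)
py = p.sum(axis=0, keepdims=True)
mi_nats = np.sum(p[nz] * np.log(p[nz] / (px * py)[nz]))

assert np.isclose(G / (2 * N), mi_nats)  # identical up to floating point
```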

If you have (or can install) sklearn (i.e. scikit-learn), you can use sklearn.metrics.mutual_info_score, and implement calc_MI as:

import numpy as np
from sklearn.metrics import mutual_info_score

def calc_MI(x, y, bins):
    c_xy = np.histogram2d(x, y, bins)[0]
    mi = mutual_info_score(None, None, contingency=c_xy)
    return mi
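Either drop-in calc_MI slots straight into the question's nested loop. A self-contained sketch of the whole pairwise computation using the scipy-based version; the pairwise_MI wrapper and the random test matrix are just for illustration, and the result is in nats. (Note that chi2_contingency raises an error if any row or column of the contingency table is all zeros, so there should be enough samples that no marginal bin is empty.)

```python
import numpy as np
from scipy.stats import chi2_contingency

def calc_MI(x, y, bins):
    # MI in nats via the G statistic: MI = G / (2 * N)
    c_xy = np.histogram2d(x, y, bins)[0]
    g, p, dof, expected = chi2_contingency(c_xy, lambda_="log-likelihood")
    return 0.5 * g / c_xy.sum()

def pairwise_MI(A, bins=5):
    # Fill the upper triangle, then mirror it (MI is symmetric)
    n = A.shape[1]
    matMI = np.zeros((n, n))
    for ix in range(n):
        for jx in range(ix + 1, n):
            matMI[ix, jx] = calc_MI(A[:, ix], A[:, jx], bins)
    return matMI + matMI.T

rng = np.random.default_rng(0)
A = rng.random((500, 4))   # 500 samples, 4 columns
M = pairwise_MI(A)
```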