Optimal way to compute pairwise mutual information using numpy

Note: this page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/20491028/
Asked by nahsivar
For an m x n matrix, what's the optimal (fastest) way to compute the mutual information for all pairs of columns (n x n)?
By mutual information, I mean:
I(X, Y) = H(X) + H(Y) - H(X,Y)
where H(X) refers to the Shannon entropy of X.
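(Explicitly, H(X) = -Σ p(x) log2 p(x), with p(x) estimated here from normalized histogram counts, as in the code below.)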
Currently I'm using np.histogram2d and np.histogram to calculate the joint (X,Y) and individual (X or Y) counts. For a given matrix A (e.g. a 250000 x 1000 matrix of floats), I am doing a nested for loop:
n = A.shape[1]
for ix in np.arange(n):
    for jx in np.arange(ix + 1, n):
        matMI[ix, jx] = calc_MI(A[:, ix], A[:, jx])
Surely there must be better/faster ways to do this?
As an aside, I've also looked for mapping functions on columns (column-wise or row-wise operations) on arrays, but haven't found a good general answer yet.
Here is my full implementation, following the conventions in the Wiki page:
import numpy as np

def calc_MI(X, Y, bins):
    # Joint and marginal histogram counts.
    c_XY = np.histogram2d(X, Y, bins)[0]
    c_X = np.histogram(X, bins)[0]
    c_Y = np.histogram(Y, bins)[0]

    H_X = shan_entropy(c_X)
    H_Y = shan_entropy(c_Y)
    H_XY = shan_entropy(c_XY)

    # I(X,Y) = H(X) + H(Y) - H(X,Y)
    MI = H_X + H_Y - H_XY
    return MI

def shan_entropy(c):
    # Normalize counts to probabilities and drop empty bins before taking logs.
    c_normalized = c / float(np.sum(c))
    c_normalized = c_normalized[np.nonzero(c_normalized)]
    H = -np.sum(c_normalized * np.log2(c_normalized))
    return H

A = np.array([[ 2.0,  140.0,  128.23, -150.5,  -5.4  ],
              [ 2.4,  153.11, 130.34, -130.1,  -9.5  ],
              [ 1.2,  156.9,  120.11, -110.45, -1.12 ]])

bins = 5  # ?
n = A.shape[1]
matMI = np.zeros((n, n))

for ix in np.arange(n):
    for jx in np.arange(ix + 1, n):
        matMI[ix, jx] = calc_MI(A[:, ix], A[:, jx], bins)
Although my working version with nested for loops does it at reasonable speed, I'd like to know if there is a more optimal way to apply calc_MI on all the columns of A (to calculate their pairwise mutual information)?
I'd also like to know:
1. Whether there are efficient ways to map functions to operate on columns (or rows) of np.arrays (maybe like np.vectorize, which looks more like a decorator)?
2. Whether there are other optimal implementations for this specific calculation (mutual information)? (See the sketch after this list.)
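On the second point, one easy saving in the nested loop: the marginal entropies are recomputed for every pair, so each column's entropy is calculated n-1 times. A minimal sketch that precomputes them once per column (reusing shan_entropy from above; the results match the original loop, since np.histogram(X, bins) depends only on its own column):

import numpy as np

def pairwise_MI(A, bins):
    n = A.shape[1]
    # Precompute each column's marginal entropy once instead of n-1 times.
    H = np.array([shan_entropy(np.histogram(A[:, i], bins)[0]) for i in range(n)])
    matMI = np.zeros((n, n))
    for ix in range(n):
        for jx in range(ix + 1, n):
            c_xy = np.histogram2d(A[:, ix], A[:, jx], bins)[0]
            matMI[ix, jx] = H[ix] + H[jx] - shan_entropy(c_xy)
    return matMI

On the first point, np.apply_along_axis does map a function over columns or rows, but it is a convenience wrapper around a Python loop rather than a true vectorization, so it would not speed this up.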
Answered by Warren Weckesser
I can't suggest a faster calculation for the outer loop over the n*(n-1)/2 vectors, but your implementation of calc_MI(x, y, bins) can be simplified if you can use scipy version 0.13 or scikit-learn.
In scipy 0.13, the lambda_ argument was added to scipy.stats.chi2_contingency. This argument controls the statistic that is computed by the function. If you use lambda_="log-likelihood" (or lambda_=0), the log-likelihood ratio is returned. This is also often called the G or G² statistic. Other than a factor of 2*n (where n is the total number of samples in the contingency table), this is the mutual information. So you could implement calc_MI as:
from scipy.stats import chi2_contingency

def calc_MI(x, y, bins):
    c_xy = np.histogram2d(x, y, bins)[0]
    # lambda_="log-likelihood" makes chi2_contingency return the G-statistic,
    # which equals 2*N*MI for a contingency table with N total samples.
    g, p, dof, expected = chi2_contingency(c_xy, lambda_="log-likelihood")
    mi = 0.5 * g / c_xy.sum()
    return mi
The only difference between this and your implementation is that this implementation uses the natural logarithm instead of the base-2 logarithm (so it is expressing the information in "nats" instead of "bits"). If you really prefer bits, just divide mi by log(2).
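For example, reusing the calc_MI just defined (the random data here is only illustrative; the sample should be large enough that no row or column of the 2-D histogram is empty, since chi2_contingency rejects tables whose expected frequencies contain zeros):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = x + rng.normal(size=1000)   # correlated with x, so MI > 0

mi_nats = calc_MI(x, y, 5)      # natural-log units (nats)
mi_bits = mi_nats / np.log(2)   # divide by ln(2) to express in bits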
If you have (or can install) sklearn (i.e. scikit-learn), you can use sklearn.metrics.mutual_info_score, and implement calc_MI as:
from sklearn.metrics import mutual_info_score

def calc_MI(x, y, bins):
    c_xy = np.histogram2d(x, y, bins)[0]
    # With contingency= given, the label arguments are ignored, so pass None.
    mi = mutual_info_score(None, None, contingency=c_xy)
    return mi
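As a quick sanity check (a sketch with illustrative random data), the two calc_MI variants above should agree to floating-point precision, since both return nats:

import numpy as np
from scipy.stats import chi2_contingency
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = x + rng.normal(size=1000)

c_xy = np.histogram2d(x, y, 5)[0]

# scipy route: the G-statistic is 2*N*MI (in nats).
g, p, dof, expected = chi2_contingency(c_xy, lambda_="log-likelihood")
mi_scipy = 0.5 * g / c_xy.sum()

# sklearn route: MI computed directly from the contingency table (also nats).
mi_sklearn = mutual_info_score(None, None, contingency=c_xy)

print(np.isclose(mi_scipy, mi_sklearn))   # True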

