Fastest way to compute entropy in Python

Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me) on StackOverflow.
Original question: http://stackoverflow.com/questions/15450192/
Asked by blueSurfer
In my project I need to compute the entropy of 0-1 vectors many times. Here's my code:
import numpy as np

def entropy(labels):
    """ Computes entropy of 0-1 vector. """
    n_labels = len(labels)

    if n_labels <= 1:
        return 0

    counts = np.bincount(labels)
    probs = counts[np.nonzero(counts)] / n_labels
    n_classes = len(probs)

    if n_classes <= 1:
        return 0

    return - np.sum(probs * np.log(probs)) / np.log(n_classes)
Is there a faster way?
Answer by blueSurfer
Following the suggestion from unutbu, I created a pure Python implementation.
import numpy as np
from math import log

def entropy2(labels):
    """ Computes entropy of label distribution. """
    n_labels = len(labels)

    if n_labels <= 1:
        return 0

    counts = np.bincount(labels)
    probs = counts / n_labels
    n_classes = np.count_nonzero(probs)

    if n_classes <= 1:
        return 0

    ent = 0.
    # Compute standard entropy, skipping empty bins
    # (math.log takes the base as a positional argument).
    for i in probs:
        if i > 0.:
            ent -= i * log(i, n_classes)
    return ent
The point I was missing was that labels is a large array, while probs is only 3 or 4 elements long. Using pure Python, my application is now twice as fast.
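A rough way to check this effect (a sketch, not the original benchmark; the vector size is illustrative, and both functions above are assumed to be importable from __main__):

import timeit

setup = '''
import numpy as np
from __main__ import entropy, entropy2
labels = np.random.randint(0, 2, size=100000)
'''
# a large label vector with only two classes: the per-element work is
# dominated by counting, and the log loop touches just a few probabilities
print(timeit.timeit('entropy(labels)', setup=setup, number=100))
print(timeit.timeit('entropy2(labels)', setup=setup, number=100))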
Answer by chupvl
Take a look here as well; there is a classical Shannon entropy implementation that should be a little faster than the one by JohnEntropy: http://pythonfiddle.com/shannon-entropy-calculation/
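The linked snippet is not reproduced here; a minimal classical Shannon entropy function in that spirit (a sketch, not the linked author's code) could look like:

import math
from collections import Counter

def shannon_entropy(data):
    # probability of each distinct symbol, then the usual -sum(p * log2(p))
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())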
Answer by d.b
The above answer is good, but if you need a version that can operate along different axes, here's a working implementation.
import numpy as np

def entropy(A, axis=None):
    """Computes the Shannon entropy of the elements of A. Assumes A is
    an array-like of nonnegative ints whose max value is approximately
    the number of unique values present.

    >>> a = [0, 1]
    >>> entropy(a)
    1.0
    >>> A = np.c_[a, a]
    >>> entropy(A)
    1.0
    >>> A  # doctest: +NORMALIZE_WHITESPACE
    array([[0, 0], [1, 1]])
    >>> entropy(A, axis=0)  # doctest: +NORMALIZE_WHITESPACE
    array([ 1., 1.])
    >>> entropy(A, axis=1)  # doctest: +NORMALIZE_WHITESPACE
    array([[ 0.], [ 0.]])
    >>> entropy([0, 0, 0])
    0.0
    >>> entropy([])
    0.0
    >>> entropy([5])
    0.0
    """
    if A is None or len(A) < 2:
        return 0.

    A = np.asarray(A)

    if axis is None:
        A = A.flatten()
        counts = np.bincount(A)  # needs small, non-negative ints
        counts = counts[counts > 0]
        if len(counts) == 1:
            return 0.  # avoid returning -0.0 to prevent weird doctests
        probs = counts / float(A.size)
        return -np.sum(probs * np.log2(probs))
    elif axis == 0:
        # wrap map in list() so this also works under Python 3, where map is lazy
        entropies = list(map(lambda col: entropy(col), A.T))
        return np.array(entropies)
    elif axis == 1:
        entropies = list(map(lambda row: entropy(row), A))
        return np.array(entropies).reshape((-1, 1))
    else:
        raise ValueError("unsupported axis: {}".format(axis))
Answer by joemadeus
Here's an answer that doesn't rely on numpy, either:
import math
from collections import Counter

def eta(data, unit='natural'):
    base = {
        'shannon' : 2.,
        'natural' : math.exp(1),
        'hartley' : 10.
    }

    if len(data) <= 1:
        return 0

    counts = Counter()
    for d in data:
        counts[d] += 1

    ent = 0

    probs = [float(c) / len(data) for c in counts.values()]
    for p in probs:
        if p > 0.:
            ent -= p * math.log(p, base[unit])

    return ent
This will accept any datatype you could throw at it:
>>> eta(['mary', 'had', 'a', 'little', 'lamb'])
1.6094379124341005
>>> eta([c for c in "mary had a little lamb"])
2.311097886212714
The answer provided by @Jarad includes timings as well. To that end:
import timeit

repeat_number = 1000000

e = timeit.repeat(
    stmt='''eta(labels)''',
    setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import eta''',
    repeat=3,
    number=repeat_number)
Timeit results (I believe this is ~4x faster than the best numpy approach):
print('Method: {}, Avg.: {:.6f}'.format("eta", np.array(e).mean()))
Method: eta, Avg.: 10.461799
Answer by Sanjeet Gupta
With the data as a pd.Series and scipy.stats, calculating the entropy of a given quantity is pretty straightforward:
import pandas as pd
import scipy.stats

def ent(data):
    """Calculates entropy of the passed `pd.Series`"""
    p_data = data.value_counts()           # counts occurrence of each value
    entropy = scipy.stats.entropy(p_data)  # get entropy from counts
    return entropy
Note: scipy.stats will normalize the provided data, so this doesn't need to be done explicitly, i.e. passing an array of counts works fine.
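For instance, passing raw counts and the corresponding normalized probabilities gives the same value (a quick sketch):

import scipy.stats

print(scipy.stats.entropy([4, 3, 3]))        # raw counts
print(scipy.stats.entropy([0.4, 0.3, 0.3]))  # same distribution, pre-normalized
# both print ~1.0889 (natural-log base by default)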
Answer by Ottotos
My favorite function for entropy is the following:
import numpy as np

def entropy(labels):
    prob_dict = {x: labels.count(x) / len(labels) for x in labels}
    probs = np.array(list(prob_dict.values()))
    return -probs.dot(np.log2(probs))
I am still looking for a nicer way to avoid the dict -> values -> list -> np.array conversion. I'll comment again if I find one.
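One way to skip that round trip (an alternative sketch, not the original answer's code; entropy_unique is a hypothetical name) is to let np.unique do the counting:

import numpy as np

def entropy_unique(labels):
    # np.unique returns the distinct labels and their counts directly,
    # so no intermediate dict is needed
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return -probs.dot(np.log2(probs))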
Answer by Jarad
@Sanjeet Gupta's answer is good but could be condensed. This question specifically asks about the "fastest" way, but I only see timings in one answer, so I'll post a comparison of using scipy and numpy against the original poster's entropy2 answer, with slight alterations.
Four different approaches: scipy/numpy, numpy/math, pandas/numpy, numpy
import numpy as np
from scipy.stats import entropy
from math import log, e
import pandas as pd
import timeit

def entropy1(labels, base=None):
    value, counts = np.unique(labels, return_counts=True)
    return entropy(counts, base=base)

def entropy2(labels, base=None):
    """ Computes entropy of label distribution. """
    n_labels = len(labels)

    if n_labels <= 1:
        return 0

    value, counts = np.unique(labels, return_counts=True)
    probs = counts / n_labels
    n_classes = np.count_nonzero(probs)

    if n_classes <= 1:
        return 0

    ent = 0.

    # Compute entropy
    base = e if base is None else base
    for i in probs:
        ent -= i * log(i, base)

    return ent

def entropy3(labels, base=None):
    vc = pd.Series(labels).value_counts(normalize=True, sort=False)
    base = e if base is None else base
    return -(vc * np.log(vc) / np.log(base)).sum()

def entropy4(labels, base=None):
    value, counts = np.unique(labels, return_counts=True)
    norm_counts = counts / counts.sum()
    base = e if base is None else base
    return -(norm_counts * np.log(norm_counts) / np.log(base)).sum()
Timeit operations:
repeat_number = 1000000

a = timeit.repeat(stmt='''entropy1(labels)''',
                  setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy1''',
                  repeat=3, number=repeat_number)

b = timeit.repeat(stmt='''entropy2(labels)''',
                  setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy2''',
                  repeat=3, number=repeat_number)

c = timeit.repeat(stmt='''entropy3(labels)''',
                  setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy3''',
                  repeat=3, number=repeat_number)

d = timeit.repeat(stmt='''entropy4(labels)''',
                  setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy4''',
                  repeat=3, number=repeat_number)
Timeit results:
# for loop to print out results of timeit
for approach, timeit_results in zip(['scipy/numpy', 'numpy/math', 'pandas/numpy', 'numpy'],
                                    [a, b, c, d]):
    print('Method: {}, Avg.: {:.6f}'.format(approach, np.array(timeit_results).mean()))
Method: scipy/numpy, Avg.: 63.315312
Method: numpy/math, Avg.: 49.256894
Method: pandas/numpy, Avg.: 884.644023
Method: numpy, Avg.: 60.026938
Winner: numpy/math (entropy2)
It's also worth noting that the entropy2 function above can handle both numeric and text data, e.g. entropy2(list('abcdefabacdebcab')). The original poster's answer is from 2013 and had a specific use case for binning ints, but it won't work for text.
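That is because np.unique counts arbitrary values, including strings, while np.bincount, used in the 2013 answer, only accepts non-negative integers. A quick illustration (assuming the entropy2 function defined above):

print(entropy2(list('abcdefabacdebcab')))  # works, ~1.70 nats
# np.bincount(list('abcdefabacdebcab'))    # would raise a TypeError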
Answer by kravietz
Uniformly distributed data (high entropy):
s = range(0, 256)
Shannon entropy calculation step by step:
import collections
import math

# calculate probability for each byte as number of occurrences / array length
probabilities = [n_x / len(s) for x, n_x in collections.Counter(s).items()]
# [0.00390625, 0.00390625, 0.00390625, ...]

# calculate per-character entropy fractions
e_x = [-p_x * math.log(p_x, 2) for p_x in probabilities]
# [0.03125, 0.03125, 0.03125, ...]

# sum fractions to obtain Shannon entropy
entropy = sum(e_x)
>>> entropy
8.0
One-liner (assuming import collections and import math):
def H(s): return sum([-p_x*math.log(p_x,2) for p_x in [n_x/len(s) for x,n_x in collections.Counter(s).items()]])
A proper function:
import collections
import math

def H(s):
    probabilities = [n_x / len(s) for x, n_x in collections.Counter(s).items()]
    e_x = [-p_x * math.log(p_x, 2) for p_x in probabilities]
    return sum(e_x)
Test cases, with English text taken from the CyberChef entropy estimator:
>>> H(range(0,256))
8.0
>>> H(range(0,64))
6.0
>>> H(range(0,128))
7.0
>>> H([0,1])
1.0
>>> H('Standard English text usually falls somewhere between 3.5 and 5')
4.228788210509104
Answer by Tan Duong
Here is my approach:
from collections import Counter
from scipy import stats

labels = [0, 0, 1, 1]
stats.entropy(list(Counter(labels).values()), base=2)
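For these labels the counts are [2, 2], so the result is 1.0 bit, the maximum for two equally likely classes.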
Answer by Krishna Chaitanya Gopaluni
from scipy import stats

# The original snippet passed list(Counter(labels).keys()), i.e. the distinct
# label values themselves, as the distribution. scipy.stats.entropy
# normalizes its input, so the (unnormalized) probabilities can be passed
# directly:
labels = [0.9, 0.09, 0.1]
stats.entropy(labels, base=2)

