Fastest way to compute entropy in Python

Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me) on StackOverflow.
Original question: http://stackoverflow.com/questions/15450192/
Asked by blueSurfer
In my project I need to compute the entropy of 0-1 vectors many times. Here's my code:
import numpy as np

def entropy(labels):
    """ Computes entropy of 0-1 vector. """
    n_labels = len(labels)

    if n_labels <= 1:
        return 0

    counts = np.bincount(labels)
    probs = counts[np.nonzero(counts)] / n_labels
    n_classes = len(probs)

    if n_classes <= 1:
        return 0

    return - np.sum(probs * np.log(probs)) / np.log(n_classes)
Is there a faster way?
Answer by blueSurfer
Following the suggestion from unutbu, I created a pure Python implementation.
import numpy as np
from math import log

def entropy2(labels):
    """ Computes entropy of label distribution. """
    n_labels = len(labels)

    if n_labels <= 1:
        return 0

    counts = np.bincount(labels)
    probs = counts / n_labels
    n_classes = np.count_nonzero(probs)

    if n_classes <= 1:
        return 0

    ent = 0.
    # Compute standard entropy, skipping empty bins
    # (math.log takes the base as a positional argument).
    for i in probs:
        if i > 0.:
            ent -= i * log(i, n_classes)
    return ent
The point I was missing was that labels is a large array, while probs is only 3 or 4 elements long. Using pure Python, my application is now twice as fast.
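A rough way to check this effect (a sketch, not the original benchmark; the vector size is illustrative, and both functions above are assumed to be importable from __main__):

import timeit

setup = '''
import numpy as np
from __main__ import entropy, entropy2
labels = np.random.randint(0, 2, size=100000)
'''
# a large label vector with only two classes: the per-element work is
# dominated by counting, and the log loop touches just a few probabilities
print(timeit.timeit('entropy(labels)', setup=setup, number=100))
print(timeit.timeit('entropy2(labels)', setup=setup, number=100))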
Answer by chupvl
Take a look here as well; there is a classical Shannon entropy implementation that should be a little faster than the one by JohnEntropy: http://pythonfiddle.com/shannon-entropy-calculation/
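The linked snippet is not reproduced here; a minimal classical Shannon entropy function in that spirit (a sketch, not the linked author's code) could look like:

import math
from collections import Counter

def shannon_entropy(data):
    # probability of each distinct symbol, then the usual -sum(p * log2(p))
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())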
Answer by d.b
The above answer is good, but if you need a version that can operate along different axes, here's a working implementation.
import numpy as np

def entropy(A, axis=None):
    """Computes the Shannon entropy of the elements of A. Assumes A is
    an array-like of nonnegative ints whose max value is approximately
    the number of unique values present.

    >>> a = [0, 1]
    >>> entropy(a)
    1.0
    >>> A = np.c_[a, a]
    >>> entropy(A)
    1.0
    >>> A  # doctest: +NORMALIZE_WHITESPACE
    array([[0, 0], [1, 1]])
    >>> entropy(A, axis=0)  # doctest: +NORMALIZE_WHITESPACE
    array([ 1., 1.])
    >>> entropy(A, axis=1)  # doctest: +NORMALIZE_WHITESPACE
    array([[ 0.], [ 0.]])
    >>> entropy([0, 0, 0])
    0.0
    >>> entropy([])
    0.0
    >>> entropy([5])
    0.0
    """
    if A is None or len(A) < 2:
        return 0.

    A = np.asarray(A)

    if axis is None:
        A = A.flatten()
        counts = np.bincount(A)  # needs small, non-negative ints
        counts = counts[counts > 0]
        if len(counts) == 1:
            return 0.  # avoid returning -0.0 to prevent weird doctests
        probs = counts / float(A.size)
        return -np.sum(probs * np.log2(probs))
    elif axis == 0:
        # wrap map in list() so this also works under Python 3, where map is lazy
        entropies = list(map(lambda col: entropy(col), A.T))
        return np.array(entropies)
    elif axis == 1:
        entropies = list(map(lambda row: entropy(row), A))
        return np.array(entropies).reshape((-1, 1))
    else:
        raise ValueError("unsupported axis: {}".format(axis))
Answer by joemadeus
Here's an answer that doesn't rely on numpy, either:
import math
from collections import Counter

def eta(data, unit='natural'):
    base = {
        'shannon' : 2.,
        'natural' : math.exp(1),
        'hartley' : 10.
    }

    if len(data) <= 1:
        return 0

    counts = Counter()
    for d in data:
        counts[d] += 1

    ent = 0

    probs = [float(c) / len(data) for c in counts.values()]
    for p in probs:
        if p > 0.:
            ent -= p * math.log(p, base[unit])

    return ent
This will accept any datatype you could throw at it:
>>> eta(['mary', 'had', 'a', 'little', 'lamb'])
1.6094379124341005
>>> eta([c for c in "mary had a little lamb"])
2.311097886212714
The answer provided by @Jarad includes timings as well. To that end:
import timeit

repeat_number = 1000000

e = timeit.repeat(
    stmt='''eta(labels)''',
    setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import eta''',
    repeat=3,
    number=repeat_number)
Timeit results (I believe this is ~4x faster than the best numpy approach):
print('Method: {}, Avg.: {:.6f}'.format("eta", np.array(e).mean()))
Method: eta, Avg.: 10.461799
Answer by Sanjeet Gupta
With the data as a pd.Series and scipy.stats, calculating the entropy of a given quantity is pretty straightforward:
import pandas as pd
import scipy.stats

def ent(data):
    """Calculates entropy of the passed `pd.Series`"""
    p_data = data.value_counts()           # counts occurrence of each value
    entropy = scipy.stats.entropy(p_data)  # get entropy from counts
    return entropy
Note: scipy.stats will normalize the provided data, so this doesn't need to be done explicitly, i.e. passing an array of counts works fine.
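For instance, passing raw counts and the corresponding normalized probabilities gives the same value (a quick sketch):

import scipy.stats

print(scipy.stats.entropy([4, 3, 3]))        # raw counts
print(scipy.stats.entropy([0.4, 0.3, 0.3]))  # same distribution, pre-normalized
# both print ~1.0889 (natural-log base by default)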
Answer by Ottotos
My favorite function for entropy is the following:
import numpy as np

def entropy(labels):
    prob_dict = {x: labels.count(x) / len(labels) for x in labels}
    probs = np.array(list(prob_dict.values()))
    return -probs.dot(np.log2(probs))
I am still looking for a nicer way to avoid the dict -> values -> list -> np.array conversion. I'll comment again if I find one.
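One way to skip that round trip (an alternative sketch, not the original answer's code; entropy_unique is a hypothetical name) is to let np.unique do the counting:

import numpy as np

def entropy_unique(labels):
    # np.unique returns the distinct labels and their counts directly,
    # so no intermediate dict is needed
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return -probs.dot(np.log2(probs))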
Answer by Jarad
@Sanjeet Gupta's answer is good but could be condensed. This question specifically asks about the "fastest" way, but I only see timings in one answer, so I'll post a comparison of using scipy and numpy against the original poster's entropy2 answer, with slight alterations.
Four different approaches: scipy/numpy, numpy/math, pandas/numpy, numpy
import numpy as np
from scipy.stats import entropy
from math import log, e
import pandas as pd
import timeit

def entropy1(labels, base=None):
    value, counts = np.unique(labels, return_counts=True)
    return entropy(counts, base=base)

def entropy2(labels, base=None):
    """ Computes entropy of label distribution. """
    n_labels = len(labels)

    if n_labels <= 1:
        return 0

    value, counts = np.unique(labels, return_counts=True)
    probs = counts / n_labels
    n_classes = np.count_nonzero(probs)

    if n_classes <= 1:
        return 0

    ent = 0.

    # Compute entropy
    base = e if base is None else base
    for i in probs:
        ent -= i * log(i, base)

    return ent

def entropy3(labels, base=None):
    vc = pd.Series(labels).value_counts(normalize=True, sort=False)
    base = e if base is None else base
    return -(vc * np.log(vc) / np.log(base)).sum()

def entropy4(labels, base=None):
    value, counts = np.unique(labels, return_counts=True)
    norm_counts = counts / counts.sum()
    base = e if base is None else base
    return -(norm_counts * np.log(norm_counts) / np.log(base)).sum()
Timeit operations:
repeat_number = 1000000

a = timeit.repeat(stmt='''entropy1(labels)''',
                  setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy1''',
                  repeat=3, number=repeat_number)

b = timeit.repeat(stmt='''entropy2(labels)''',
                  setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy2''',
                  repeat=3, number=repeat_number)

c = timeit.repeat(stmt='''entropy3(labels)''',
                  setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy3''',
                  repeat=3, number=repeat_number)

d = timeit.repeat(stmt='''entropy4(labels)''',
                  setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy4''',
                  repeat=3, number=repeat_number)
Timeit results:
# for loop to print out results of timeit
for approach, timeit_results in zip(['scipy/numpy', 'numpy/math', 'pandas/numpy', 'numpy'],
                                    [a, b, c, d]):
    print('Method: {}, Avg.: {:.6f}'.format(approach, np.array(timeit_results).mean()))
Method: scipy/numpy, Avg.: 63.315312
Method: numpy/math, Avg.: 49.256894
Method: pandas/numpy, Avg.: 884.644023
Method: numpy, Avg.: 60.026938
Winner: numpy/math (entropy2)
It's also worth noting that the entropy2 function above can handle both numeric and text data, e.g. entropy2(list('abcdefabacdebcab')). The original poster's answer is from 2013 and had a specific use case for binning ints, but it won't work for text.
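That is because np.unique counts arbitrary values, including strings, while np.bincount, used in the 2013 answer, only accepts non-negative integers. A quick illustration (assuming the entropy2 function defined above):

print(entropy2(list('abcdefabacdebcab')))  # works, ~1.70 nats
# np.bincount(list('abcdefabacdebcab'))    # would raise a TypeError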
Answer by kravietz
Uniformly distributed data (high entropy):
s = range(0, 256)
Shannon entropy calculation step by step:
import collections
import math

# calculate probability for each byte as number of occurrences / array length
probabilities = [n_x / len(s) for x, n_x in collections.Counter(s).items()]
# [0.00390625, 0.00390625, 0.00390625, ...]

# calculate per-character entropy fractions
e_x = [-p_x * math.log(p_x, 2) for p_x in probabilities]
# [0.03125, 0.03125, 0.03125, ...]

# sum fractions to obtain Shannon entropy
entropy = sum(e_x)
>>> entropy
8.0
One-liner (assuming import collections and import math):
def H(s): return sum([-p_x*math.log(p_x,2) for p_x in [n_x/len(s) for x,n_x in collections.Counter(s).items()]])
A proper function:
import collections
import math

def H(s):
    probabilities = [n_x / len(s) for x, n_x in collections.Counter(s).items()]
    e_x = [-p_x * math.log(p_x, 2) for p_x in probabilities]
    return sum(e_x)
Test cases, with English text taken from the CyberChef entropy estimator:
>>> H(range(0,256))
8.0
>>> H(range(0,64))
6.0
>>> H(range(0,128))
7.0
>>> H([0,1])
1.0
>>> H('Standard English text usually falls somewhere between 3.5 and 5')
4.228788210509104
Answer by Tan Duong
Here is my approach:
from collections import Counter
from scipy import stats

labels = [0, 0, 1, 1]
stats.entropy(list(Counter(labels).values()), base=2)
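For these labels the counts are [2, 2], so the result is 1.0 bit, the maximum for two equally likely classes.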
Answer by Krishna Chaitanya Gopaluni
from scipy import stats

# The original snippet passed list(Counter(labels).keys()), i.e. the distinct
# label values themselves, as the distribution. scipy.stats.entropy
# normalizes its input, so the (unnormalized) probabilities can be passed
# directly:
labels = [0.9, 0.09, 0.1]
stats.entropy(labels, base=2)

