Python 中的现代高性能布隆过滤器？

Question

提问by Parand

I'm looking for a production quality bloom filter implementation in Python to handle fairly large numbers of items (say 100M to 1B items with 0.01% false positive rate).

我正在寻找 Python 中的生产质量布隆过滤器实现来处理相当多的项目（比如 100M 到 1B 的项目，误报率为 0.01%）。

Pybloomis one option but it seems to be showing its age as it throws DeprecationWarning errors on Python 2.5 on a regular basis. Joe Gregorio also has an implementation.

Pybloom是一种选择，但它似乎已经过时了，因为它会定期在 Python 2.5 上抛出 DeprecationWarning 错误。Joe Gregorio 也有一个实现。

Requirements are fast lookup performance and stability. I'm also open to creating Python interfaces to particularly good c/c++ implementations, or even to Jython if there's a good Java implementation.

要求是快速查找性能和稳定性。我也愿意为特别好的 c/c++ 实现创建 Python 接口，如果有好的 Java 实现，甚至可以为 Jython 创建接口。

Lacking that, any recommendations on a bit array / bit vector representation that can handle ~16E9 bits?

缺少这一点，关于可以处理 ~16E9 位的位数组/位向量表示的任何建议？

Answer 1

采纳答案by Parand

Eventually I found pybloomfiltermap. I haven't used it, but it looks like it'd fit the bill.

最终我找到了pybloomfiltermap。我没用过，但看起来很合适。

Answer 2

回答by Ryan Cox

I recently went down this path as well; though it sounds like my application was slightly different. I was interested in approximating set operations on a large number of strings.

我最近也走上了这条路；虽然听起来我的应用程序略有不同。我对在大量字符串上近似设置操作很感兴趣。

You do make the key observation that a fastbit vector is required. Depending on what you want to put in your bloom filter, you may also need to give some thought to the speed of the hashing algorithm(s) used. You might find this libraryuseful. You may also want to tinker with the random number technique used below that only hashes your key a single time.

您确实做出了需要快速位向量的关键观察。根据您想要放入布隆过滤器的内容，您可能还需要考虑所使用的散列算法的速度。你可能会发现这个库很有用。您可能还想修改下面使用的随机数技术，它只对您的密钥进行一次散列。

In terms of non-Java bit array implementations:

在非 Java 位数组实现方面：

Boost has dynamic_bitset
Java has the built in BitSet

Boost 有dynamic_bitset
Java 具有内置的BitSet

I built my bloom filter using BitVector. I spent some time profiling and optimizing the library and contributing back my patches to Avi. Go to that BitVector link and scroll down to acknowledgments in v1.5 to see details. In the end, I realized that performance was not a goal of this project and decided against using it.

我使用BitVector构建了我的布隆过滤器。我花了一些时间分析和优化库并将我的补丁贡献给 Avi。转到那个 BitVector 链接并向下滚动到 v1.5 中的确认以查看详细信息。最后，我意识到性能不是这个项目的目标，并决定不使用它。

Here's some code I had lying around. I may put this up on google code at python-bloom. Suggestions welcome.

这是我躺着的一些代码。我可以把它放在 python-bloom 的谷歌代码上。欢迎提出建议。

from BitVector import BitVector
from random import Random
# get hashes from http://www.partow.net/programming/hashfunctions/index.html
from hashes import RSHash, JSHash, PJWHash, ELFHash, DJBHash


#
# [email protected] / www.asciiarmor.com
#
# copyright (c) 2008, ryan cox
# all rights reserved 
# BSD license: http://www.opensource.org/licenses/bsd-license.php
#

class BloomFilter(object):
    def __init__(self, n=None, m=None, k=None, p=None, bits=None ):
        self.m = m
        if k > 4 or k < 1:
            raise Exception('Must specify value of k between 1 and 4')
        self.k = k
        if bits:
            self.bits = bits
        else:
            self.bits = BitVector( size=m )
        self.rand = Random()
        self.hashes = []
        self.hashes.append(RSHash)
        self.hashes.append(JSHash)
        self.hashes.append(PJWHash)
        self.hashes.append(DJBHash)

        # switch between hashing techniques
        self._indexes = self._rand_indexes
        #self._indexes = self._hash_indexes

    def __contains__(self, key):
        for i in self._indexes(key): 
            if not self.bits[i]:
                return False    
        return True 

    def add(self, key):
        dupe = True 
        bits = []
        for i in self._indexes(key): 
            if dupe and not self.bits[i]:
                dupe = False
            self.bits[i] = 1
            bits.append(i)
        return dupe

    def __and__(self, filter):
        if (self.k != filter.k) or (self.m != filter.m): 
            raise Exception('Must use bloom filters created with equal k / m paramters for bitwise AND')
        return BloomFilter(m=self.m,k=self.k,bits=(self.bits & filter.bits))

    def __or__(self, filter):
        if (self.k != filter.k) or (self.m != filter.m): 
            raise Exception('Must use bloom filters created with equal k / m paramters for bitwise OR')
        return BloomFilter(m=self.m,k=self.k,bits=(self.bits | filter.bits))

    def _hash_indexes(self,key):
        ret = []
        for i in range(self.k):
            ret.append(self.hashes[i](key) % self.m)
        return ret

    def _rand_indexes(self,key):
        self.rand.seed(hash(key))
        ret = []
        for i in range(self.k):
            ret.append(self.rand.randint(0,self.m-1))
        return ret

if __name__ == '__main__':
    e = BloomFilter(m=100, k=4)
    e.add('one')
    e.add('two')
    e.add('three')
    e.add('four')
    e.add('five')        

    f = BloomFilter(m=100, k=4)
    f.add('three')
    f.add('four')
    f.add('five')
    f.add('six')
    f.add('seven')
    f.add('eight')
    f.add('nine')
    f.add("ten")        

    # test check for dupe on add
    assert not f.add('eleven') 
    assert f.add('eleven') 

    # test membership operations
    assert 'ten' in f 
    assert 'one' in e 
    assert 'ten' not in e 
    assert 'one' not in f         

    # test set based operations
    union = f | e
    intersection = f & e

    assert 'ten' in union
    assert 'one' in union 
    assert 'three' in intersection
    assert 'ten' not in intersection
    assert 'one' not in intersection

Also, in my case I found it useful to have a faster count_bits function for BitVector. Drop this code into BitVector 1.5 and it should give you a more performant bit counting method:

此外，就我而言，我发现为 BitVector 提供更快的 count_bits 函数很有用。将此代码放入 BitVector 1.5 中，它应该会为您提供性能更高的位计数方法：

def fast_count_bits( self, v ):
    bits = (
            0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
            1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
            1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
            2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
            1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
            2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
            2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
            3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
            1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
            2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
            2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
            3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
            2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
            3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
            3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
            4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8 )

    return bits[v & 0xff] + bits[(v >> 8) & 0xff] + bits[(v >> 16) & 0xff] + bits[v >> 24]

Answer 3

回答by Kazeng

In reaction to Parand, saying "common practice seems to be using something like SHA1 and split up the bits to form multiple hashes", while that may be true in the sense that it's common practice (PyBloom also uses it), it still doesn't mean it's the right thing to do ;-)

作为对 Parand 的回应，他说“通常的做法似乎是使用 SHA1 之类的东西并将位分开以形成多个散列”，虽然这可能是因为它是常见做法（PyBloom 也使用它），但它仍然没有这并不意味着这是正确的做法;-)

For a Bloom filter, the only requirement a hash function has is that its output space must be uniformly distributed given the expected input. While a cryptographic hash certainly fulfils this requirement, it's also a little bit like shooting a fly with a bazooka.

对于布隆过滤器，散列函数的唯一要求是其输出空间必须在给定预期输入的情况下均匀分布。虽然加密哈希确实满足了这个要求，但它也有点像用火箭筒射击苍蝇。

Instead try the FNV Hashwhich uses just one XOR and one multiplication per input byte, which I estimate is a few hundred times faster than SHA1 :)

而是尝试FNV 哈希，它对每个输入字节只使用一个 XOR 和一个乘法，我估计它比 SHA1 快几百倍:)

The FNV hash is not cryptographically secure, but you don't need it to be. It has slightly imperfect avalanche behaviour, but you're not using it for integrity checking either.

FNV 哈希不是加密安全的，但您不需要它。它的雪崩行为略有缺陷，但您也没有将其用于完整性检查。

About uniformity, note that the second link only did a Chi-square test for the 32-bit FNV hash. It's better to use more bits and the FNV-1 variant, which swaps the XOR and the MUL steps for better bit-dispersion. For a Bloom Filter, there's a few more catches, such as mapping the output uniformly to the index range of the bit-array. If possible, I'd say round up the size of the bit-array to the nearest power of 2 and adjust kaccordingly. That way you get better accuracy and you can use simple XOR-folding to map the range.

关于一致性，请注意，第二个链接仅对 32 位 FNV 哈希进行了卡方检验。最好使用更多位和 FNV-1 变体，它交换 XOR 和 MUL 步骤以获得更好的位分散。对于布隆过滤器，还有一些问题，例如将输出统一映射到位数组的索引范围。如果可能，我会说将位数组的大小四舍五入到最接近的 2 的幂并相应地调整k。这样，您可以获得更高的准确性，并且可以使用简单的 XOR 折叠来映射范围。

Additionally, here's a reference explaining why you don't want SHA1 (or any cryptographic hash) when you need a general purpose hash.

此外，这里有一个参考，解释了为什么在需要通用 hash时不需要 SHA1（或任何加密散列）。

Answer 4

回答by S.Lott

Look at the arraymodule.

查看数组模块。

class Bit( object ):
    def __init__( self, size ):
        self.bits= array.array('B',[0 for i in range((size+7)//8)] )
    def set( self, bit ):
        b= self.bits[bit//8]
        self.bits[bit//8] = b | 1 << (bit % 8)
    def get( self, bit ):
        b= self.bits[bit//8]
        return (b >> (bit % 8)) & 1

FWIW, all of the //8and % 8operations can be replaced with >>3and &0x07. This maylead to slightly better speed at the risk of some obscurity.

FWIW，所有的//8and% 8操作都可以替换为>>3and &0x07。这可能会导致稍微好一点的速度，但有一些模糊的风险。

Also, changing 'B'and 8to 'L'and 32should be faster on most hardware. [Changing to 'H'and 16 might be faster on some hardware, but it's doubtful.]

此外，更改'B'和8以'L'和32应在大多数硬件上更快。['H'在某些硬件上更改为和 16 可能会更快，但值得怀疑。]

Answer 5

回答by user1277476

I've put up a python bloom filter implementation at http://stromberg.dnsalias.org/~strombrg/drs-bloom-filter/

我在http://stromberg.dnsalias.org/~strombrg/drs-bloom-filter/ 提供了一个 python 布隆过滤器实现

It's in pure python, has good hash functions, good automated tests, a selection of backends (disk, array, mmap, more) and more intuitive arguments to the __init__method, so you can specify an ideal number of elements and desired maximum error rate, instead of somewhat ethereal, datastructure-specific tunables.

它使用纯 Python 编写，具有良好的散列函数、良好的自动化测试、一系列后端（磁盘、数组、mmap 等）和更直观的__init__方法参数，因此您可以指定理想的元素数量和所需的最大错误率，而不是一些空灵的、特定于数据结构的可调参数。

Answer 6

回答by Graham Lea

It's almost a decade since the most recent answers on here. And times do change.

距离这里的最新答案已经快十年了。时代确实在变化。

Looks like the most popular maintainedbloom filter package in late 2019 is now this one: https://github.com/joseph-fox/python-bloomfilter, available on PyPi as pybloom_live: https://pypi.org/project/pybloom_live/

像最流行的外观保持布隆过滤器包在2019年后期，现在是这个：https://github.com/joseph-fox/python-bloomfilter，可PyPI上的pybloom_live：https://pypi.org/project/pybloom_live /

Python 中的现代高性能布隆过滤器？

提问by Parand

采纳答案by Parand

回答by Ryan Cox

回答by Kazeng

回答by S.Lott

回答by user1277476

回答by Graham Lea

相关推荐

最近更新

标签

Python 中的现代高性能布隆过滤器？

提问by Parand

采纳答案by Parand

回答by Ryan Cox

回答by Kazeng

回答by S.Lott

回答by user1277476

回答by Graham Lea

相关推荐

Python 代码的代码覆盖率和单元测试

是否有跨平台的方式从 Python 的 OSError 获取信息？

UTF-8 latin-1 转换问题，python django

如何使用进度条在 Python 中复制文件？

相关推荐

最近更新

标签