Most efficient property to hash for numpy array

Declaration: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/16589791/
Asked by sapi
I need to be able to store a numpy array in a dict for caching purposes. Hash speed is important.
The array represents indices, so while the actual identity of the object is not important, the value is. Mutability is not a concern, as I'm only interested in the current value.
What should I hash in order to store it in a dict?
My current approach is to use str(arr.data), which is faster than md5 in my testing.
I've incorporated some examples from the answers to get an idea of relative times:
In [121]: %timeit hash(str(y))
10000 loops, best of 3: 68.7 us per loop
In [122]: %timeit hash(y.tostring())
1000000 loops, best of 3: 383 ns per loop
In [123]: %timeit hash(str(y.data))
1000000 loops, best of 3: 543 ns per loop
In [124]: %timeit y.flags.writeable = False ; hash(y.data)
1000000 loops, best of 3: 1.15 us per loop
In [125]: %timeit hash((b*y).sum())
100000 loops, best of 3: 8.12 us per loop
It would appear that for this particular use case (small arrays of indices), arr.tostring offers the best performance.
While hashing the read-only buffer is fast on its own, the overhead of setting the writeable flag actually makes it slower.
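A minimal caching-key helper distilled from these results (the helper name and the inclusion of shape and dtype in the key are my additions; note that in modern NumPy, arr.tostring() is a deprecated alias for arr.tobytes(), which is the spelling used here):

import numpy as np

def array_cache_key(arr):
    # Raw bytes alone ignore shape and dtype, so include both to avoid
    # collisions between differently shaped views of the same bytes.
    return (arr.shape, arr.dtype.str, arr.tobytes())

cache = {}
y = np.array([1, 5, 7])
cache[array_cache_key(y)] = "expensive result"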
Accepted answer by Fred Foo
You can simply hash the underlying buffer, if you make it read-only:
>>> from numpy import random
>>> a = random.randint(10, 100, 100000)
>>> a.flags.writeable = False
>>> %timeit hash(a.data)
100 loops, best of 3: 2.01 ms per loop
>>> %timeit hash(a.tostring())
100 loops, best of 3: 2.28 ms per loop
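Note that on Python 3, hashing a memoryview is restricted to read-only buffers with a byte format, so hash(a.data) raises a ValueError for int or float arrays. A sketch of a workaround, casting the view to unsigned bytes first:

>>> a.flags.writeable = False
>>> hash(a.data.cast('B'))  # a byte-format, read-only view is hashable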
For very large arrays, hash(str(a)) is a lot faster, but then it only takes a small part of the array into account, since str() truncates the printed array with an ellipsis:
>>> %timeit hash(str(a))
10000 loops, best of 3: 55.5 us per loop
>>> str(a)
'[63 30 33 ..., 96 25 60]'
Answered by Hensing
What kind of data do you have?
- array size
- does the same index occur several times in the array?
If your array consists only of a permutation of indices, you can use a base conversion:

(1, 0, 2) -> 1 * 3**0 + 0 * 3**1 + 2 * 3**2 = 19

and use 19 as the hash key via
import numpy as num

base_size = 3
base = base_size ** num.arange(base_size)        # [1, 3, 9]: the per-position weights
max_base = (base * num.arange(base_size)).sum()  # 21: the largest possible hash value
array = num.array([1, 0, 2])                     # example permutation to hash
hashed_array = (base * array).sum()              # 19
Now you can use a plain array of shape (max_base + 1,) instead of a dict in order to access the values.
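A minimal sketch of that lookup (the stored value 42.0 is just an illustration):

values = num.zeros(max_base + 1)   # one slot per possible hash value
values[hashed_array] = 42.0        # store
result = values[hashed_array]      # O(1) retrieval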
Answered by hunse
Coming late to the party, but for large arrays, I think a decent way to do it is to randomly subsample the matrix and hash that sample:
import numpy as np

def subsample_hash(a):
    # Hash a fixed pseudo-random sample of the array rather than every element.
    rng = np.random.RandomState(89)   # fixed seed: the same positions are sampled every call
    inds = rng.randint(low=0, high=a.size, size=1000)
    b = a.flat[inds]                  # fancy indexing copies the sampled values
    b.flags.writeable = False         # only needed when hashing the buffer directly
    return hash(b.tobytes())          # hash(b.data), as originally written, works only on Python 2
I think this is better than doing hash(str(a)), because the latter could confuse arrays that have unique data in the middle but zeros around the edges (str() only shows the first and last few elements).
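A quick usage sketch (the array is arbitrary test data; the result is stable within a process because the sampling seed is fixed):

a = np.random.rand(1000000)
print(subsample_hash(a))
print(subsample_hash(a))  # same value again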
Answered by Cong Ma
You can try xxhash via its Python binding. For large arrays this is much faster than hash(x.tostring()).
Example IPython session:
>>> import xxhash
>>> import numpy
>>> x = numpy.random.rand(1024 * 1024 * 16)
>>> h = xxhash.xxh64()
>>> %timeit hash(x.tostring())
1 loops, best of 3: 208 ms per loop
>>> %timeit h.update(x); h.intdigest(); h.reset()
100 loops, best of 3: 10.2 ms per loop
And by the way, on various blogs and answers posted to Stack Overflow, you'll see people using sha1 or md5 as hash functions. For performance reasons this is usually not acceptable, as those "secure" hash functions are rather slow. They're useful only if hash collision is one of the top concerns.
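A quick way to check this on your own machine (a sketch; absolute numbers will vary):

>>> import hashlib
>>> import numpy
>>> x = numpy.random.rand(1024 * 1024)
>>> %timeit hashlib.md5(x.tobytes()).digest()
>>> %timeit hash(x.tobytes())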
Nevertheless, hash collisions happen all the time. And if all you need is implementing __hash__ for data-array objects so that they can be used as keys in Python dictionaries or sets, I think it's better to concentrate on the speed of __hash__ itself and let Python handle the hash collisions [1].
[1] You may need to override __eq__ too, to help Python manage hash collisions. You would want __eq__ to return a boolean, rather than an array of booleans as is done by numpy.
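A minimal sketch of such a wrapper class (the class name HashableArray is my own; xxhash is used as in the example above):

import numpy as np
import xxhash

class HashableArray:
    """Wraps a numpy array so it can be used as a dict or set key."""
    def __init__(self, arr):
        self._arr = np.asarray(arr)
        self._hash = xxhash.xxh64(self._arr.tobytes()).intdigest()

    def __hash__(self):
        return self._hash

    def __eq__(self, other):
        # Return a single bool, not an elementwise boolean array as numpy would.
        return isinstance(other, HashableArray) and np.array_equal(self._arr, other._arr)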
Answered by James McGuigan
If your np.array() is small and used in a tight loop, then one option is to skip hash() completely and just use np.array().data.tobytes() directly as your dict key:
import numpy as np

def cached_call(function, grid, cache):
    # The raw bytes of the array serve directly as the dict key; Python hashes
    # the bytes object internally, so no explicit hash() call is needed.
    key = grid.data.tobytes()   # renamed from hash to avoid shadowing the builtin
    if key not in cache:
        cache[key] = function(grid)
    return cache[key]
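For example (np.count_nonzero stands in for an expensive function):

grid   = np.array([[True, False, True], [False, False, True]])
cache  = {}
result = cached_call(np.count_nonzero, grid, cache)  # computed
result = cached_call(np.count_nonzero, grid, cache)  # served from the cache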

