Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/21999983/

Time: 2020-09-18 09:40:52  Source: igfitidea

Fast hash for strings

python bash algorithm hash hashids

Asked by Antonio

I have a set of ASCII strings, let's say they are file paths. They could be both short and quite long.

I'm looking for an algorithm that can calculate a hash of such strings, where the hash is itself a string of fixed length, like YouTube video ids:

https://www.youtube.com/watch?v=-F-3E8pyjFo
                                ^^^^^^^^^^^

MD5 seems to be what I need, but it is critical for me to have short hash strings.

Is there a shell command or Python library that can do that?

Accepted answer by Cilyan

I guess this question is off-topic because it is opinion-based, but here is at least one hint. I know the FNV hash because The Sims 3 uses it to find resources by name across its different content packages. They use the 64-bit version, so I guess it is enough to avoid collisions in a relatively large set of reference strings. The hash is easy to implement yourself if no module satisfies you (pyfasthash has an implementation of it, for example).

To get a short string out of it, I would suggest base64 encoding. For example, this is the size of a base64-encoded 64-bit hash: nsTYVQUag88= (and you can get rid of the padding =).
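
As a rough sketch of the idea above (not necessarily the answerer's own implementation), the following assumes the FNV-1a variant with the standard 64-bit offset basis and prime, and URL-safe base64 with the padding stripped. The names fnv1a_64 and short_id are made up for illustration.

```python
import base64

# Standard 64-bit FNV parameters.
FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """Return the 64-bit FNV-1a hash of data as a non-negative integer."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h ^= byte                        # FNV-1a: XOR first ...
        h = (h * FNV64_PRIME) & MASK64   # ... then multiply, wrapping at 64 bits
    return h


def short_id(text: str) -> str:
    """Hash text and return an 11-character URL-safe string (hypothetical helper)."""
    digest = fnv1a_64(text.encode("utf-8")).to_bytes(8, "big")
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


print(short_id("/usr/local/bin/python"))
```

Unlike Python's built-in hash(), this gives the same result on every run, at the cost of a pure-Python loop over the bytes.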

Edit: I finally ran into the same problem as you, so I implemented the above idea: https://gist.github.com/Cilyan/9424144

Answer by Chris

Another option: hashids is designed to solve exactly this problem and has been ported to many languages, including Python. It's not really a hash in the sense of MD5 or SHA1, which are one-way; hashids "hashes" are reversible.

You are responsible for seeding the library with a secret value and selecting a minimum hash length.

Once that is done, the library can do two-way mapping between integers (single integers, like a simple primary key, or lists of integers, to support things like composite keys and sharding) and strings of the configured length (or slightly more). The alphabet used for generating "hashes" is fully configurable.

I have provided more details in this other answer.

Answer by Erik Aronesty

Python has a built-in hash() function that's very fast and perfect for most uses:

>>> hash("dfds")
3591916071403198536

You can then turn it into a 16-character hex string (8 bytes):

>>> hash("dfds").to_bytes(8,"big",signed=True).hex()  # signed=True: hash() can be negative

Or an N*2-character string, where N is <= 8:

>>> hashn=lambda word, N  : (hash(word)%(2**(N*8))).to_bytes(N,"big").hex()

...and so on. If you want N to be larger than 8 bytes, you can just hash twice. Python's built-in is so much faster that it's never worth using hashlib for anything unless you need security, not just collision resistance.

>>> hashnbig=lambda word, N  : ((hash(word)+2**64*hash(word+"2"))%(2**(N*8))).to_bytes(N,"big").hex()

And finally, use the URL-safe base64 encoding to make a much better string than "hex" gives you:

>>> from base64 import urlsafe_b64encode
>>> hashnbigu=lambda word, N  : urlsafe_b64encode(((hash(word)+2**64*hash(word+"2"))%(2**(N*8))).to_bytes(N,"big")).decode("utf8").rstrip("=")
>>> hashnbigu("foo",8)
'ZblnvrRqHwA'
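
The one-liners above are dense, so here is a spelled-out sketch of the same computation as a regular function, with the power-of-two constant hoisted to module level. The name short_hash and the 16-byte upper limit are illustrative assumptions; the arithmetic mirrors the lambda above.

```python
from base64 import urlsafe_b64encode

WORD_MOD = 2 ** 64  # one built-in hash() value covers 64 bits on 64-bit CPython


def short_hash(word: str, n_bytes: int = 8) -> str:
    """Return a URL-safe string built from n_bytes of Python's built-in hash().

    Two hash() calls are combined so that up to 16 bytes are available.
    The result changes between interpreter runs unless PYTHONHASHSEED is fixed.
    """
    if not 1 <= n_bytes <= 16:
        raise ValueError("n_bytes must be between 1 and 16")
    combined = hash(word) + WORD_MOD * hash(word + "2")
    # Python's % with a positive modulus always yields a non-negative result,
    # so to_bytes() never sees a negative number here.
    truncated = combined % (2 ** (n_bytes * 8))
    raw = truncated.to_bytes(n_bytes, "big")
    return urlsafe_b64encode(raw).decode("ascii").rstrip("=")


print(short_hash("foo", 8))   # 11 characters; the value differs from run to run
```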

Caveats:

  • Be warned that in Python 3.3 and up, string hashing is randomized per process by default (unless PYTHONHASHSEED is set), so hash() values differ between runs and this won't work for some use cases, such as storing the ids.

  • See https://github.com/flier/pyfasthash for fast, stable hashes that won't break your CPU for non-cryptographic applications.

  • Don't use this lambda style in real code; write it out! And stuffing things like 2**32 in your code, instead of making them constants, will slow things down a lot.

  • In the end, 8 bytes of collision resistance is OK for smaller applications: with fewer than a million entries, the birthday bound puts the odds of any collision at roughly 0.000003%. That's a 12-character base64-encoded string (11 without the padding). But it might not be enough for larger apps.
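
To put a number on the collision caveat, the usual birthday approximation can be evaluated directly. This is a small sketch that assumes the hash behaves like a uniformly random function; collision_probability is an illustrative name.

```python
import math


def collision_probability(n_items: int, hash_bits: int) -> float:
    """Approximate chance that at least two of n_items share a hash value.

    Uses the birthday approximation p ~= 1 - exp(-n(n-1) / (2 * 2**bits)).
    """
    pairs = n_items * (n_items - 1) / 2
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-pairs / 2 ** hash_bits)


for bits in (32, 48, 64):
    p = collision_probability(1_000_000, bits)
    print(f"{bits}-bit hash, 1,000,000 entries: {p:.3g}")
```

With a million entries, a 32-bit hash is almost certain to collide, while a 64-bit hash comes out at a few parts in a hundred million.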

Answer by eugecm

You could use the sum program (assuming you're on Linux), but keep in mind that the shorter the hash, the more collisions you might have. You can always truncate MD5/SHA hashes as well.
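
Truncating an MD5 digest, as suggested above, might look like the following in Python. This is only a sketch: the 8-byte default and the name short_md5_id are assumptions, and the shorter the prefix, the higher the collision risk.

```python
import base64
import hashlib


def short_md5_id(text: str, n_bytes: int = 8) -> str:
    """Return the first n_bytes of the MD5 digest as a URL-safe string."""
    digest = hashlib.md5(text.encode("utf-8")).digest()  # 16 bytes in total
    prefix = digest[:n_bytes]                            # keep only a prefix
    return base64.urlsafe_b64encode(prefix).decode("ascii").rstrip("=")


print(short_md5_id("/var/log/syslog"))
```

The result is stable across runs and machines, which makes it suitable for ids that are stored or shared.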

EDIT: Here's a list of hash functions: List of hash functions

Answer by Tim B

Something to keep in mind is that hash codes are one-way functions: you cannot use them for "video ids" because you cannot go back from the hash to the original path. Quite apart from anything else, hash collisions are quite likely, and you would end up with two hashes both pointing to the same video instead of different ones.

To create an id like the YouTube one, the easiest way is to create a unique id however you normally do that (for example, an auto key column in a database) and then map it to a unique string in a reversible way.

For example, you could take an integer id and map it to 0-9a-z in base 36, or even 0-9a-zA-Z in base 62, padding the generated string out to the desired length if the id on its own does not give enough characters.
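
A reversible base 62 mapping of this kind can be sketched in a few lines of Python. The function names and the choice of "0" as the padding character are assumptions made for illustration.

```python
import string

# 0-9, a-z, A-Z: 62 symbols in total.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)


def encode_id(number: int, min_length: int = 0) -> str:
    """Encode a non-negative integer in base 62, left-padded with '0'."""
    if number < 0:
        raise ValueError("number must be non-negative")
    chars = []
    while number:
        number, remainder = divmod(number, BASE)
        chars.append(ALPHABET[remainder])
    encoded = "".join(reversed(chars)) or ALPHABET[0]
    return encoded.rjust(min_length, ALPHABET[0])


def decode_id(text: str) -> int:
    """Invert encode_id; leading padding zeros do not change the value."""
    number = 0
    for char in text:
        number = number * BASE + ALPHABET.index(char)
    return number


print(encode_id(62, 6))        # 000010
print(decode_id("000010"))     # 62
```

Because the mapping is a plain change of base, consecutive ids produce visibly consecutive strings; it hides nothing, it only shortens.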
