C++: What's a good hash function for English words?

Notice: This page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow. Original URL: http://stackoverflow.com/questions/7700400/

Time: 2020-08-28 17:21:52  Source: igfitidea

What's a good hash function for English words?

c++ c hash

Asked by Mike G

I have a long list of English words and I would like to hash them. What would be a good hashing function? So far my hashing function sums the ASCII values of the letters and then takes the result modulo the table size. I'm looking for something efficient and simple.

Accepted answer by leonbloy

To simply sum the letters is not a good strategy because a permutation gives the same result.
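
To make the permutation problem concrete, here is a minimal sketch of the additive scheme described in the question. The function name `additive_hash` and the table size used in the example are assumptions made for illustration only.

```cpp
#include <cstddef>
#include <string>

// Additive scheme from the question: sum the character values,
// then reduce the sum modulo the table size.
// (Hypothetical name; not taken from the original post.)
std::size_t additive_hash(const std::string& word, std::size_t table_size) {
    std::size_t sum = 0;
    for (unsigned char c : word) {
        sum += c;  // addition is commutative, so letter order is lost
    }
    return sum % table_size;
}
```

With this sketch, anagrams such as "listen" and "silent", or "stop" and "pots", always land in the same bucket whatever the table size, because they contain the same letters.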

This one (djb2) is quite popular and works nicely with ASCII strings.

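/* djb2: start from 5381 and fold in each byte as hash * 33 + c. */
unsigned long hashstring(const unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    /* The extra parentheses mark the assignment as intentional;
       the loop stops at the terminating NUL byte. */
    while ((c = *str++))
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}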
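
As a usage sketch, the same djb2 recurrence can be wrapped for `std::string` and reduced to a bucket index. The names `djb2` and `bucket_index` are assumptions for illustration, and the modulo reduction is just the approach the question already uses.

```cpp
#include <cstddef>
#include <string>

// djb2 recurrence (hash * 33 + c, starting from 5381) adapted to std::string.
// (Hypothetical wrapper; the original answer takes an unsigned char pointer.)
unsigned long djb2(const std::string& word) {
    unsigned long hash = 5381;
    for (unsigned char c : word) {
        hash = hash * 33 + c;  // unsigned overflow wraps around, which is fine here
    }
    return hash;
}

// Map a word to a slot in a table with table_size buckets (table_size > 0).
std::size_t bucket_index(const std::string& word, std::size_t table_size) {
    return djb2(word) % table_size;
}
```

Unlike a plain sum, this recurrence is order-sensitive: "ab" and "ba" produce different hash values.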

If you need more alternatives and some performance measures, read here.

Added: These are general hashing functions, where the input domain is not known in advance (except perhaps for some very general assumptions: e.g. the above works slightly better with ASCII input), which is the most common scenario. If you have a known restricted domain (a fixed set of inputs) you can do better; see Fionn's answer.

Answer by Fionn

Maybe something like this would help you: http://www.gnu.org/s/gperf/

It generates an optimized (perfect) hash function for a fixed, known set of input keys.
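
To illustrate the idea of a hash tuned to a fixed key set, here is a toy sketch that brute-forces a multiplier until every key lands in its own slot. This is not gperf's algorithm or output; `poly_hash`, `find_perfect_multiplier`, the key set, and the table size are all assumptions chosen for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_set>
#include <vector>

// Polynomial string hash with a tunable multiplier (hypothetical helper).
std::uint64_t poly_hash(const std::string& word, std::uint64_t multiplier) {
    std::uint64_t h = 0;
    for (unsigned char c : word) {
        h = h * multiplier + c;
    }
    return h;
}

// Try multipliers 1..max_multiplier and return the first one for which
// all keys map to distinct slots; return nullopt if none is found.
std::optional<std::uint64_t> find_perfect_multiplier(
        const std::vector<std::string>& keys,
        std::size_t table_size,
        std::uint64_t max_multiplier) {
    for (std::uint64_t m = 1; m <= max_multiplier; ++m) {
        std::unordered_set<std::size_t> used;
        bool collision_free = true;
        for (const auto& key : keys) {
            // insert() reports false when the slot is already taken
            if (!used.insert(poly_hash(key, m) % table_size).second) {
                collision_free = false;
                break;
            }
        }
        if (collision_free) {
            return m;
        }
    }
    return std::nullopt;
}
```

For example, calling `find_perfect_multiplier({"cat", "dog", "bird", "fish"}, 7, 100)` searches for a multiplier that gives those four words distinct slots in a seven-slot table. The sketch needs C++17 for `std::optional`; a dedicated tool such as gperf is the better choice for real key sets.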

Answer by selbie

If you don't need it to be cryptographically secure, I would suggest the Murmur Hash. It's extremely fast, has high diffusion, and is easy to use.

http://en.wikipedia.org/wiki/MurmurHash

http://code.google.com/p/smhasher/wiki/MurmurHash3

If you do need a cryptographically secure hash, then I suggest SHA1 via OpenSSL.

http://www.openssl.org/docs/crypto/sha.html

Answer by slashmais

A bit late, but here is a hashing function with an extremely low collision rate in the 64-bit version below, and ~almost~ as low in the 32-bit version:

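#include <cstdint>
#include <cstring>

uint64_t slash_hash(const char *s)
//uint32_t slash_hash(const char *s)
{
    // h and u alias the same 8 bytes, so the result depends on byte order
    union { uint64_t h; uint8_t u[8]; };
    int i=0; h=strlen(s); // seed the hash with the string length
    // mix each character into one byte of h, cycling through the 8 bytes
    while (*s) { u[i%8] += *s + i + (*s >> ((h/(i+1)) % 5)); s++; i++; }
    return h; //64-bit
    //return (h+(h>>32)); //32-bit
}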

The hash values are also spread very evenly across the possible range, with no clumping that I could detect - this was checked using random strings only.
[edit]
Also tested against words extracted from local text files combined with LibreOffice dictionary/thesaurus words (English and French - more than 97000 words and constructs), with 0 collisions in 64-bit and 1 collision in 32-bit :)

(Also compared with FNV1A_Hash_Yorikke, djb2 and MurmurHash2 on the same sets: Yorikke and djb2 did not do well; slash_hash did slightly better than MurmurHash2 in all the tests.)
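
A collision test of this kind can be sketched roughly as follows. This is not the harness used for the numbers above; `count_collisions` and `sum_hash` are hypothetical names, and the additive hash is included only as a deliberately weak baseline.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

// Deliberately weak baseline: plain sum of the character values.
std::uint64_t sum_hash(const std::string& word) {
    std::uint64_t sum = 0;
    for (unsigned char c : word) {
        sum += c;
    }
    return sum;
}

// Count how many distinct words produce a hash value that an earlier,
// different word already produced. Repeated words are ignored.
std::size_t count_collisions(
        const std::vector<std::string>& words,
        const std::function<std::uint64_t(const std::string&)>& hash) {
    std::unordered_set<std::string> seen_words;
    std::unordered_set<std::uint64_t> seen_hashes;
    std::size_t collisions = 0;
    for (const auto& word : words) {
        if (!seen_words.insert(word).second) {
            continue;  // the same word again is not a collision
        }
        if (!seen_hashes.insert(hash(word)).second) {
            ++collisions;
        }
    }
    return collisions;
}
```

Passing any candidate hash function together with the same word list makes it possible to compare collision counts directly. Note that this counts collisions on the full hash value, before any reduction to a table size.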
