有一个很好的 C++ 哈希表哈希函数吗?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/628790/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-27 16:23:24  来源:igfitidea点击:

Have a good hash function for a C++ hash table?

c++hashhashtable

提问by DV.

I am in need of a performance-oriented hash function implementation in C++ for a hash table that I will be coding. I looked around already and only found questions asking what's a good hash function "in general". I've considered CRC32 (but where to find good implementation?) and a few cryptography algorithms. My table, though, has very specific requirements.

我需要一个 C++ 中面向性能的哈希函数实现,用于我将要编码的哈希表。我已经环顾四周,只发现了一些问题,询问“一般而言”什么是好的散列函数。我已经考虑过 CRC32(但在哪里可以找到好的实现?)和一些加密算法。不过,我的桌子有非常具体的要求。

Here's what the table will be like:

这是表的样子:

100,000 items max
200,000 capacity (so the load is 0.5)
hashing a 6-character string which is a part of English sentence
     examples: "become"    "and he"    ", not "

The number one priorityof my hash table is quick search (retrieval). Quick insertion is not important, but it will come along with quick search. Deletion is not important, and re-hashing is not something I'll be looking into. To handle collisions, I'll be probably using separate chainingas described here. I have already looked at this article, but would like an opinion of those who have handled such task before.

首要任务我哈希表的是快速搜索(检索)。快速插入并不重要,但它会伴随快速搜索而来。删除并不重要,重新散列不是我要研究的。为了处理冲突,我可能会使用此处所述的单独链接。我已经看过这篇文章,但想对以前处理过此类任务的人提出意见。

采纳答案by Robert Gould

Now assumming you want a hash, and want something blazing fastthat would work in your case, because your strings are just 6 chars long you could use this magic:

现在假设你想要一个散列,并且想要一些可以在你的情况下工作的快速的东西,因为你的字符串只有 6 个字符长,你可以使用这个魔法:

size_t precision = 2; //change the precision with this
size_t hash(const char* str)
{
   return (*(size_t*)str)>> precision;
}

CRC is for slowpokes ;)

CRC 用于慢戳 ;)

Explanation:This works by casting the contents of the string pointer to "look like" a size_t (int32 or int64 based on the optimal match for your hardware). So the contents of the string are interpreted as a raw number, no worries about characters anymore, and you then bit-shift this the precision needed (you tweak this number to the best performance, I've found 2 works well for hashing strings in set of a few thousands).

解释:这是通过将字符串指针的内容转换为“看起来像”一个 size_t(基于硬件的最佳匹配的 int32 或 int64)来工作的。因此,字符串的内容被解释为原始数字,不再担心字符,然后您将其移位所需的精度(您将此数字调整为最佳性能,我发现 2 适用于散列字符串一套几千)。

Also the really neat part is any decent compiler on modern hardware will hash a string like this in 1 assembly instruction, hard to beat that ;)

此外,真正整洁的部分是现代硬件上的任何体面的编译器都会在 1 条汇编指令中散列这样的字符串,很难被击败 ;)

回答by George V. Reilly

This simple polynomial works surprisingly well. I got it from Paul Larson of Microsoft Research who studied a wide variety of hash functions and hash multipliers.

这个简单的多项式效果出奇地好。我是从微软研究院的 Paul Larson 那里得到的,他研究了各种各样的散列函数和散列乘法器。

unsigned hash(const char* s, unsigned salt)
{
    unsigned h = salt;
    while (*s)
        h = h * 101 + (unsigned) *s++;
    return h;
}

saltshould be initialized to some randomlychosen value before the hashtable is created to defend against hash table attacks. If this isn't an issue for you, just use 0.

salt应在创建哈希表之前将其初始化为某个随机选择的值,以防御哈希表攻击。如果这对您来说不是问题,请使用 0。

The size of the table is important too, to minimize collisions. Sounds like yours is fine.

表的大小也很重要,以尽量减少冲突。听起来你的没问题。

回答by Ferruccio

Boost.Functional/Hashmight be of use to you. I've not tried it, so I can't vouch for its performance.

Boost.Functional/Hash可能对你有用。我没有尝试过,所以我不能保证它的性能。

Boost also has a CRC library.

Boost 还有一个CRC 库

I would look a Boost.Unorderedfirst (i.e. boost::unordered_map<>). It uses hash maps instead of binary trees for containers.

我会先看看Boost.Unordered(即 boost::unordered_map<>)。它对容器使用哈希映射而不是二叉树。

I believe some STL implementations have a hash_map<> container in the stdext namespace.

我相信一些 STL 实现在 stdext 命名空间中有一个 hash_map<> 容器。

回答by Arnold Spence

The size of your table will dictate what size hash you should use. You would like to minimize collisions of course. I'm not sure what you are specifying by max items and capacity (they seem like the same thing to me) In any case either of those numbers suggest that a 32 bit hash would be sufficient. You might get away with CRC16 (~65,000 possibilities) but you would probably have a lot of collisions to deal with. On the other hand, a collision may be quicker to deal with than than a CRC32 hash.

你的表的大小将决定你应该使用什么大小的哈希。当然,您希望尽量减少冲突。我不确定您通过最大项目和容量指定了什么(它们对我来说似乎是同一件事)无论如何,这些数字中的任何一个都表明 32 位哈希就足够了。您可能会使用 CRC16(约 65,000 种可能性),但您可能需要处理很多冲突。另一方面,处理冲突可能比处理 CRC32 哈希更快。

I would say, go with CRC32. You'll find no shortage of documentation and sample code. Since you have your maximums figured out and speed is a priority, go with an array of pointers. Use the hash to generate an index. On collision, increment index until you hit an empty bucket.. quick and simple.

我会说,使用 CRC32。您会发现不乏文档和示例代码。由于您已经确定了最大值并且速度是优先事项,因此请使用指针数组。使用哈希生成索引。在碰撞时,增加索引直到你碰到一个空桶......快速而简单。

回答by sth

Since you store english words, most of your characters will be letters and there won't be much variation in the most significant two bits of your data. Besides of that I would keep it very simple, just using XOR. After all you're not looking for cryptographic strength but just for a reasonably even distribution. Something along these lines:

由于您存储英语单词,因此您的大部分字符都是字母,并且数据的最重要的两位不会有太大变化。除此之外,我会保持非常简单,只使用 XOR。毕竟,您不是在寻找加密强度,而只是在寻找合理均匀的分布。沿着这些路线的东西:

size_t hash(const std::string &data) {
  size_t h(0);
  for (int i=0; i<data.length(); i++)
    h = (h << 6) ^ (h >> 26) ^ data[i];
  }
  return h;
}

Besides of that, have you looked at std::tr1::hash as a hashing function and/or std::tr1::unordered_map as an implementation of a hash table? Using these would probably be save much work opposed to implementing your own classes.

除此之外,您是否将 std::tr1::hash 视为散列函数和/或将 std::tr1::unordered_map 视为散列表的实现?使用这些可能会节省很多工作,而不是实现你自己的类。

回答by Bob Somers

The number one priority of my hash table is quick search (retrieval).

我的哈希表的第一要务是快速搜索(检索)。

Well then you are using the right data structure, as searching in a hash table is O(1)! :)

那么您使用的是正确的数据结构,因为在哈希表中搜索是 O(1)!:)

The CRC32 should do fine. The implementation isn't that complex, it's mainly based on XORs. Just make sure it uses a good polynomial.

CRC32 应该没问题。实现并不复杂,它主要基于异或。只要确保它使用一个好的多项式。

回答by David Norman

How about something simple:

简单的事情怎么样:

// Initialize hash lookup so that it maps the characters
// in your string to integers between 0 and 31
int hashLookup[256];

// Hash function for six character strings.
int hash(const char *str)
{
    int ret = 0, mult = 1;
    for (const char *p = str; *p; *p++, mult *= 32) {
        assert(*p >= 0 && *p < 256);
        ret += mult * hashLookup[*p];
    }

    return ret;
}

This assumes 32 bit ints. It uses 5 bits per character, so the hash value only has 30 bits in it. You could fix this, perhaps, by generating six bits for the first one or two characters. If you character set is small enough, you might not need more than 30 bits.

这假设 32 位整数。它每个字符使用 5 位,因此哈希值只有 30 位。您也许可以通过为前一个或两个字符生成六位来解决此问题。如果字符集足够小,则可能不需要超过 30 位。

回答by Robert Gould

If you need to search short strings and insertion is not an issue, maybe you could use a B-tree, or a 2-3 tree, you don't gain much by hashing in your case.

如果您需要搜索短字符串并且插入不是问题,也许您可​​以使用 B 树或 2-3 树,在您的情况下通过散列不会获得太多收益。

The way you would do this is by placing a letter in each node so you first check for the node "a", then you check "a"'s children for "p", and it's children for "p", and then "l" and then "e". In situations where you have "apple" and "apply" you need to seek to the last node, (since the only difference is in the last "e" and "y")

这样做的方法是在每个节点中放置一个字母,因此您首先检查节点“a”,然后检查“a”的子节点的“p”,以及“p”的子节点,然后“ l”然后是“e”。在您有“apple”和“apply”的情况下,您需要寻找最后一个节点,(因为唯一的区别是最后一个“e”和“y”)

But but in most cases you'll be able to get the word after a just a few steps ("xylophone" => "x"->"ylophone"), so you can optimize like this. This can be faster than hashing

但在大多数情况下,您只需几步即可获得单词(“xylophone”=>“x”->“ylophone”),因此您可以像这样优化。这可能比散列更快

回答by Raedwald

Since C++11, C++ has provided a std::hash< string >( string ). That is likely to be an efficient hashing function that provides a good distribution of hash-codesfor most strings.

从 C++11 开始,C++ 提供了一个std::hash< string >( string ). 这可能是一种高效的散列函数,可为大多数字符串提供良好的散列码分布

Furthermore, if you are thinking of implementing a hash-table, you should now be considering using a C++ std::unordered_mapinstead.

此外,如果您正在考虑实现一个哈希表,您现在应该考虑使用 C++ std::unordered_map