Simple hash functions in C

Disclaimer: the content on this page is taken from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/14409466/

Simple hash functions

Tags: c, function, hashtable, string-hashing

Asked by Hardell

I'm trying to write a C program that uses a hash table to store different words, and I could use some help.

Firstly, I create a hash table whose size is the prime number closest to the number of words I have to store, and then I use a hash function to find an address for each word. I started with the simplest function, adding the letters together, which ended up with an 88% collision rate. Then I started experimenting with the function and found that whatever I change it to, the collisions don't get lower than 35%. Right now I'm using

unsigned int stringToHash(char *word, unsigned int hashTableSize){
  unsigned int counter, hashAddress = 0;
  for (counter = 0; word[counter] != '\0'; counter++){
    hashAddress = hashAddress*word[counter] + word[counter] + counter;
  }
  return (hashAddress % hashTableSize);
}

which is just a random function that I came up with, but it gives me the best results, around a 35% collision rate.

I've been reading articles on hash functions for the past few hours and I tried a few simple ones, such as djb2, but all of them gave me even worse results (djb2 resulted in 37% collisions, which isn't much worse, but I was expecting something better rather than worse). I also don't know how to use some of the other, more complex ones, such as murmur2, because I don't know what the parameters (key, len, seed) they take in are.

Is it normal to get more than 35% collisions, even when using djb2, or am I doing something wrong? What are the key, len and seed values?

Answered by Mecki

Try sdbm:

hashAddress = 0;
for (counter = 0; word[counter] != '\0'; counter++){
    hashAddress = word[counter] + (hashAddress << 6) + (hashAddress << 16) - hashAddress;
}

Or djb2:

hashAddress = 5381;
for (counter = 0; word[counter] != '\0'; counter++){
    hashAddress = ((hashAddress << 5) + hashAddress) + word[counter];
}

Or Adler32:

uint32_t adler32(const void *buf, size_t buflength) {
     const uint8_t *buffer = (const uint8_t*)buf;

     uint32_t s1 = 1;
     uint32_t s2 = 0;

     for (size_t n = 0; n < buflength; n++) {
        s1 = (s1 + buffer[n]) % 65521;
        s2 = (s2 + s1) % 65521;
     }
     return (s2 << 16) | s1;
}

// ...

hashAddress = adler32(word, strlen(word));

None of these are really great, though. If you really want good hashes, you need something more complex, like lookup3 for example.

Note that a hashtable is expected to have plenty of collisions as soon as it is filled by more than 70-80%. This is perfectly normal and will even happen if you use a very good hash algorithm. That's why most hashtable implementations increase the capacity of the hashtable (e.g. capacity * 1.5 or even capacity * 2) as soon as you are adding something to the hashtable and the ratio size / capacity is already above 0.7 to 0.8. Increasing the capacity means a new hashtable is created with a higher capacity, all values from the current one are added to the new one (therefore they must all be rehashed, as their new index will be different in most cases), the new hashtable array replaces the old one, and the old one is released/freed. If you plan on hashing 1000 words, a hashtable capacity of at least 1250 is recommended, better 1400 or even 1500.

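To make that growth strategy concrete, here is a minimal sketch (my addition, not from the original answer; the HashTable struct, the ht_grow/ht_maybe_grow names, the 0.75 threshold and the linear-probing placement are all illustrative assumptions):

#include <stdlib.h>

unsigned int stringToHash(char *word, unsigned int hashTableSize); /* the hash from this page */

/* Hypothetical table layout: an array of char* slots, NULL meaning "empty". */
typedef struct {
    char   **slots;
    size_t   capacity;
    size_t   size;
} HashTable;

/* Create a larger slot array and rehash every stored word into it,
   because each word's index changes when the capacity changes. */
static void ht_grow(HashTable *ht) {
    size_t newCapacity = ht->capacity * 2;            /* or capacity * 1.5 */
    char **newSlots = calloc(newCapacity, sizeof(char *));
    if (newSlots == NULL) return;                     /* out of memory: keep the old table */

    for (size_t i = 0; i < ht->capacity; i++) {
        char *word = ht->slots[i];
        if (word == NULL) continue;
        size_t j = stringToHash(word, (unsigned int)newCapacity);
        while (newSlots[j] != NULL)                   /* simple linear probing, just for the sketch */
            j = (j + 1) % newCapacity;
        newSlots[j] = word;
    }
    free(ht->slots);
    ht->slots = newSlots;
    ht->capacity = newCapacity;
}

/* Call this after every insertion: grow once size/capacity exceeds ~0.75. */
static void ht_maybe_grow(HashTable *ht) {
    if ((double)ht->size / (double)ht->capacity > 0.75)
        ht_grow(ht);
}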

Hashtables are not supposed to be "filled to the brim", at least not if they shall be fast and efficient (thus they always should have spare capacity). That's the downside of hashtables: they are fast (O(1)), yet they will usually waste more space than would be necessary for storing the same data in another structure (when you store them as a sorted array, you will only need a capacity of 1000 for 1000 words; the downside is that the lookup cannot be faster than O(log n) in that case). A collision-free hashtable is not possible in most cases either way. Pretty much all hashtable implementations expect collisions to happen and usually have some kind of way to deal with them (usually collisions make the lookup somewhat slower, but the hashtable will still work and still beat other data structures in many cases).

Also note that if you are using a pretty good hash function, there is no requirement, and not even an advantage, for the hashtable to have a power-of-2 capacity if you are cropping hash values using modulo (%) in the end. The reason why many hashtable implementations always use power-of-2 capacities is that they do not use modulo; instead they use AND (&) for cropping, because an AND operation is among the fastest operations you will find on most CPUs (modulo is never faster than AND, in the best case it would be equally fast, in most cases it is a lot slower). If your hashtable uses power-of-2 sizes, you can replace any modulo with an AND operation:

x % 4  == x & 3
x % 8  == x & 7
x % 16 == x & 15
x % 32 == x & 31
...

This only works for power-of-2 sizes, though. If you use modulo, power-of-2 sizes can only buy something if the hash is a very bad hash with a very bad "bit distribution". A bad bit distribution is usually caused by hashes that do not use any kind of bit shifting (>> or <<) or any other operations that would have a similar effect as bit shifting.

I created a stripped down lookup3 implementation for you:

#include <stdint.h>
#include <stdlib.h>

#define rot(x,k) (((x)<<(k)) | ((x)>>(32-(k))))

#define mix(a,b,c) \
{ \
  a -= c;  a ^= rot(c, 4);  c += b; \
  b -= a;  b ^= rot(a, 6);  a += c; \
  c -= b;  c ^= rot(b, 8);  b += a; \
  a -= c;  a ^= rot(c,16);  c += b; \
  b -= a;  b ^= rot(a,19);  a += c; \
  c -= b;  c ^= rot(b, 4);  b += a; \
}

#define final(a,b,c) \
{ \
  c ^= b; c -= rot(b,14); \
  a ^= c; a -= rot(c,11); \
  b ^= a; b -= rot(a,25); \
  c ^= b; c -= rot(b,16); \
  a ^= c; a -= rot(c,4);  \
  b ^= a; b -= rot(a,14); \
  c ^= b; c -= rot(b,24); \
}

uint32_t lookup3 (
  const void *key,
  size_t      length,
  uint32_t    initval
) {
  uint32_t  a,b,c;
  const uint8_t  *k;
  const uint32_t *data32Bit;

  data32Bit = key;
  a = b = c = 0xdeadbeef + (((uint32_t)length)<<2) + initval;

  while (length > 12) {
    a += *(data32Bit++);
    b += *(data32Bit++);
    c += *(data32Bit++);
    mix(a,b,c);
    length -= 12;
  }

  k = (const uint8_t *)data32Bit;
  switch (length) {
    case 12: c += ((uint32_t)k[11])<<24;
    case 11: c += ((uint32_t)k[10])<<16;
    case 10: c += ((uint32_t)k[9])<<8;
    case 9 : c += k[8];
    case 8 : b += ((uint32_t)k[7])<<24;
    case 7 : b += ((uint32_t)k[6])<<16;
    case 6 : b += ((uint32_t)k[5])<<8;
    case 5 : b += k[4];
    case 4 : a += ((uint32_t)k[3])<<24;
    case 3 : a += ((uint32_t)k[2])<<16;
    case 2 : a += ((uint32_t)k[1])<<8;
    case 1 : a += k[0];
             break;
    case 0 : return c;
  }
  final(a,b,c);
  return c;
}

This code is not as highly optimized for performance as the original code, therefore it is a lot simpler. It is also not as portable as the original code, but it is portable to all major consumer platforms in use today. It also completely ignores the CPU endianness, yet that is not really an issue: it will work on big and little endian CPUs. Just keep in mind that it will not calculate the same hash for the same data on big and little endian CPUs, but that is no requirement; it will calculate a good hash on both kinds of CPUs, and it is only important that it always calculates the same hash for the same input data on a single machine.

You would use this function as follows:

unsigned int stringToHash(char *word, unsigned int hashTableSize){
  unsigned int initval;
  unsigned int hashAddress;

  initval = 12345;
  hashAddress = lookup3(word, strlen(word), initval);
  return (hashAddress%hashTableSize);
  // If hashtable is guaranteed to always have a size that is a power of 2,
  // replace the line above with the following more effective line:
  //     return (hashAddress & (hashTableSize - 1));
}

You may wonder what initval is. Well, it is whatever you want it to be. You could call it a salt. It will influence the hash values, yet the hash values will not get better or worse in quality because of this (at least not in the average case; it may lead to more or fewer collisions for very specific data, though). E.g. you can use different initval values if you want to hash the same data twice, yet each time should produce a different hash value (there is no guarantee it will, but it is rather likely if initval is different; if it creates the same value, this would be a very unlucky coincidence and you must treat it as a kind of collision). It is not advisable to use different initval values when hashing data for the same hashtable (this will rather cause more collisions on average). Another use for initval is if you want to combine a hash with some other data, in which case the already existing hash becomes initval when hashing the other data (so both the other data and the previous hash influence the outcome of the hash function). You may even set initval to 0 if you like, or pick a random value when the hashtable is created (and always use this random value for this instance of the hashtable, yet each hashtable has its own random value).

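As a small illustration of the "combine a hash with some other data" idea described above (my own sketch, not from the original answer; the struct and field names are made up, and it assumes the lookup3 implementation above is in the same file):

#include <string.h>   /* strlen; uint32_t and lookup3 come from the code above */

struct person { char firstName[32]; char lastName[32]; };  /* hypothetical record */

/* The hash of the first field is fed in as initval for the second field,
   so both fields (and the seed) influence the final hash value. */
uint32_t hashPerson(const struct person *p, uint32_t seed) {
    uint32_t h = lookup3(p->firstName, strlen(p->firstName), seed);
    return lookup3(p->lastName, strlen(p->lastName), h);
}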

A note on collisions:

Collisions are usually not such a huge problem in practice; it usually does not pay off to waste tons of memory just to avoid them. The question is rather how you are going to deal with them in an efficient way.

You said you are currently dealing with 9000 words. If you were using an unsorted array, finding a word in the array will need 4500 comparisons on average. On my system, 4500 string comparisons (assuming that words are between 3 and 20 characters long) need 38 microseconds (0.000038 seconds). So even such a simple, ineffective algorithm is fast enough for most purposes. Assuming that you are sorting the word list and use a binary search, finding a word in the array will need only 13 comparisons on average. 13 comparisons are close to nothing in terms of time; it's too little to even benchmark reliably. So if finding a word in a hashtable needs 2 to 4 comparisons, I wouldn't even waste a single second on the question of whether that may be a huge performance problem.

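(For reference: 2^13 = 8192 < 9000 <= 16384 = 2^14, so a binary search over 9000 sorted words needs at most 14 probes and roughly 13 on average, which is where the figure above comes from.)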

In your case, a sorted list with binary search may even beat a hashtable by far. Sure, 13 comparisons need more time than 2-4 comparisons; however, in case of a hashtable you must first hash the input data to perform a lookup. Hashing alone may already take longer than 13 comparisons! The better the hash, the longer it will take for the same amount of data to be hashed. So a hashtable only pays off performance-wise if you have a really huge amount of data or if you must update the data frequently (e.g. constantly adding/removing words to/from the table, since these operations are less costly for a hashtable than they are for a sorted list). The fact that a hashtable is O(1) only means that regardless how big it is, a lookup will always need approximately the same amount of time. O(log n) only means that the lookup grows logarithmically with the number of words; that means more words, slower lookup. Yet the Big-O notation says nothing about absolute speed! This is a big misunderstanding. It is not said that an O(1) algorithm always performs faster than an O(log n) one. The Big-O notation only tells you that if the O(log n) algorithm is faster for a certain number of values and you keep increasing the number of values, the O(1) algorithm will certainly overtake the O(log n) algorithm at some point of time, but your current word count may be far below that point. Without benchmarking both approaches, you cannot say which one is faster by just looking at the Big-O notation.

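If you want to try the sorted-array approach, the C standard library already provides qsort and bsearch; a minimal sketch (my addition, assuming words is an array of char* kept sorted with the same comparator):

#include <stdlib.h>   /* qsort, bsearch */
#include <string.h>   /* strcmp */

/* Comparator for an array of char*: each element is a pointer to a string. */
static int cmpWords(const void *a, const void *b) {
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* words[] must already be sorted: qsort(words, wordCount, sizeof(char *), cmpWords); */
char **findWord(char *word, char **words, size_t wordCount) {
    return bsearch(&word, words, wordCount, sizeof(char *), cmpWords);
}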

Back to collisions. What should you do if you run into a collision? If the number of collisions is small, and here I don't mean the overall number of collisions (the number of words that are colliding in the hashtable) but the per-index one (the number of words stored at the same hashtable index, so in your case maybe 2-4), the simplest approach is to store them as a linked list. If there was no collision so far for this table index, there is just a single key/value pair. If there was a collision, there is a linked list of key/value pairs. In that case your code must iterate over the linked list, verify each of the keys and return the value if it matches. Going by your numbers, this linked list won't have more than 4 entries, and doing 4 comparisons is insignificant in terms of performance. So finding the index is O(1); finding the value (or detecting that this key is not in the table) is O(n), but here n is only the number of linked list entries (so it is 4 at most).

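A minimal sketch of that linked-list (separate chaining) idea (again my own illustration, not code from the answer; Entry, table and lookup are assumed names, and insertion/memory management are omitted):

#include <string.h>   /* strcmp */

unsigned int stringToHash(char *word, unsigned int hashTableSize);  /* the hash from this page */

/* One bucket entry; words colliding at the same index form a singly linked list. */
typedef struct Entry {
    char         *key;
    int           value;
    struct Entry *next;
} Entry;

/* table points to hashTableSize Entry* slots, all initially NULL. */
int lookup(Entry **table, unsigned int hashTableSize, char *word, int *valueOut) {
    unsigned int index = stringToHash(word, hashTableSize);   /* O(1): pick the bucket */
    for (Entry *e = table[index]; e != NULL; e = e->next) {   /* walk the (short) chain */
        if (strcmp(e->key, word) == 0) {
            *valueOut = e->value;
            return 1;                                          /* found */
        }
    }
    return 0;                                                  /* not in the table */
}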

If the number of collisions rises, a linked list can become too slow, and you may instead store a dynamically sized, sorted array of key/value pairs, which allows lookups in O(log n), and again, n is only the number of keys in that array, not of all keys in the hashtable. Even if there were 100 collisions at one index, finding the right key/value pair takes at most 7 comparisons. That's still close to nothing. Despite that, if you really have 100 collisions at one index, either your hash algorithm is unsuited for your key data or the hashtable is far too small in capacity. The disadvantage of a dynamically sized, sorted array is that adding/removing keys is somewhat more work than in case of a linked list (code-wise, not necessarily performance-wise). So using a linked list is usually sufficient if you keep the number of collisions low enough, and it is almost trivial to implement such a linked list yourself in C and add it to an existing hashtable implementation.

Most hashtable implementations I have seen use such a "fallback to an alternate data structure" to deal with collisions. The disadvantage is that these require a little bit of extra memory to store the alternative data structure and a bit more code to also search for keys in that structure. There are also solutions that store collisions inside the hashtable itself and that don't require any additional memory. However, these solutions have a couple of drawbacks. The first drawback is that every collision increases the chances for even more collisions as more data is added. The second drawback is that while lookup times for keys degrade only linearly with the number of collisions so far (and as I said before, every collision leads to even more collisions as data is added), lookup times for keys not in the hashtable degrade even worse, and in the end, if you perform a lookup for a key that is not in the hashtable (yet you cannot know without performing the lookup), the lookup may take as long as a linear search over the whole hashtable (YUCK!!!). So if you can spare the extra memory, go for an alternate structure to handle collisions.

Answered by Emanuele Paolini

Firstly, I create a hash table with the size of a prime number which is the closest to the number of the words I have to store, and then I use a hash function to find an address for each word.

...

return (hashAddress%hashTableSize);

Since the number of different hashes is comparable to the number of words, you cannot expect the collision rate to be much lower.

I made a simple statistical test with a random hash (which is the best you could achieve) and found that 26% is the limiting collision rate if you have #words == #different hashes.
