Hash functions for strings in C

Note: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/7666509/


hash function for string

Tags: c, algorithm, hash, dictionary, hashtable

Asked by lilawood

I'm working on a hash table in C, and I'm testing a hash function for strings.

The first function I tried adds up the ASCII codes of the characters and takes the result modulo 100 (%100), but I got poor results with the first test data: 40 collisions for 130 words.

The final input data will contain 8,000 words (it's a dictionary stored in a file). The hash table is declared as int table[10000] and contains each word's position in a txt file.

The first question is: which is the best algorithm for hashing strings? And how do I determine the size of the hash table?

Thanks in advance!

:-)


Answered by cnicutar

I've had nice results with djb2 by Dan Bernstein.

unsigned long
hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while ((c = *str++))
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}

Answered by Jerry Coffin

First, you generally do not want to use a cryptographic hash for a hash table. An algorithm that's very fast by cryptographic standards is still excruciatingly slow by hash-table standards.

Second, you want to ensure that every bit of the input can/will affect the result. One easy way to do that is to rotate the current result by some number of bits, then XOR the current hash code with the current byte. Repeat until you reach the end of the string. Note that you generally do not want the rotation to be an even multiple of the byte size either.

For example, assuming the common case of 8-bit bytes, you might rotate by 5 bits:

/* rotate left by n bits; assumes 32-bit unsigned int */
static unsigned rol(unsigned x, unsigned n) {
    return (x << n) | (x >> (32 - n));
}

unsigned hash(char const *input) {
    unsigned result = 0x55555555;

    while (*input) {
        result ^= *input++;
        result = rol(result, 5);
    }
    return result;
}

Edit: Also note that 10000 slots is rarely a good choice for a hash table size. You usually want one of two things: you either want a prime number as the size (required to ensure correctness with some types of hash resolution) or else a power of 2 (so reducing the value to the correct range can be done with a simple bit-mask).

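As a sketch of the power-of-2 option (constants here are illustrative): with 2^k slots, the reduction is a single AND instead of a division:

```c
#define TABLE_BITS 14                      /* 16384 slots: ~0.5 load factor for 8000 words */
#define TABLE_SIZE (1u << TABLE_BITS)
#define TABLE_MASK (TABLE_SIZE - 1)

unsigned bucket_of(unsigned long h)
{
    return (unsigned)(h & TABLE_MASK);     /* equal to h % TABLE_SIZE for powers of 2 */
}
```

The mask trick works only because TABLE_SIZE is a power of two; with a prime-sized table you would keep the modulo.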

Answered by RushPL

Wikipedia shows a nice string hash function called the Jenkins One-At-A-Time hash. It also quotes improved versions of this hash.

uint32_t jenkins_one_at_a_time_hash(char *key, size_t len)
{
    uint32_t hash, i;
    for(hash = i = 0; i < len; ++i)
    {
        hash += key[i];
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);
    return hash;
}

Answered by Nick Johnson

There are a number of existing hash table implementations for C, from the C standard library's hcreate/hdestroy/hsearch to those in APR and glib, which also provide prebuilt hash functions. I'd highly recommend using those rather than inventing your own hash table or hash function; they've been optimized heavily for common use cases.

If your dataset is static, however, your best solution is probably to use a perfect hash. gperf will generate a perfect hash for you for a given dataset.

Answered by Wolfgang Brehm

djb2 has 317 collisions for this 466k English dictionary, while MurmurHash has none for 64-bit hashes and 21 for 32-bit hashes (around 25 is to be expected for 466k random 32-bit hashes). My recommendation is to use MurmurHash if available; it is very fast, because it takes in several bytes at a time. But if you need a simple and short hash function to copy and paste into your project, I'd recommend using Murmur's one-byte-at-a-time version:

static inline uint32_t MurmurOAAT32(const char *key)
{
  uint32_t h = 3323198485ul;
  for (; *key; ++key) {
    h ^= (unsigned char)*key;   /* cast avoids sign extension for non-ASCII input */
    h *= 0x5bd1e995;
    h ^= h >> 15;
  }
  return h;
}

static inline uint64_t MurmurOAAT64(const char *key)
{
  uint64_t h = 525201411107845655ull;
  for (; *key; ++key) {
    h ^= (unsigned char)*key;
    h *= 0x5bd1e9955bd1e995ull;
    h ^= h >> 47;
  }
  return h;
}

The optimal size of a hash table is, in short, as large as possible while still fitting into memory. Because we usually don't know, or don't want to look up, how much memory is available, and it might even change, the optimal hash table size is roughly 2x the expected number of elements to be stored in the table. Allocating much more than that will make your hash table faster, but at rapidly diminishing returns; making it smaller than that will make it exponentially slower. This is because there is a non-linear trade-off between space and time complexity for hash tables, with an optimal load factor of about 2 - sqrt(2) = 0.58, apparently.
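Following that rule of thumb, a small sketch of deriving the table size from the expected element count (rounding 2×n up to the next power of 2 so a cheap mask reduction still works; the function name is illustrative):

```c
#include <stddef.h>

size_t pick_table_size(size_t expected)
{
    size_t size = 1;
    while (size < expected * 2)        /* target a load factor of about 0.5 */
        size <<= 1;
    return size;
}
```

For the question's 8,000 words this yields 16384 slots, close to the int table[10000] the asker declared.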

Answered by Pascal Cuoq

First, are 40 collisions for 130 words hashed into 0..99 actually bad? You can't expect perfect hashing if you are not taking steps specifically for it to happen. An ordinary hash function won't have fewer collisions than a random generator most of the time.

A hash function with a good reputation is MurmurHash3.


Finally, regarding the size of the hash table, it really depends on what kind of hash table you have in mind, especially whether buckets are extensible or one-slot. If buckets are extensible, again there is a choice: you choose the average bucket length for the memory/speed constraints that you have.

Answered by Xiaoning Bian

I have tried these hash functions and got the following results. I have about 960^3 entries, each 64 bytes long, consisting of 64 chars in different orders, with 32-bit hash values. Code from here.

Hash function    | collision rate | how many minutes to finish
==============================================================
MurmurHash3      |           6.?% |                      4m15s
Jenkins One..    |           6.1% |                      6m54s   
Bob, 1st in link |          6.16% |                      5m34s
SuperFastHash    |            10% |                      4m58s
bernstein        |            20% |       14s only finish 1/20
one_at_a_time    |          6.16% |                       7m5s
crc              |          6.16% |                      7m56s

One strange thing is that almost all the hash functions have a 6% collision rate for my data.

Answered by Gabriel Staples

Though djb2, as presented on StackOverflow by cnicutar, is almost certainly better, I think it's worth showing the K&R hashes too:

1) Apparently a terrible hash algorithm, as presented in K&R 1st edition (source):

unsigned long hash(unsigned char *str)
{
    unsigned long hash = 0;
    int c;

    while ((c = *str++))
        hash += c;

    return hash;
}

2) Probably a pretty decent hash algorithm, as presented in K&R 2nd edition (verified by me on p. 144 of the book). NB: be sure to remove % HASHSIZE from the return statement if you plan on doing the modulo sizing-to-your-array-length outside the hash algorithm. Also, I recommend you make the return and "hashval" type unsigned long instead of the simple unsigned (int).

unsigned hash(char *s)
{
    unsigned hashval;

    for (hashval = 0; *s != '\0'; s++)
        hashval = *s + 31*hashval;
    return hashval % HASHSIZE;
}

Note that it's clear from the two algorithms that one reason the 1st edition hash is so terrible is that it does NOT take string character order into account, so hash("ab") would return the same value as hash("ba"). This is not so with the 2nd edition hash, however, which (much better!) returns two different values for those strings.

The GCC C++11 hashing functions used for unordered_map (a hash table template) and unordered_set (a hash set template) appear to be as follows.

Code:

// Implementation of Murmur hash for 32-bit size_t.
// (unaligned_load is a libstdc++ helper that reads 4 bytes from buf.)
size_t _Hash_bytes(const void* ptr, size_t len, size_t seed)
{
  const size_t m = 0x5bd1e995;
  size_t hash = seed ^ len;
  const char* buf = static_cast<const char*>(ptr);

  // Mix 4 bytes at a time into the hash.
  while (len >= 4)
  {
    size_t k = unaligned_load(buf);
    k *= m;
    k ^= k >> 24;
    k *= m;
    hash *= m;
    hash ^= k;
    buf += 4;
    len -= 4;
  }

  // Handle the last few bytes of the input array.
  switch (len)
  {
    case 3:
      hash ^= static_cast<unsigned char>(buf[2]) << 16;
      [[gnu::fallthrough]];
    case 2:
      hash ^= static_cast<unsigned char>(buf[1]) << 8;
      [[gnu::fallthrough]];
    case 1:
      hash ^= static_cast<unsigned char>(buf[0]);
      hash *= m;
  };

  // Do a few final mixes of the hash.
  hash ^= hash >> 13;
  hash *= m;
  hash ^= hash >> 15;
  return hash;
}

Answered by Michael Nett

One thing I've used with good results is the following (I don't know if it's been mentioned already because I can't remember its name).

You precompute a table T with a random number for each character in your key's alphabet [0,255]. You hash your key 'k0 k1 k2 ... kN' by computing T[k0] xor T[k1] xor ... xor T[kN]. You can easily show that this is as random as your random number generator, and it is computationally very feasible; if you really run into a very bad instance with lots of collisions, you can just repeat the whole thing with a fresh batch of random numbers.
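A minimal sketch of that scheme (sometimes called tabulation or Zobrist-style hashing; the names here are illustrative). One caveat worth noting: exactly as described, a plain XOR of per-character values ignores character order, so "ab" and "ba" collide; variants that use one random table per key position avoid this.

```c
#include <stdint.h>
#include <stdlib.h>

static uint32_t T[256];                 /* one random word per byte value */

void tab_init(unsigned seed)
{
    srand(seed);
    for (int i = 0; i < 256; i++)       /* rand() may yield as few as 15 bits */
        T[i] = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
}

uint32_t tab_hash(const char *key)
{
    uint32_t h = 0;
    for (; *key; key++)                 /* XOR is order-insensitive, as noted above */
        h ^= T[(unsigned char)*key];
    return h;
}
```

Re-running tab_init with a new seed is the "fresh batch of random numbers" escape hatch the answer describes.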