C++ 使用 hash_map 时，在 stl 字符串上使用的最佳散列算法是什么？

Question

提问by PiNoYBoY82

I've found the standard hashing function on VS2005 is painfully slow when trying to achieve high performance look ups. What are some good examples of fast and efficient hashing algorithms that should void most collisions?

我发现 VS2005 上的标准散列函数在尝试实现高性能查找时非常缓慢。有哪些快速有效的散列算法可以避免大多数冲突的好例子？

Answer 1

回答by George V. Reilly

I worked with Paul Larsonof Microsoft Research on some hashtable implementations. He investigated a number of string hashing functions on a variety of datasets and found that a simple multiply by 101 and add loop worked surprisingly well.

我与Microsoft Research 的Paul Larson合作研究了一些哈希表实现。他在各种数据集上研究了许多字符串散列函数，发现简单的乘以 101 和加法循环的效果出奇地好。

unsigned int
hash(
    const char* s,
    unsigned int seed = 0)
{
    unsigned int hash = seed;
    while (*s)
    {
        hash = hash * 101  +  *s++;
    }
    return hash;
}

Answer 2

回答by Dark Shikari

From some old code of mine:

从我的一些旧代码：

/* magic numbers from http://www.isthe.com/chongo/tech/comp/fnv/ */
static const size_t InitialFNV = 2166136261U;
static const size_t FNVMultiple = 16777619;

/* Fowler / Noll / Vo (FNV) Hash */
size_t myhash(const string &s)
{
    size_t hash = InitialFNV;
    for(size_t i = 0; i < s.length(); i++)
    {
        hash = hash ^ (s[i]);       /* xor  the low 8 bits */
        hash = hash * FNVMultiple;  /* multiply by the magic number */
    }
    return hash;
}

Its fast. Really freaking fast.

它很快。真他妈快。

Answer 3

回答by David Pierre

Boost has an boost::hashlibrary which can provides some basic hash functions for most common types.

Boost 有一个boost::hash库，它可以为大多数常见类型提供一些基本的哈希函数。

Answer 4

回答by Nils Pipenbrinck

That always depends on your data-set.

这始终取决于您的数据集。

I for one had surprisingly good results by using the CRC32 of the string. Works very good with a wide range of different input sets.

我通过使用字符串的 CRC32 获得了令人惊讶的好结果。适用于各种不同的输入集。

Lots of good CRC32 implementations are easy to find on the net.

在网上很容易找到许多好的 CRC32 实现。

Edit:Almost forgot: This page has a nice hash-function shootout with performance numbers and test-data:

编辑：差点忘了：这个页面有一个很好的散列函数枪战，包括性能数据和测试数据：

http://smallcode.weblogs.us/<-- further down the page.

http://smallcode.weblogs.us/<-- 在页面下方。

Answer 5

回答by SquareCog

I've use the Jenkins hash to write a Bloom filter library, it has great performance.

我已经使用 Jenkins 哈希编写了一个 Bloom 过滤器库，它具有很好的性能。

Details and code are available here: http://burtleburtle.net/bob/c/lookup3.c

详细信息和代码可在此处获得：http: //burtleburtle.net/bob/c/lookup3.c

This is what Perl uses for its hashing operation, fwiw.

这是 Perl 用于其散列操作 fwiw 的内容。

Answer 6

回答by bk1e

If you are hashing a fixed set of words, the best hash function is often a perfect hash function. However, they generally require that the set of words you are trying to hash is known at compile time. Detection of keywords in a lexer(and translation of keywords to tokens) is a common usage of perfect hash functions generated with tools such as gperf. A perfect hash also lets you replace hash_mapwith a simple array or vector.

如果您正在对一组固定的单词进行散列，最好的散列函数通常是完美的散列函数。但是，它们通常要求您尝试散列的一组单词在编译时是已知的。在词法分析器中检测关键字（以及将关键字转换为标记）是使用gperf等工具生成的完美散列函数的常见用法。完美的散列还可以让您hash_map用简单的数组或vector.

If you're not hashing a fixed set of words, then obviously this doesn't apply.

如果您不散列一组固定的单词，那么显然这不适用。

Answer 7

回答by Brian

One classic suggestion for a string hash is to step through the letters one by one adding their ascii/unicode values to an accumulator, each time multiplying the accumulator by a prime number. (allowing overflow on the hash value)

字符串哈希的一个经典建议是将字母一个一个地添加到累加器中，每次将累加器乘以一个素数。（允许散列值溢出）

  template <> struct myhash{};

  template <> struct myhash<string>
    {
    size_t operator()(string &to_hash) const
      {
      const char * in = to_hash.c_str();
      size_t out=0;
      while(NULL != *in)
        {
        out*= 53; //just a prime number
        out+= *in;
        ++in;
        }
      return out;
      }
    };

  hash_map<string, int, myhash<string> > my_hash_map;

It's hard to get faster than that without throwing out data. If you know your strings can be differentiated by only a few characters and not their whole content, you can do faster.

在不丢弃数据的情况下，很难获得比这更快的速度。如果您知道您的字符串只能通过几个字符而不是它们的全部内容来区分，那么您可以做得更快。

You might try caching the hash value better by creating a new subclass of basic_string that remembers its hash value, if the value gets calculated too often. hash_map should be doing that internally, though.

如果值计算过于频繁，您可以尝试通过创建一个记住其哈希值的 basic_string 的新子类来更好地缓存哈希值。不过，hash_map 应该在内部这样做。

Answer 8

回答by Josh S

I did a little searching, and funny thing, Paul Larson's little algorithm showed up here http://www.strchr.com/hash_functionsas having the least collisions of any tested in a number of conditions, and it's very fast for one that it's unrolled or table driven.

我做了一些搜索，有趣的事情，Paul Larson 的小算法在这里出现 http://www.strchr.com/hash_functions作为在许多条件下测试的冲突最少，而且它的速度非常快展开或表驱动。

Larson's being the simple multiply by 101 and add loop above.

Larson 是简单的乘以 101 并在上面添加循环。

Answer 9

回答by George V. Reilly

Python 3.4 includes a new hash algorithm based on SipHash. PEP 456is very informative.

Python 3.4 包含一个基于SipHash的新哈希算法。PEP 456信息量很大。

Answer 10

回答by philix

From Hash Functions all the way down:

从哈希函数一直向下：

MurmurHashgot quite popular, at least in game developer circles, as a “general hash function”.
It's a fine choice, but let's see later if we can generally do better. Another fine choice, especially if you know more about your data than “it's gonna be an unknown number of bytes”, is to roll your own (e.g. see Won Chun's replies, or Rune's modified xxHash/Murmur that are specialized for 4-byte keys etc.). If you know your data, always try to see whether that knowledge can be used for good effect!

MurmurHash作为“通用哈希函数”变得非常流行，至少在游戏开发者圈子中是如此。
这是一个不错的选择，但让我们稍后看看我们是否可以做得更好。另一个不错的选择，特别是如果您对数据的了解比“它将是未知字节数”更多时，是滚动您自己的（例如，参见 Won Chun 的回复，或 Rune 修改后的 xxHash/Murmur，专门用于 4 字节密钥等等。）。如果您知道您的数据，请始终尝试查看该知识是否可以产生良好的效果！

Without more information I would recommend MurmurHashas a general purpose non-cryptographic hash function. For small strings (of the size of the average identifier in programs) the very simple and famous djb2and FNVare very good.

如果没有更多信息，我会推荐MurmurHash作为通用非加密哈希函数。对于小字符串（程序中的平均标识符的大小），非常简单和著名的djb2和FNV非常好。

Here (data sizes < 10 bytes) we can see that the ILP smartness of other algorithms does not get to show itself, and the super-simplicity of FNV or djb2 win in performance.

在这里（数据大小<10字节）我们可以看到其他算法的ILP智能没有体现出来，FNV或djb2的超级简单在性能上胜出。

djb2

unsigned long
hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while (c = *str++)
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}

FNV-1

hash = FNV_offset_basis
for each byte_of_data to be hashed
     hash = hash × FNV_prime
     hash = hash XOR byte_of_data
return hash

FNV-1A

hash = FNV_offset_basis
for each byte_of_data to be hashed
     hash = hash XOR byte_of_data
     hash = hash × FNV_prime
return hash

A note about security and availability

关于安全性和可用性的说明

Hash functions can make your code vulnerable to denial-of-service attacks. If an attacker is able to force your server to handle too many collisions, your server may not be able to cope with requests.

散列函数会使您的代码容易受到拒绝服务攻击。如果攻击者能够强制您的服务器处理过多的冲突，您的服务器可能无法处理请求。

Some hash functions like MurmurHashaccept a seed that you can provide to drastically reduce the ability of attackers to predict the hashes your server software is generating. Keep that in mind.

一些散列函数（如MurmurHash）接受您可以提供的种子，以显着降低攻击者预测您的服务器软件生成的散列的能力。记在脑子里。

C++ 使用 hash_map 时，在 stl 字符串上使用的最佳散列算法是什么？

提问by PiNoYBoY82

回答by George V. Reilly

回答by Dark Shikari

回答by David Pierre

回答by Nils Pipenbrinck

回答by SquareCog

回答by bk1e

回答by Brian

回答by Josh S

回答by George V. Reilly

回答by philix

djb2

djb2

FNV-1

FNV-1

FNV-1A

FNV-1A

A note about security and availability

关于安全性和可用性的说明

相关推荐

最近更新

标签