C++ 具有 32 位整数的低冲突率的快速字符串散列算法

Question

提问by Jason Citron

I have lots of unrelated named things that I'd like to do quick searches against. An "aardvark" is always an "aardvark" everywhere, so hashing the string and reusing the integer would work well to speed up comparisons. The entire set of names is unknown (and changes over time). What is a fast string hashing algorithm that will generate small (32 or 16) bit values and have a low collision rate?

我有很多不相关的命名事物，我想对其进行快速搜索。“土豚”在任何地方都是“土豚”，因此散列字符串并重用整数可以很好地加速比较。整个名称集是未知的（并且随着时间的推移而变化）。什么是快速字符串散列算法，它会生成小的（32 或 16）位值并具有低冲突率？

I'd like to see an optimized implementation specific to C/C++.

我希望看到特定于 C/C++ 的优化实现。

Answer 1

采纳答案by Nick Johnson

One of the FNV variantsshould meet your requirements. They're fast, and produce fairly evenly distributed outputs.

其中的FNV变种应该满足你的要求。它们速度很快，并产生相当均匀分布的输出。

Answer 2

回答by yrp

Murmur Hashis pretty nice.

Murmur Hash很不错。

Answer 3

回答by Nils Pipenbrinck

For a fixed string-set use gperf.

对于固定的字符串集，请使用 gperf。

If your string-set changes you have to pick one hash function. That topic has been discussed before:

如果您的字符串集发生变化，您必须选择一个哈希函数。之前已经讨论过这个话题：

What's the best hashing algorithm to use on a stl string when using hash_map?

使用 hash_map 时，在 stl 字符串上使用的最佳散列算法是什么？

Answer 4

回答by Christoph

There's also a nice articleat eternallyconfuzzled.com.

在eternallyconfuzzled.com上还有一篇不错的文章。

Jenkins' One-at-a-Time hash for strings should look something like this:

Jenkins 一次一次的字符串哈希应该是这样的：

#include <stdint.h>

uint32_t hash_string(const char * s)
{
    uint32_t hash = 0;

    for(; *s; ++s)
    {
        hash += *s;
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }

    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);

    return hash;
}

Answer 5

回答by Carl Seleborg

Another solution that could be even better depending on your use-case is interned strings. This is how symbols work e.g. in Lisp.

根据您的用例，另一个可能更好的解决方案是interned strings。这就是符号的工作方式，例如在 Lisp 中。

An interned string is a string object whose value is the address of the actual string bytes. So you create an interned string object by checking in a global table: if the string is in there, you initialize the interned string to the address of that string. If not, you insert it, and then initialize your interned string.

内部字符串是一个字符串对象，其值是实际字符串字节的地址。因此，您通过检查全局表来创建一个内部字符串对象：如果字符串在那里，则将内部字符串初始化为该字符串的地址。如果没有，则插入它，然后初始化您的实习字符串。

This means that two interned strings built from the same string will have the same value, which is an address. So if N is the number of interned strings in your system, the characteristics are:

这意味着从同一个字符串构建的两个内部字符串将具有相同的值，即地址。因此，如果 N 是您系统中的内部字符串的数量，则其特征为：

Slow construction (needs lookup and possibly memory allocation)
Requires global data and synchronization in the case of concurrent threads
Compare is O(1), because you're comparing addresses, not actual string bytes (this means sorting works well, but it won't be an alphabetic sort).

缓慢的构建（需要查找和可能的内存分配）
在并发线程的情况下需要全局数据和同步
比较是 O(1)，因为您比较的是地址，而不是实际的字符串字节（这意味着排序效果很好，但不会是字母排序）。

Cheers,

干杯，

Carl

卡尔

Answer 6

回答by Antonio Morales

It is never late for a good subject and I am sure people would be interested on my findings.

一个好的主题永远不会迟到，我相信人们会对我的发现感兴趣。

I needed a hash function and after reading this post and doing a bit of research on the links given here, I came up with this variation of Daniel J Bernstein's algorithm, which I used to do an interesting test:

我需要一个哈希函数，在阅读了这篇文章并对这里给出的链接进行了一些研究后，我想出了 Daniel J Bernstein 算法的这种变体，我曾经做过一个有趣的测试：

unsigned long djb_hashl(const char *clave)
{
    unsigned long c,i,h;

    for(i=h=0;clave[i];i++)
    {
        c = toupper(clave[i]);
        h = ((h << 5) + h) ^ c;
    }
    return h;
}

unsigned long djb_hashl(const char *clave)
{
    unsigned long c,i,h;

    for(i=h=0;clave[i];i++)
    {
        c = toupper(clave[i]);
        h = ((h << 5) + h) ^ c;
    }
    return h;
}

This variation hashes strings ignoring the case, which suits my need of hashing users login credentials. 'clave' is 'key' in Spanish. I am sorry for the spanish but its my mother tongue and the program is written on it.

这种变体会忽略大小写对字符串进行哈希处理，这符合我对用户登录凭据进行哈希处理的需要。“clave”在西班牙语中是“关键”。我很抱歉西班牙语，但它是我的母语，并且程序是写在上面的。

Well, I wrote a program that will generate usernames from 'test_aaaa' to 'test_zzzz', and -to make the strings longer- I added to them a random domain in this list: 'cloud-nueve.com', 'yahoo.com', 'gmail.com' and 'hotmail.com'. Therefore each of them would look like:

好吧，我编写了一个程序，该程序将生成从“test_aaaa”到“test_zzzz”的用户名，并且 - 为了使字符串更长 - 我在此列表中向它们添加了一个随机域：“cloud-nueve.com”、“yahoo.com” '、'gmail.com' 和 'hotmail.com'。因此，它们中的每一个看起来都像：

[email protected], [email protected], 
[email protected], [email protected] and so on.

Here is the output of the test -'Colision entre XXX y XXX' means 'Collision of XXX and XXX'. 'palabras' means 'words' and 'Total' is the same in both languages-.

这是测试的输出 -'Colision entre XXX y XXX' 表示 'XXX 和 XXX 的碰撞'。“palabras”的意思是“单词”，“Total”在两种语言中都是一样的。

    Buscando Colisiones...
    Colision entre '[email protected]' y '[email protected]' (1DB903B7)
    Colision entre '[email protected]' y '[email protected]' (2F5BC088)
    Colision entre '[email protected]' y '[email protected]' (51FD09CC)
    Colision entre '[email protected]' y '[email protected]' (52F5480E)
    Colision entre '[email protected]' y '[email protected]' (74FF72E2)
    Colision entre '[email protected]' y '[email protected]' (7FD70008)
    Colision entre '[email protected]' y '[email protected]' (9BD351C4)
    Colision entre '[email protected]' y '[email protected]' (A86953E1)
    Colision entre '[email protected]' y '[email protected]' (BA6B0718)
    Colision entre '[email protected]' y '[email protected]' (D0523F88)
    Colision entre '[email protected]' y '[email protected]' (DEE08108)
    Total de Colisiones: 11
    Total de Palabras  : 456976

That is not bad, 11 collisions out of 456,976 (off course using the full 32 bit as table lenght).

这还不错，456,976 次碰撞中有 11 次碰撞（当然使用完整的 32 位作为表长度）。

Running the program using 5 chars, that is from 'test_aaaaa' to 'test_zzzzz', actually runs out of memory building the table. Below is the output. 'No hay memoria para insertar XXXX (insertadas XXX)' means 'There is not memory left to insert XXX (XXX inserted)'. Basically malloc() failed at that point.

使用 5 个字符运行程序，即从 'test_aaaaa' 到 'test_zzzzz'，实际上会耗尽构建表的内存。下面是输出。“No hay memoria para insertar XXXX (insertadas XXX)”的意思是“没有剩余的内存可以插入 XXX（已插入 XXX）”。基本上 malloc() 在那一点上失败了。

    No hay memoria para insertar 'test_epjcv' (insertadas 2097701).

    Buscando Colisiones...

    ...451 'colision' strings...

    Total de Colisiones: 451
    Total de Palabras  : 2097701

Which means just 451 collisions on 2,097,701 strings. Note that in none of the occasions, there were more than 2 collisions per code. Which I confirm it is a great hash for me, as what I need is to convert the login ID to a 40 bit unique id for indexing. So I use this to convert the login credentials to a 32 bit hash and use the extra 8 bits to handle up to 255 collisions per code, which lookign at the test results would be almost impossible to generate.

这意味着 2,097,701 个字符串上只有 451 次碰撞。请注意，在任何情况下，每个代码的冲突都不超过 2 次。我确认这对我来说是一个很好的哈希值，因为我需要将登录 ID 转换为 40 位唯一 ID 以进行索引。因此，我使用它来将登录凭据转换为 32 位哈希，并使用额外的 8 位来处理每个代码多达 255 次的冲突，这在测试结果中几乎不可能生成。

Hope this is useful to someone.

希望这对某人有用。

EDIT:

编辑：

Like the test box is AIX, I run it using LDR_CNTRL=MAXDATA=0x20000000 to give it more memory and it run longer, the results are here:

就像测试盒是 AIX，我使用 LDR_CNTRL=MAXDATA=0x20000000 运行它来给它更多的内存并运行更长的时间，结果在这里：

Buscando Colisiones... Total de Colisiones: 2908 Total de Palabras : 5366384

Buscando Colisiones... 总人数：2908 总人数：5366384

That is 2908 after 5,366,384 tries!!

经过 5,366,384 次尝试，这是 2908 次！！

VERY IMPORTANT: Compiling the program with -maix64 (so unsigned long is 64 bits), the number of collisions is 0 for all cases!!!

非常重要：使用 -maix64 编译程序（因此 unsigned long 为 64 位），所有情况下冲突次数均为 0 ！！！

Answer 7

回答by Bernard Igiri

Why don't you just use Boost libraries?Their hashing function is simple to use and most of the stuff in Boost will soon be part of the C++ standard. Some of it already is.

为什么不直接使用Boost 库？他们的散列函数使用简单，Boost 中的大部分内容很快就会成为 C++ 标准的一部分。其中一些已经是。

Boost hash is as easy as

Boost hash 很简单

#include <boost/functional/hash.hpp>

int main()
{
    boost::hash<std::string> string_hash;

    std::size_t h = string_hash("Hash me");
}

You can find boost at boost.org

你可以在boost.org 上找到 boost

Answer 8

回答by user7116

Bob Jenkins has many hash functions available, all of which are fast and have low collision rates.

Bob Jenkins 有许多可用的散列函数，所有这些函数都很快并且冲突率很低。

Answer 9

回答by James Antill

The Hsiehhash function is pretty good, and has some benchmarks/comparisons, as a general hash function in C. Depending on what you want (it's not completely obvious) you might want to consider something like cdbinstead.

在谢哈希函数是相当不错的，并且具有一定的基准测试/对比，因为根据你想要什么C.一般散列函数（它不是完全明显），你可能要考虑像国开行代替。

Answer 10

回答by Rob Wells

Have a look at GNU gperf.

看看 GNU gperf。

C++ 具有 32 位整数的低冲突率的快速字符串散列算法

提问by Jason Citron

采纳答案by Nick Johnson

回答by yrp

回答by Nils Pipenbrinck

回答by Christoph

回答by Carl Seleborg

回答by Antonio Morales

回答by Bernard Igiri

回答by user7116

回答by James Antill

回答by Rob Wells

相关推荐

最近更新

标签

C++ 具有 32 位整数的低冲突率的快速字符串散列算法

提问by Jason Citron

采纳答案by Nick Johnson

回答by yrp

回答by Nils Pipenbrinck

回答by Christoph

回答by Carl Seleborg

回答by Antonio Morales

回答by Bernard Igiri

回答by user7116

回答by James Antill

回答by Rob Wells

相关推荐

如何在 C++ 中的 big-endian 和 little-endian 值之间进行转换？

C++ 将 time_t 转换为 int

C++ GCC - “')' 标记之前的预期不合格 ID”

C++ 虚函数 const 与虚函数非常量

相关推荐

最近更新

标签