C#: Best hashing algorithm in terms of hash collisions and performance for strings
Original question: http://stackoverflow.com/questions/251346/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): Stack Overflow
Best hashing algorithm in terms of hash collisions and performance for strings
Asked by dpan
What would be the best hashing algorithm if we had the following priorities (in that order):
- Minimal hash collisions
- Performance
It doesn't have to be secure. Basically I'm trying to create an index based on a combination of properties of some objects. All the properties are strings.
Any references to C# implementations would be appreciated.
Accepted answer by Mecki
Forget about the term "best". No matter which hash algorithm anyone might come up with, unless you have a very limited set of data that needs to be hashed, every algorithm that performs very well on average can become completely useless once it is fed the right (or, from your perspective, "wrong") data.
Instead of wasting too much time thinking about how to make the hash more collision-free without burning too much CPU time, I'd rather start thinking about "how to make collisions less problematic". E.g. if every hash bucket is in fact a table and all strings in this table (that had a collision) are sorted alphabetically, you can search within a bucket using binary search (which is only O(log n)), and that means even when every second hash bucket has 4 collisions, your code will still have decent performance (it will be a bit slower than a collision-free table, but not by much). One big advantage here is that if your table is big enough and your hash is not too simple, two strings resulting in the same hash value will usually look completely different (hence the binary search can stop comparing strings after maybe one or two characters on average, making every comparison very fast). A minimal C# sketch of this idea follows.
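To make that concrete, here is a minimal C# sketch of the sorted-bucket idea, assuming ordinal string comparison and a fixed bucket count (the class and method names are made up for illustration):

using System;
using System.Collections.Generic;

class SortedBucketSet
{
    private readonly List<string>[] _buckets;

    public SortedBucketSet(int bucketCount)
    {
        _buckets = new List<string>[bucketCount];
    }

    private List<string> Bucket(string key)
    {
        // Any cheap hash will do here; collisions only cost a binary search.
        int index = (key.GetHashCode() & 0x7FFFFFFF) % _buckets.Length;
        return _buckets[index] ??= new List<string>();
    }

    public void Add(string key)
    {
        List<string> bucket = Bucket(key);
        int pos = bucket.BinarySearch(key, StringComparer.Ordinal);
        if (pos < 0)
            bucket.Insert(~pos, key);   // keep the bucket sorted
    }

    public bool Contains(string key)
    {
        // O(log n) within the bucket, even when collisions pile up.
        return Bucket(key).BinarySearch(key, StringComparer.Ordinal) >= 0;
    }
}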
Actually I had a situation myself where searching directly within a sorted table using binary search turned out to be faster than hashing! Even though my hash algorithm was simple, it took quite some time to hash the values. Performance testing showed that only once there were more than about 700-800 entries was hashing indeed faster than binary search. However, as the table could never grow beyond 256 entries anyway, and as the average table stayed below 10 entries, benchmarking clearly showed that on every system and every CPU the binary search was faster. Here, the fact that comparing just the first byte of the data was usually enough to move on to the next bsearch iteration (since the data typically already differed in the first one or two bytes) turned out to be a big advantage.
So to summarize: I'd take a decent hash algorithm that doesn't cause too many collisions on average and is rather fast (I'd even accept some more collisions if it's just very fast!), and rather optimize my code for the smallest performance penalty once collisions do occur (and they will! They will, unless your hash space is at least as large as your data space and you can map a unique hash value to every possible set of data).
Answered by ConcernedOfTunbridgeWells
There is no one single optimum hashing algorithm. If you have a known input domain, you can use a perfect-hashing generator such as gperf to generate a hashing algorithm that is collision-free for that particular input set. Otherwise, there is no 'right' answer to this question.
Answered by Jason Cohen
You can get both using the Knuth hash function described here.
It's extremely fast assuming a power-of-2 hash table size -- just one multiply, one shift, and one bit-and. More importantly (for you) it's great at minimizing collisions (see this analysis).
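As a rough illustration (my own sketch, not code from the linked page): with the multiplier 2654435769, which is floor(2^32 / golden ratio) and the constant Knuth describes, and a table of 2^tableBits slots (tableBits between 1 and 31), the whole hash is one multiply and one shift:

static uint KnuthHash(uint key, int tableBits)
{
    // 2654435769 == floor(2^32 / golden ratio); uint math wraps, as intended.
    return (key * 2654435769u) >> (32 - tableBits);
}

// For strings, first reduce to a uint (string.GetHashCode is my placeholder
// here, not part of Knuth's method), then spread the bits the same way:
static uint KnuthStringIndex(string s, int tableBits)
{
    return ((uint)s.GetHashCode() * 2654435769u) >> (32 - tableBits);
}

The bit-and variant simply masks the product with (tableSize - 1) instead of shifting.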
Some other good algorithms are described here.
Answered by activout.se
The simple hashCode used by Java's String class might show a suitable algorithm.
Below is the "GNU Classpath" implementation. (License: GPL)
/**
 * Computes the hashcode for this String. This is done with int arithmetic,
 * where ** represents exponentiation, by this formula:<br>
 * <code>s[0]*31**(n-1) + s[1]*31**(n-2) + ... + s[n-1]</code>.
 *
 * @return hashcode value of this String
 */
public int hashCode()
{
  if (cachedHashCode != 0)
    return cachedHashCode;

  // Compute the hash code using a local variable to be reentrant.
  int hashCode = 0;
  int limit = count + offset;
  for (int i = offset; i < limit; i++)
    hashCode = hashCode * 31 + value[i];
  return cachedHashCode = hashCode;
}
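A rough C# equivalent of the same polynomial hash (my own translation for illustration; note that .NET's built-in string.GetHashCode uses a different algorithm):

static int JavaStyleStringHash(string s)
{
    int hashCode = 0;
    foreach (char c in s)
        hashCode = hashCode * 31 + c;   // overflow wraps (unchecked by default)
    return hashCode;
}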
Answered by Michael Burr
As Nigel Campbell indicated, there's no such thing as the 'best' hash function, as it depends on the data characteristics of what you're hashing as well as on whether or not you need cryptographic-quality hashes.
That said, here are some pointers:
- Since the items you're using as input to the hash are just a set of strings, you could simply combine the hashcodes for each of those individual strings. I've seen the following pseudo-code suggested for this, but I don't know of any particular analysis of it (a runnable sketch follows this list):
  int hashCode = 0; foreach (string s in propertiesToHash) { hashCode = 31*hashCode + s.GetHashCode(); }
- According to this article, System.Web has an internal method that combines hashcodes using
  combinedHash = ((combinedHash << 5) + combinedHash) ^ nextObj.GetHashCode();
- I've also seen code that simply XORs the hashcodes together, but that seems like a bad idea to me (though again I have no analysis to back this up). If nothing else, you end up with a collision if the same strings are hashed in a different order.
- I've used FNV to good effect: http://www.isthe.com/chongo/tech/comp/fnv/
- Paul Hsieh has a decent article: http://www.azillionmonkeys.com/qed/hash.html
- Another nice article by Bob Jenkins, originally published in 1997 in Dr. Dobb's Journal (the linked article has updates): http://burtleburtle.net/bob/hash/doobs.html
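Here is a runnable C# sketch of the first suggestion above, together with a 32-bit FNV-1a string hash (the FNV constants are the standard published ones; the helper names are mine):

using System;

static class HashHelpers
{
    // The 31-multiplier combiner from the pseudo-code above.
    public static int CombineHashCodes(params string[] propertiesToHash)
    {
        int hashCode = 0;
        foreach (string s in propertiesToHash)
            hashCode = 31 * hashCode + (s?.GetHashCode() ?? 0);
        return hashCode;
    }

    // 32-bit FNV-1a, folding in each UTF-16 code unit byte by byte.
    public static uint Fnv1a(string s)
    {
        const uint offsetBasis = 2166136261;
        const uint fnvPrime = 16777619;
        uint hash = offsetBasis;
        foreach (char c in s)
        {
            hash = (hash ^ (byte)c) * fnvPrime;          // low byte
            hash = (hash ^ (byte)(c >> 8)) * fnvPrime;   // high byte
        }
        return hash;
    }
}

For example (the property names are hypothetical): int h = HashHelpers.CombineHashCodes(obj.FirstName, obj.LastName);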
Answered by Jason Z
I love Stackoverflow! Reading this question made me look into hash functions a bit more and I found the Cuckoo Hash.
From the article:
Lookup requires inspection of just two locations in the hash table, which takes constant time in the worst case (see Big O notation). This is in contrast to many other hash table algorithms, which may not have a constant worst-case bound on the time to do a lookup.
I think that fits into your criteria of collisions and performance. It appears that the tradeoff is that this type of hash table can only get 49% full.
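A toy C# sketch of the scheme (not production code: both hash functions below are derived from string.GetHashCode, whereas real cuckoo hashing wants two independent hash functions, and a full implementation would rehash or grow instead of giving up):

using System;

class CuckooStringSet
{
    private readonly string[] _table1, _table2;
    private const int MaxKicks = 32;

    public CuckooStringSet(int size)
    {
        _table1 = new string[size];
        _table2 = new string[size];
    }

    private int H1(string s) => (s.GetHashCode() & 0x7FFFFFFF) % _table1.Length;
    private int H2(string s) => ((s.GetHashCode() * 0x5bd1e995) & 0x7FFFFFFF) % _table2.Length;

    // Worst-case constant-time lookup: exactly two slots to inspect.
    public bool Contains(string key) =>
        _table1[H1(key)] == key || _table2[H2(key)] == key;

    public bool Add(string key)
    {
        if (Contains(key)) return true;
        string current = key;
        for (int kick = 0; kick < MaxKicks; kick++)
        {
            // Claim the slot in table 1, evicting any occupant.
            int i1 = H1(current);
            string evicted = _table1[i1];
            _table1[i1] = current;
            if (evicted == null) return true;

            // Relocate the evicted entry to table 2, possibly evicting again.
            int i2 = H2(evicted);
            current = _table2[i2];
            _table2[i2] = evicted;
            if (current == null) return true;
        }
        return false; // insertion failed; a real table would rehash/grow here
    }
}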
Answered by Andrei Rînea
I am going to be lame here and give a more theoretical response rather than a pinpoint answer, but please take the value in it.
First, there are two distinct problems:
a. Collision probability
b. Performance of hashing (i.e. time, CPU cycles, etc.)
The two problems are mildly correlated. They are not perfectly correlated.
Problem (a) deals with the difference between the space of inputs and the resulting hash space. When you hash a 1 KB (1024-byte) file and the hash is 32 bytes long, there will be:
1.0907481356194159294629842447338e+2466 (i.e. roughly 10^2466) possible input files
and the hash space will have
1.1579208923731619542357098500869e+77 (i.e. roughly 10^77) possible values
The difference IS HUGE: there is a factor of roughly 10^2389 between them. THERE WILL BE COLLISIONS (a collision is the special case where two DIFFERENT input files have the exact same hash), since we are reducing 10^2466 cases to 10^77 cases.
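These magnitudes are easy to double-check in C# with System.Numerics (a small verification sketch):

using System;
using System.Numerics;

class Magnitudes
{
    static void Main()
    {
        BigInteger inputs = BigInteger.Pow(256, 1024); // all 1024-byte files: 2^8192
        BigInteger hashes = BigInteger.Pow(256, 32);   // all 32-byte hashes: 2^256

        Console.WriteLine(inputs.ToString().Length);   // 2467 digits, ~1.09e2466
        Console.WriteLine(hashes.ToString().Length);   // 78 digits,   ~1.16e77
    }
}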
The only way to minimize the collision risk is to enlarge the hash space and therefore make the hash longer. Ideally the hash would be as long as the file itself, but that is somewhat moronic.
The second problem is performance. This depends only on the hash algorithm itself. Of course, a longer hash will most probably require more CPU cycles, but a smarter algorithm might not. I have no clear-cut answer for this question. It's just too tough.
However, you can benchmark/measure different hashing implementations and draw preliminary conclusions from that.
Good luck ;)
Answered by Abhishek Jain
Here is a straightforward way of implementing it yourself: http://www.devcodenote.com/2015/04/collision-free-string-hashing.html
Here is a snippet from the post:
If, say, we have a character set of capital English letters, then the length of the character set is 26, where A can be represented by the number 0, B by the number 1, C by the number 2, and so on up to Z by the number 25. Now, whenever we want to map a string over this character set to a unique number, we perform the same conversion as we would for a binary representation.
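A sketch of the base-26 mapping the quote describes (my own illustration, assuming uppercase A-Z input; note that ulong overflows for strings longer than about 13 characters, and strings differing only in leading 'A's collapse to the same value):

static ulong ToBase26(string s)
{
    ulong value = 0;
    foreach (char c in s)
        value = value * 26 + (ulong)(c - 'A'); // A=0, B=1, ..., Z=25
    return value;
}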
Answered by Alex
"Murmurhash" is pretty good on both performance and collisions.
The mentioned thread at "softwareengineering.stackexchange" has some tests and Murmur wins.
I wrote my own C# port of MurmurHash 2 for .NET and tested it on a list of 466k English words; I got 22 collisions.
The results and implementation are here: https://github.com/jitbit/MurmurHash.net (disclaimer: I'm involved with this open-source project!)
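For reference, here is a minimal C# sketch of the 32-bit MurmurHash 2 core (Austin Appleby's public-domain algorithm; this is my own transcription, not the jitbit port linked above, and it assumes a little-endian platform):

using System;

static class Murmur2
{
    public static uint Hash(byte[] data, uint seed)
    {
        const uint m = 0x5bd1e995;
        const int r = 24;

        uint h = seed ^ (uint)data.Length;
        int i = 0;

        // Mix four bytes at a time into the hash.
        while (data.Length - i >= 4)
        {
            uint k = BitConverter.ToUInt32(data, i);
            k *= m;
            k ^= k >> r;
            k *= m;
            h *= m;
            h ^= k;
            i += 4;
        }

        // Fold in the remaining 0-3 bytes.
        switch (data.Length - i)
        {
            case 3: h ^= (uint)data[i + 2] << 16; goto case 2;
            case 2: h ^= (uint)data[i + 1] << 8; goto case 1;
            case 1: h ^= data[i]; h *= m; break;
        }

        // Final avalanche so the last bytes are well mixed.
        h ^= h >> 13;
        h *= m;
        h ^= h >> 15;
        return h;
    }
}

To hash a string, pass its bytes and any seed, e.g. uint h = Murmur2.Hash(System.Text.Encoding.UTF8.GetBytes(word), 0);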