Java: What is a sensible prime for hashcode calculation?
Original question: http://stackoverflow.com/questions/1835976/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by Hans-Peter Störr
Eclipse 3.5 has a very nice feature to generate Java hashCode() functions. For example, it would generate (slightly shortened):
class HashTest {
    int i;
    int j;

    public int hashCode() {
        final int prime = 31;
        int result = prime + i;
        result = prime * result + j;
        return result;
    }
}
(If you have more attributes in the class, result = prime * result + attribute.hashCode(); is repeated for each additional attribute. For ints, .hashCode() can be omitted.)
This seems fine except for the choice of 31 as the prime. It is probably taken from the hashCode implementation of Java's String, where it was chosen for performance reasons that are long gone since the introduction of hardware multipliers. With 31 you get many hashcode collisions for small values of i and j: for example, (0,0) and (-1,31) have the same value. I think that is a Bad Thing(TM), since small values occur often. For String.hashCode you'll also find many short strings with the same hashcode, for instance "Ca" and "DB". If you take a large, well-chosen prime, this problem disappears.
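The two collisions mentioned above are easy to reproduce. A minimal sketch: `CollisionDemo` and its `hash` helper are names made up for this illustration, and `hash` simply replays the generated template with prime 31.

```java
class CollisionDemo {
    // Replays the Eclipse-style template for two int fields with prime = 31.
    // (Hypothetical helper, not part of the original question.)
    static int hash(int i, int j) {
        final int prime = 31;
        int result = prime + i;
        result = prime * result + j;
        return result;
    }

    public static void main(String[] args) {
        // (0,0) and (-1,31) produce the same value.
        System.out.println(hash(0, 0) + " " + hash(-1, 31));         // 961 961
        // String.hashCode uses the same multiplier, so short strings collide too.
        System.out.println("Ca".hashCode() + " " + "DB".hashCode()); // 2174 2174
    }
}
```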
So my question: what is a good prime to choose? What criteria do you apply to find it?
This is meant as a general question, so I do not want to give a range for i and j. But I suppose that in most applications relatively small values occur more often than large values. (If you have large values, the choice of the prime is probably unimportant.) It might not make much of a difference, but a better choice is an easy and obvious way to improve this, so why not do it? Commons Lang's HashCodeBuilder also suggests curiously small values.
(Clarification: this is not a duplicate of "Why does Java's hashCode() in String use 31 as a multiplier?", since my question is not concerned with the history of the 31 in the JDK, but with what would be a better value in new code using the same basic template. None of the answers there try to answer that.)
Answer by Hans-Peter Störr
I recommend using 92821. Here's why.
To give a meaningful answer to this you have to know something about the possible values of i and j. The only thing I can think of in general is that in many cases small values will be more common than large values. (The odds of 15 appearing as a value in your program are much better than, say, 438281923.) So it seems a good idea to make the smallest hashcode collision as large as possible by choosing an appropriate prime. For 31 this is rather bad: already for i=-1 and j=31 you have the same hash value as for i=0 and j=0.
Since this is interesting, I've written a little program that searched the whole int range for the best prime in this sense. That is, for each prime I searched for the minimum value of Math.abs(i) + Math.abs(j) over all values of i,j that have the same hashcode as 0,0, and then took the prime where this minimum value is as large as possible.
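For a single candidate prime, the criterion can be sketched roughly as follows. This is not the original search program, only an illustration with invented names (`SmallestCollision`, `smallestCollision`). It relies on the observation that (i, j) hashes like (0, 0) exactly when prime * i + j wraps to zero in int arithmetic, so every i has exactly one colliding j.

```java
class SmallestCollision {
    // Smallest |i| + |j| over all (i, j) != (0, 0) that hash like (0, 0)
    // under the two-field template. Hypothetical helper for illustration.
    static long smallestCollision(int prime) {
        long best = Long.MAX_VALUE;
        // Negating i just negates j, so scanning positive i is enough.
        // Once i reaches best, no later i can give a smaller sum.
        for (long i = 1; i < best; i++) {
            int j = -(prime * (int) i); // the one j that collides for this i (overflow wraps)
            best = Math.min(best, i + Math.abs((long) j));
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(smallestCollision(31)); // 32, e.g. i=-1, j=31
        System.out.println(smallestCollision(92821));
    }
}
```

Running this over every prime in the int range, as the answer describes, would additionally need a sieve or primality test, which is left out here.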
Drumroll: the best prime in this sense is 486187739 (with the smallest collision being i=-25486, j=67194). Nearly as good and much easier to remember is 92821 with the smallest collision being i=-46272 and j=46016.
If you give "small" another meaning and want the minimum of Math.sqrt(i*i+j*j) for the collision to be as large as possible, the results are a little different: the best would be 1322837333 with i=-6815 and j=70091, but my favourite 92821 (smallest collision -46272,46016) is again almost as good as the best value.
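The collisions reported in this answer can be checked directly. A small sketch, using a made-up `ReportedCollisions` class whose `hash` method takes the prime as a parameter; it confirms that each pair hashes like (0,0) and prints both notions of "size":

```java
class ReportedCollisions {
    // The generated two-field template, with the prime passed in.
    // (Hypothetical helper for illustration.)
    static int hash(int prime, int i, int j) {
        int result = prime + i;
        result = prime * result + j;
        return result;
    }

    static void report(int prime, int i, int j) {
        boolean collides = hash(prime, i, j) == hash(prime, 0, 0);
        long manhattan = Math.abs((long) i) + Math.abs((long) j);
        double euclidean = Math.sqrt((double) i * i + (double) j * j);
        System.out.println(prime + ": collides=" + collides
                + ", |i|+|j|=" + manhattan
                + ", sqrt(i*i+j*j)=" + euclidean);
    }

    public static void main(String[] args) {
        report(486187739, -25486, 67194);
        report(92821, -46272, 46016);
        report(1322837333, -6815, 70091);
    }
}
```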
I do acknowledge that it is quite debatable whether these calculations make much sense in practice. But I do think that taking 92821 as the prime makes much more sense than 31, unless you have good reasons not to.
Answer by Pascal Cuoq
Actually, if you take a prime so large that it comes close to INT_MAX, you have the same problem because of modulo arithmetic. If you expect to hash mostly strings of length 2, perhaps a prime near the square root of INT_MAX would be best. If the strings you hash are longer, it doesn't matter so much, and collisions are unavoidable anyway...
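An extreme case illustrates the point: Integer.MAX_VALUE (2^31 - 1) happens to be prime, and plugging it into the question's template makes (2,2) collide with (0,0), because 2 * (2^31 - 1) wraps around to -2. A minimal sketch; the class name `HugePrimeDemo` is hypothetical:

```java
class HugePrimeDemo {
    // The two-field template from the question, with the prime passed in.
    static int hash(int prime, int i, int j) {
        int result = prime + i;
        result = prime * result + j;
        return result;
    }

    public static void main(String[] args) {
        int prime = Integer.MAX_VALUE; // 2147483647, a Mersenne prime
        System.out.println(hash(prime, 0, 0)); // 1
        System.out.println(hash(prime, 2, 2)); // 1
    }
}
```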
Answer by Romain
Collisions may not be such a big issue... The primary goal of the hash is to avoid using equals for 1:1 comparisons. If you have an implementation where equals is "generally" extremely cheap for objects whose hashes collide, then this is not an issue (at all).
In the end, the best way of hashing depends on what you are comparing. In the case of an int pair (as in your example), basic bitwise operators (such as & or ^) could be sufficient.
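For an int pair, the bitwise variant could look like the sketch below. `IntPair` is a hypothetical class introduced only for illustration. Note that plain XOR is symmetric, so (1,2) and (2,1) share a hash and every pair with i == j hashes to 0; whether that is acceptable depends on the data and on how cheap equals is.

```java
class IntPair {
    final int i;
    final int j;

    IntPair(int i, int j) {
        this.i = i;
        this.j = j;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof IntPair)) return false;
        IntPair other = (IntPair) o;
        return i == other.i && j == other.j; // cheap: two int comparisons
    }

    @Override
    public int hashCode() {
        return i ^ j; // symmetric: swapping i and j gives the same hash
    }
}
```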
Answer by Peter Lawrey
You need to define your range for i and j. You could use a prime number for both.
public int hashCode() {
    // http://primes.utm.edu/curios/ ;)
    // * binds tighter than ^, so this is (97654321 * i) ^ (12356789 * j)
    return 97654321 * i ^ 12356789 * j;
}
Answer by Erich Kitzmueller
I'd choose 7243. Large enough to avoid collisions with small numbers. Doesn't overflow to small numbers quickly.
Answer by neoedmund
I just want to point out that the hashcode has nothing to do with primes. In the JDK implementation:
for (int i = 0; i < value.length; i++) {
    h = 31 * h + val[i];
}
I found that if you replace 31 with 27, the results are very similar.

