Java 为什么在 hashCode 中使用质数?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/3613102/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Why use a prime number in hashCode?
提问by Ian Dallas
I was just wondering why is that primes are used in a class's hashCode()
method? For example, when using Eclipse to generate my hashCode()
method there is always the prime number 31
used:
我只是想知道为什么在类的hashCode()
方法中使用素数?例如,当使用 Eclipse 生成我的hashCode()
方法时,总是使用质数31
:
public int hashCode() {
final int prime = 31;
//...
}
References:
参考:
Here is a good primer on Hashcode and article on how hashing works that I found (C# but the concepts are transferrable): Eric Lippert's Guidelines and rules for GetHashCode()
这是我发现的关于 Hashcode 和散列如何工作的一篇很好的入门书(C#,但概念是可转移的): Eric Lippert 的 GetHashCode() 指南和规则
采纳答案by ILMTitan
Because you want the number you are multiplying by and the number of buckets you are inserting into to have orthogonal prime factorizations.
因为您希望乘以的数字和插入的桶数具有正交素数分解。
Suppose there are 8 buckets to insert into. If the number you are using to multiply by is some multiple of 8, then the bucket inserted into will only be determined by the least significant entry (the one not multiplied at all). Similar entries will collide. Not good for a hash function.
假设有 8 个桶要插入。如果您用来乘以的数字是 8 的某个倍数,那么插入的存储桶将仅由最不重要的条目(根本没有乘以的条目)确定。类似的条目会发生冲突。不适合散列函数。
31 is a large enough prime that the number of buckets is unlikely to be divisible by it (and in fact, modern java HashMap implementations keep the number of buckets to a power of 2).
31 是一个足够大的素数,桶的数量不太可能被它整除(事实上,现代 java HashMap 实现将桶的数量保持为 2 的幂)。
回答by Steve Kuo
I heard that 31 was chosen so that the compiler can optimize the multiplication to left-shift 5 bits then subtract the value.
我听说选择了 31 以便编译器可以优化乘法左移 5 位然后减去该值。
回答by Steve Kuo
It generally helps achieve a more even spread of your data among the hash buckets, especially for low-entropy keys.
它通常有助于在散列桶之间实现更均匀的数据分布,尤其是对于低熵键。
回答by John
回答by advait
Prime numbers are chosen to best distribute data among hash buckets. If the distribution of inputs is random and evenly spread, then the choice of the hash code/modulus does not matter. It only has an impact when there is a certain pattern to the inputs.
选择质数以在散列桶之间最好地分配数据。如果输入的分布是随机且均匀分布的,则哈希码/模数的选择无关紧要。只有当输入存在某种模式时,它才会产生影响。
This is often the case when dealing with memory locations. For example, all 32-bit integers are aligned to addresses divisible by 4. Check out the table below to visualize the effects of using a prime vs. non-prime modulus:
处理内存位置时经常会出现这种情况。例如,所有 32 位整数都与可被 4 整除的地址对齐。查看下表以可视化使用质数与非质数模数的效果:
Input Modulo 8 Modulo 7
0 0 0
4 4 4
8 0 1
12 4 5
16 0 2
20 4 6
24 0 3
28 4 0
Notice the almost-perfect distribution when using a prime modulus vs. a non-prime modulus.
请注意使用质数模数与非质数模数时几乎完美的分布。
However, although the above example is largely contrived, the general principle is that when dealing with a pattern of inputs, using a prime number modulus will yield the best distribution.
然而,虽然上面的例子很大程度上是人为设计的,但一般原则是,在处理输入模式时,使用质数模数将产生最佳分布。
回答by polygenelubricants
For what it's worth, Effective Java 2nd Editionhand-waives around the mathematics issue and just say that the reason to choose 31 is:
值得一提的是,Effective Java 2nd Edition放弃了数学问题,只是说选择 31 的原因是:
- Because it's an odd prime, and it's "traditional" to use primes
- It's also one less than a power of two, which permits for bitwise optimization
- 因为它是一个奇素数,而且使用素数是“传统的”
- 它也小于 2 的幂,这允许按位优化
Here's the full quote, from Item 9: Always override hashCode
when you override equals
:
这是第 9 项中hashCode
equals
的完整引用:覆盖时始终覆盖:
The value 31 was chosen because it's an odd prime. If it were even and multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional.
A nice property of 31 is that the multiplication can be replaced by a shift (§15.19) and subtraction for better performance:
31 * i == (i << 5) - i
Modern VMs do this sort of optimization automatically.
While the recipe in this item yields reasonably good hash functions, it does not yield state-of-the-art hash functions, nor do Java platform libraries provide such hash functions as of release 1.6. Writing such hash functions is a research topic, best left to mathematicians and theoretical computer scientists.
Perhaps a later release of the platform will provide state-of-the-art hash functions for its classes and utility methods to allow average programmers to construct such hash functions. In the meantime, the techniques described in this item should be adequate for most applications.
选择值 31 是因为它是一个奇质数。如果它是偶数并且乘法溢出,信息就会丢失,因为乘以 2 相当于移位。使用素数的优点不太清楚,但它是传统的。
31 的一个很好的属性是乘法可以用移位(第15.19 节)和减法代替以获得更好的性能:
31 * i == (i << 5) - i
现代虚拟机会自动进行这种优化。
虽然本项中的配方产生了相当好的散列函数,但它没有产生最先进的散列函数,Java 平台库也没有提供这样的散列函数(截至 1.6 版)。编写这样的散列函数是一个研究课题,最好留给数学家和理论计算机科学家。
也许该平台的后续版本将为其类和实用程序方法提供最先进的散列函数,以允许普通程序员构建此类散列函数。同时,本项中描述的技术应该适用于大多数应用程序。
Rather simplistically, it can be said that using a multiplier with numerous divisors will result in more hash collisions. Since for effective hashing we want to minimize the number of collisions, we try to use a multiplier that has fewer divisors. A prime number by definition has exactly two distinct, positive divisors.
更简单地说,可以说使用具有多个除数的乘数会导致更多的哈希冲突。由于为了有效散列我们希望最小化冲突的数量,我们尝试使用具有较少除数的乘法器。根据定义,素数正好有两个不同的正除数。
Related questions
相关问题
- Java hashCode from one field- the recipe, plus example of using Apache Commons Lang's builders
- is it incorrect to define an hashcode of an object as the sum, multiplication, whatever, of all class variables hashcodes?
- Absolute Beginner's Guide to Bit Shifting?
- 来自一个领域的 Java hashCode- 秘诀,以及使用 Apache Commons Lang 构建器的示例
- 将对象的哈希码定义为所有类变量哈希码的总和、乘法等是否不正确?
- 位移位绝对初学者指南?
回答by starblue
First you compute the hash value modulo 2^32 (the size of an int
), so you want something relatively prime to 2^32 (relatively prime means that there are no common divisors). Any odd number would do for that.
首先计算散列值模 2^32( an 的大小int
),因此您需要与 2^32 相对质数的值(相对质数意味着没有公约数)。任何奇数都可以。
Then for a given hash table the index is usually computed from the hash value modulo the size of the hash table, so you want something that is relatively prime to the size of the hash table. Often the sizes of hash tables are chosen as prime numbers for that reason. In the case of Java the Sun implementation makes sure that the size is always a power of two, so an odd number would suffice here, too. There is also some additional massaging of the hash keys to limit collisions further.
然后对于给定的哈希表,索引通常是根据哈希值以哈希表的大小为模计算的,因此您需要一些与哈希表的大小相对质数的东西。出于这个原因,哈希表的大小通常被选为素数。在 Java 的情况下,Sun 实现确保大小始终是 2 的幂,因此这里奇数也足够了。还有一些额外的散列键按摩以进一步限制冲突。
The bad effect if the hash table and the multiplier had a common factor n
could be that in certain circumstances only 1/n entries in the hash table would be used.
如果散列表和乘数有一个共同的因素,那么不好的影响n
可能是在某些情况下散列表中只有 1/n 个条目会被使用。
回答by DED
31 is also specific to Java HashMap which uses a int as hash data type. Thus the max capacity of 2^32. There is no point in using larger Fermat or Mersenne primes.
31 也特定于 Java HashMap,它使用 int 作为哈希数据类型。因此最大容量为 2^32。使用更大的费马或梅森素数是没有意义的。
回答by Amar Magar
The reason why prime numbers are used is to minimize collisions when the data exhibits some particular patterns.
使用质数的原因是在数据表现出某些特定模式时尽量减少冲突。
First things first: If the data is random then there's no need for a prime number, you can do a mod operation against any number and you will have the same number of collisions for each possible value of the modulus.
首先要做的事情是:如果数据是随机的,则不需要质数,您可以对任何数字进行模运算,并且对于模数的每个可能值,您将有相同数量的碰撞。
But when data is not random then strange things happen. For example consider numeric data that is always a multiple of 10.
但是当数据不是随机的时,就会发生奇怪的事情。例如,考虑始终是 10 倍数的数字数据。
If we use mod 4 we find:
如果我们使用 mod 4,我们会发现:
10 mod 4 = 2
10 模 4 = 2
20 mod 4 = 0
20 模 4 = 0
30 mod 4 = 2
30 模 4 = 2
40 mod 4 = 0
40 模 4 = 0
50 mod 4 = 2
50 模 4 = 2
So from the 3 possible values of the modulus (0,1,2,3) only 0 and 2 will have collisions, that is bad.
因此,从模数 (0,1,2,3) 的 3 个可能值中,只有 0 和 2 会发生冲突,这很糟糕。
If we use a prime number like 7:
如果我们使用像 7 这样的质数:
10 mod 7 = 3
10 模 7 = 3
20 mod 7 = 6
20 模 7 = 6
30 mod 7 = 2
30 模 7 = 2
40 mod 7 = 4
40 模 7 = 4
50 mod 7 = 1
50 模 7 = 1
etc
等等
We also note that 5 is not a good choice but 5 is prime the reason is that all our keys are a multiple of 5. This means we have to choose a prime number that doesn't divide our keys, choosing a large prime number is usually enough.
我们还注意到 5 不是一个好的选择,但 5 是质数,原因是我们所有的键都是 5 的倍数。这意味着我们必须选择一个不能整除我们的键的质数,选择一个大的质数是通常足够了。
So erring on the side of being repetitive the reason prime numbers are used is to neutralize the effect of patterns in the keys in the distribution of collisions of a hash function.
因此,在重复方面犯了错误,使用质数的原因是为了抵消散列函数冲突分布中键中模式的影响。