Why does Java's hashCode() in String use 31 as a multiplier?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license, cite the original address, and attribute it to the original authors (not me): StackOverflow

Original URL: http://stackoverflow.com/questions/299304/
Asked by jacobko
Per the Java documentation, the hash code for a String object is computed as:

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

using int arithmetic, where s[i] is the ith character of the string, n is the length of the string, and ^ indicates exponentiation.
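In code, that closed form is evaluated with Horner's rule, one multiply-add per character; a minimal sketch (the class and method names are mine, not the JDK's):

```java
// Minimal sketch of the documented polynomial hash using Horner's rule.
public class StringHashDemo {
    static int polyHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            // Each step folds in one more character:
            // after step i, h == s[0]*31^i + s[1]*31^(i-1) + ... + s[i]
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        // Matches the built-in implementation, since the recurrence is
        // exactly the one the documentation specifies.
        System.out.println(polyHash("hello") == "hello".hashCode()); // true
    }
}
```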
Why is 31 used as a multiplier?
I understand that the multiplier should be a relatively large prime number. So why not 29, or 37, or even 97?
Accepted answer by matt b
According to Joshua Bloch's Effective Java (a book that can't be recommended enough, and which I bought thanks to continual mentions on StackOverflow):
The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.
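The identity Bloch cites can be sanity-checked directly, since it holds under Java's wrap-around int arithmetic; a small sketch:

```java
// Checks that 31 * i == (i << 5) - i under Java's wrap-around int arithmetic.
public class ShiftSubtractDemo {
    public static void main(String[] args) {
        int[] samples = {0, 1, 7, 123456, -42, Integer.MAX_VALUE};
        for (int i : samples) {
            // (i << 5) is i * 32 modulo 2^32; subtracting i gives i * 31,
            // and any overflow wraps identically on both sides.
            System.out.println((31 * i) == ((i << 5) - i)); // true for every sample
        }
    }
}
```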
(from Chapter 3, Item 9: Always override hashCode when you override equals, page 48)
Answered by Dave L.
I'm not sure, but I would guess they tested some sample of prime numbers and found that 31 gave the best distribution over some sample of possible Strings.
Answered by Tom Hawtin - tackline
On (mostly) old processors, multiplying by 31 can be relatively cheap. On an ARM, for instance, it is only one instruction:
RSB r1, r0, r0, ASL #5 ; r1 := - r0 + (r0<<5)
Most other processors would require a separate shift and subtract instruction. However, if your multiplier is slow this is still a win. Modern processors tend to have fast multipliers so it doesn't make much difference, so long as 32 goes on the correct side.
It's not a great hash algorithm, but it's good enough and better than the 1.0 code (and very much better than the 1.0 spec!).
Answered by JohnZaj
As Goodrich and Tamassia point out, if you take over 50,000 English words (formed as the union of the word lists provided in two variants of Unix), using the constants 31, 33, 37, 39, and 41 will produce fewer than 7 collisions in each case. Knowing this, it should come as no surprise that many Java implementations choose one of these constants.
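An experiment along those lines is easy to reproduce. A rough sketch over a synthetic word list follows (the study cited used Unix dictionary files; swap in a real word list to approximate its numbers, and note all names here are mine):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Rough sketch of a multiplier-collision experiment over a synthetic word list.
public class CollisionExperiment {
    static int polyHash(String s, int mult) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) h = mult * h + s.charAt(i);
        return h;
    }

    // Counts strings that landed on an already-used hash value.
    static int collisions(List<String> words, int mult) {
        Map<Integer, Integer> seen = new HashMap<>();
        int collisions = 0;
        for (String w : words) {
            if (seen.merge(polyHash(w, mult), 1, Integer::sum) > 1) collisions++;
        }
        return collisions;
    }

    public static void main(String[] args) {
        // Synthetic stand-in for a dictionary.
        List<String> words = new ArrayList<>();
        for (int i = 0; i < 50_000; i++) words.add("word" + i);
        for (int mult : new int[]{31, 33, 37, 39, 41}) {
            System.out.println(mult + ": " + collisions(words, mult) + " collisions");
        }
    }
}
```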
Coincidentally, I was in the middle of reading the section "polynomial hash codes" when I saw this question.
EDIT: here is a link to the ~10MB PDF book I'm referring to above. See section 10.2, Hash Tables (page 413), of Data Structures and Algorithms in Java.
Answered by erickson
By multiplying, bits are shifted to the left. This uses more of the available space of hash codes, reducing collisions.
By not using a power of two, the lower-order, rightmost bits are populated as well, to be mixed with the next piece of data going into the hash.
The expression n * 31 is equivalent to (n << 5) - n.
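The point about powers of two can be made concrete: with a multiplier of 32, the low five bits of the final hash depend only on the last character, because 32 * h ≡ 0 (mod 32). A small comparison (the helper and word list are illustrative, not JDK code):

```java
// Compares low-bit behavior of a power-of-two multiplier (32) vs. 31.
public class LowBitsDemo {
    static int hashWith(String s, int mult) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) h = mult * h + s.charAt(i);
        return h;
    }

    public static void main(String[] args) {
        // All four words end in 'a', so with multiplier 32 the low five bits
        // of every hash are 'a' & 31 == 1: the prefix is shifted out entirely.
        // With 31, the earlier characters still influence the low bits.
        String[] words = {"alpha", "beta", "gamma", "delta"};
        for (String w : words) {
            System.out.printf("%-6s low5(mult=32)=%d low5(mult=31)=%d%n",
                    w, hashWith(w, 32) & 31, hashWith(w, 31) & 31);
        }
    }
}
```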
Answered by Jason
Bloch doesn't quite go into this, but the rationale I've always heard/believed is that this is basic algebra. Hashes boil down to multiplication and modulus operations, which means that you never want to use numbers with common factors if you can help it. In other words, relatively prime numbers provide an even distribution of answers.
The numbers that come into play when using a hash are typically:

- the modulus of the data type you put it into (2^32 or 2^64)
- the modulus of the bucket count in your hashtable (varies; in Java this used to be prime, now it's 2^n)
- the magic number you multiply or shift by in your mixing function
- the input value

You really only get to control a couple of these values, so a little extra care is due.
Answered by hrr
Actually, 37 would work pretty well! z := 37 * x can be computed as y := x + 8 * x; z := x + 4 * y. Each step corresponds to one LEA x86 instruction, so this is extremely fast.

In fact, multiplication by the even larger prime 73 could be done at the same speed by setting y := x + 8 * x; z := x + 8 * y.
Using 73 or 37 (instead of 31) might be better, because it leads to denser code: the two LEA instructions take only 6 bytes vs. the 7 bytes for move+shift+subtract for the multiplication by 31. One possible caveat is that the 3-argument LEA instructions used here became slower on Intel's Sandy Bridge architecture, with an increased latency of 3 cycles.
Moreover, 73 is Sheldon Cooper's favorite number.
Answered by David Ongaro
You can read Bloch's original reasoning under "Comments" in http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4045622. He investigated the performance of different hash functions with regard to the resulting "average chain size" in a hash table. P(31) was one of the common functions at that time, which he found in K&R's book (though even Kernighan and Ritchie couldn't remember where it came from). In the end he basically had to choose one, and so he took P(31) since it seemed to perform well enough. Even though P(33) was not really worse, and multiplication by 33 is equally fast to calculate (just a shift by 5 and an addition), he opted for 31 since 33 is not a prime:
Of the remaining four, I'd probably select P(31), as it's the cheapest to calculate on a RISC machine (because 31 is the difference of two powers of two). P(33) is similarly cheap to calculate, but its performance is marginally worse, and 33 is composite, which makes me a bit nervous.
So the reasoning was not as rational as many of the answers here seem to imply. But we're all good in coming up with rational reasons after gut decisions (and even Bloch might be prone to that).
Answered by Flow
From JDK-4045622, where Joshua Bloch describes the reasons why that particular (new) String.hashCode() implementation was chosen:
The table below summarizes the performance of the various hash functions described above, for three data sets:
1) All of the words and phrases with entries in Merriam-Webster's 2nd Int'l Unabridged Dictionary (311,141 strings, avg length 10 chars).
2) All of the strings in /bin/, /usr/bin/, /usr/lib/, /usr/ucb/ and /usr/openwin/bin/* (66,304 strings, avg length 21 characters).
3) A list of URLs gathered by a web-crawler that ran for several hours last night (28,372 strings, avg length 49 characters).
The performance metric shown in the table is the "average chain size" over all elements in the hash table (i.e., the expected value of the number of key compares to look up an element).
                            Webster's  Code Strings     URLs
                            ---------  ------------     ----
   Current Java Fn.           1.2509      1.2738      13.2560
   P(37)    [Java]            1.2508      1.2481       1.2454
   P(65599) [Aho et al]       1.2490      1.2510       1.2450
   P(31)    [K+R]             1.2500      1.2488       1.2425
   P(33)    [Torek]           1.2500      1.2500       1.2453
   Vo's Fn                    1.2487      1.2471       1.2462
   WAIS Fn                    1.2497      1.2519       1.2452
   Weinberger's Fn(MatPak)    6.5169      7.2142      30.6864
   Weinberger's Fn(24)        1.3222      1.2791       1.9732
   Weinberger's Fn(28)        1.2530      1.2506       1.2439
Looking at this table, it's clear that all of the functions except for the current Java function and the two broken versions of Weinberger's function offer excellent, nearly indistinguishable performance. I strongly conjecture that this performance is essentially the "theoretical ideal", which is what you'd get if you used a true random number generator in place of a hash function.
I'd rule out the WAIS function as its specification contains pages of random numbers, and its performance is no better than any of the far simpler functions. Any of the remaining six functions seem like excellent choices, but we have to pick one. I suppose I'd rule out Vo's variant and Weinberger's function because of their added complexity, albeit minor. Of the remaining four, I'd probably select P(31), as it's the cheapest to calculate on a RISC machine (because 31 is the difference of two powers of two). P(33) is similarly cheap to calculate, but its performance is marginally worse, and 33 is composite, which makes me a bit nervous.
Josh