Java: Why doesn't String's hashCode() cache 0?
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/2310498/
Asked by polygenelubricants
I noticed in the Java 6 source code for String that hashCode only caches values other than 0. The difference in performance is exhibited by the following snippet:
public class Main {
    static void test(String s) {
        long start = System.currentTimeMillis();
        for (int i = 0; i < 10000000; i++) {
            s.hashCode();
        }
        System.out.format("Took %d ms.%n", System.currentTimeMillis() - start);
    }

    public static void main(String[] args) {
        String z = "Allocator redistricts; strict allocator redistricts strictly.";
        test(z);
        test(z.toUpperCase());
    }
}
Running this on ideone.com gives the following output:
Took 1470 ms.
Took 58 ms.
So my questions are:
- Why doesn't String's hashCode() cache 0?
- What is the probability that a Java string hashes to 0?
- What's the best way to avoid the performance penalty of recomputing the hash value every time for strings that hash to 0?
- Is this the best-practice way of caching values? (i.e. cache all except one?)
For your amusement, each line here is a string that hashes to 0:
pollinating sandboxes
amusement & hemophilias
schoolworks = perversive
electrolysissweeteners.net
constitutionalunstableness.net
grinnerslaphappier.org
BLEACHINGFEMININELY.NET
WWW.BUMRACEGOERS.ORG
WWW.RACCOONPRUDENTIALS.NET
Microcomputers: the unredeemed lollipop...
Incentively, my dear, I don't tessellate a derangement.
A person who never yodelled an apology, never preened vocalizing transsexuals.
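A quick way to check claims like these yourself is to print the hash codes. This is only a small sketch: the class name ZeroHashCheck and the helper hashesToZero are made up for illustration, and only a few of the phrases above are included.

```java
class ZeroHashCheck {
    // A few of the phrases listed above, copied verbatim.
    static final String[] SAMPLES = {
        "pollinating sandboxes",
        "amusement & hemophilias",
        "schoolworks = perversive"
    };

    // True when String.hashCode() yields exactly 0 for the given text.
    static boolean hashesToZero(String s) {
        return s.hashCode() == 0;
    }

    public static void main(String[] args) {
        for (String s : SAMPLES) {
            // Print the phrase next to its hash code so each one can be inspected.
            System.out.println(s + " -> " + s.hashCode());
        }
    }
}
```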
Accepted answer by Kevin Bourrillion
You're worrying about nothing. Here's a way to think about this issue.
Suppose you have an application that does nothing but sit around hashing Strings all year long. Let's say it takes a thousand strings, all in memory, calls hashCode() on them repeatedly in round-robin fashion, a million times through, then gets another thousand new strings and does it again.
And suppose that the likelihood of a string's hash code being zero were, in fact, much greater than 1/2^32. I'm sure it is somewhat greater than 1/2^32, but let's say it's a lot worse than that, like 1/2^16 (the square root! now that's a lot worse!).
In this situation, you have more to benefit from Oracle's engineers improving how these strings' hash codes are cached than anyone else alive. So you write to them and ask them to fix it. And they work their magic so that whenever s.hashCode() is zero, it returns instantaneously (even the first time! a 100% improvement!). And let's say that they do this without degrading the performance at all for any other case.
Hooray! Now your app is... let's see... 0.0015% faster!
What used to take an entire day now takes only 23 hours, 59 minutes and 58.7 seconds!
And remember, we set up the scenario to give every possible benefit of the doubt, often to a ludicrous degree.
Does this seem worth it to you?
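For reference, the 0.0015% figure is just the assumed 1/2^16 zero-hash rate expressed as a percentage, since in this scenario only those calls get faster. A tiny sketch of the arithmetic (the class and method names are invented for illustration):

```java
class ZeroHashSavings {
    // Fraction of hashCode() calls that land on a zero-hash string,
    // assuming odds of 1 in 2^log2OfOdds as in the scenario above.
    static double savedFraction(int log2OfOdds) {
        return 1.0 / Math.pow(2, log2OfOdds);
    }

    public static void main(String[] args) {
        double fraction = savedFraction(16);
        // Best case: every one of those calls becomes free.
        System.out.printf("Best-case speedup: %.4f%%%n", fraction * 100);
    }
}
```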
EDIT: since posting this a couple hours ago, I've let one of my processors run wild looking for two-word phrases with zero hash codes. So far it's come up with: bequirtle zorillo, chronogrammic schtoff, contusive cloisterlike, creashaks organzine, drumwood boulderhead, electroanalytic exercisable, and favosely nonconstruable. This is out of about 2^35 possibilities, so with perfect distribution we'd expect to see only 8. Clearly by the time it's done we'll have a few times that many, but not outlandishly more. What's more significant is that I've now come up with a few interesting band names/album names! No fair stealing!
Answer by Jon Skeet
It uses 0 to indicate "I haven't worked out the hashcode yet". The alternative would be to use a separate Boolean flag, which would take more memory. (Or to not cache the hashcode at all, of course.)
I don't expect many strings to hash to 0; arguably it would make sense for the hashing routine to deliberately avoid 0 (e.g. translate a hash of 0 to 1, and cache that). That would increase collisions but avoid rehashing. It's too late to do that now though, as the String hashCode algorithm is explicitly documented.
As for whether this is a good idea in general: it's certainly an efficient caching mechanism, and might (see edit) be even better with a change to avoid rehashing values which end up with a hash of 0. Personally I would be interested to see the data which led Sun to believe this was worth doing in the first place - it's taking up an extra 4 bytes for every string ever created, however often or rarely it's hashed, and the only benefit is for strings which are hashed more than once.
EDIT: As KevinB points out in a comment elsewhere, the "avoid 0" suggestion above may well have a net cost because it helps a very rare case, but requires an extra comparison for every hash calculation.
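Since String itself can't change, the "avoid 0" idea can only be tried in your own key types. Here is a rough sketch of what that might look like in a wrapper class; NonZeroCachedKey is a hypothetical name, and the extra `if (h == 0)` is exactly the additional comparison mentioned in the edit above.

```java
class NonZeroCachedKey {
    private final String text;
    private int hash; // 0 means "not computed yet"

    NonZeroCachedKey(String text) {
        this.text = text;
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0) {
            h = text.hashCode();
            if (h == 0) {
                h = 1; // remap 0 to 1 so the result can always be cached
            }
            hash = h;
        }
        return h;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof NonZeroCachedKey
                && ((NonZeroCachedKey) o).text.equals(text);
    }

    public static void main(String[] args) {
        NonZeroCachedKey key = new NonZeroCachedKey("pollinating sandboxes");
        System.out.println("wrapper hash: " + key.hashCode());
    }
}
```

The cost is one extra collision (anything that used to hash to 0 now shares a bucket with things that hash to 1) in exchange for never rehashing.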
Answer by MB.
I think there's something important that the other answers so far are missing: the zero value exists so that the hashCode-caching mechanism works robustly in a multi-threaded environment.
If you had two variables, like cachedHashCode itself and an isHashCodeCalculated boolean to indicate whether cachedHashCode had been calculated, you'd need thread synchronization for things to work in a multithreaded environment. And synchronization would be bad for performance, especially since Strings are very commonly reused in multiple threads.
My understanding of the Java memory model is a little sketchy, but here's roughly what's going on:
1. When multiple threads access a variable (like the cached hashCode), there's no guarantee that each thread will see the latest value. If a variable starts at zero, then thread A updates it (sets it to a non-zero value), and thread B reads it shortly afterwards, thread B could still see the zero value.
2. There's another problem with accessing shared values from multiple threads (without synchronization) - you can end up trying to use an object that's only been partly initialized (constructing an object is not an atomic process). Multi-threaded reads and writes of 64-bit primitives like longs and doubles are not necessarily atomic either, so if two threads try to read and change the value of a long or a double, one thread can end up seeing something weird and partially set. Or something like that anyway. There are similar problems if you try to use two variables together, like cachedHashCode and isHashCodeCalculated - a thread can easily come along and see the latest version of one of those variables, but an older version of the other.
3. The usual way to get around these multi-threading issues is to use synchronization. For example, you could put all access to the cached hashCode inside a synchronized block, or you could use the volatile keyword (although be careful with that because the semantics are a little confusing).
4. However, synchronization slows things down. Bad idea for something like a string hashCode. Strings are very often used as keys in HashMaps, so you need the hashCode method to perform well, including in multi-threaded environments.
5. Java primitives that are 32 bits or less, like int, are special. Unlike, say, a long (64-bit value), you can be sure that you will never read a partially initialized value of an int (32 bits). When you read an int without synchronization, you can't be sure that you'll get the latest set value, but you can be sure that the value you do get is a value that has explicitly been set at some point by your thread or another thread.
The hashCode caching mechanism in java.lang.String is set up to rely on point 5 above. You might understand it better by looking at the source of java.lang.String.hashCode(). Basically, with multiple threads calling hashCode at once, hashCode might end up being calculated multiple times (either if the calculated value is zero or if multiple threads call hashCode at once and both see a zero cached value), but you can be sure that hashCode() will always return the same value. So it's robust, and it's performant too (because there's no synchronization to act as a bottleneck in multi-threaded environments).
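To make the idiom concrete, here is a rough sketch of the same pattern applied to a user-defined immutable class. Label is a made-up class, not anything from the JDK; the points to notice are that the shared field is read once into a local, written at most once per call, and that the local (never the field) is returned.

```java
class Label {
    private final char[] chars; // never modified after construction
    private int hash;           // deliberately not volatile; 0 means "not cached"

    Label(String s) {
        this.chars = s.toCharArray();
    }

    @Override
    public int hashCode() {
        int h = hash;           // single read of the shared field
        if (h == 0) {
            for (char c : chars) {
                h = 31 * h + c;
            }
            hash = h;           // single int write; racing threads all write the same value
        }
        return h;               // return the local, never re-read the field
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof Label && java.util.Arrays.equals(((Label) o).chars, chars);
    }

    public static void main(String[] args) throws InterruptedException {
        Label label = new Label("Allocator redistricts");
        int expected = "Allocator redistricts".hashCode();
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            // Each thread may recompute the hash, but must arrive at the same value.
            workers[i] = new Thread(() -> {
                if (label.hashCode() != expected) {
                    throw new IllegalStateException("inconsistent hash");
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        System.out.println("all threads agreed on " + expected);
    }
}
```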
Like I said, my understanding of the Java memory model is a little sketchy, but I'm pretty sure I've got the gist of the above right. Ultimately it's a very clever idiom for caching the hashCode without the overhead of synchronization.
Answer by Adamski
0 isn't cached as the implementation interprets a cached value of 0 as "cached value not yet initialised". The alternative would have been to use a java.lang.Integer, whereby null implied that the value was not yet cached. However, this would have meant an additional storage overhead.
Regarding the probability of a String's hash code being computed as 0 I would say the probability is quite low and can happen in the following cases:
- The String is empty (although recomputing this hash code each time is effectively O(1)).
- An overflow occurs whereby the final computed hash code is 0 (e.g. Integer.MAX_VALUE + h(c1) + h(c2) + ... h(cn) == 0).
- The String contains only Unicode character 0. Very unlikely as this is a control character with no meaning apart from in the "paper tape world" (!):
From Wikipedia:
Code 0 (ASCII code name NUL) is a special case. In paper tape, it is the case when there are no holes. It is convenient to treat this as a fill character without meaning otherwise.
Answer by cdunn2001
This turns out to be a good question, related to a security vulnerability.
"When hashing a string, Java also caches the hash value in the hash attribute, but only if the result is different from zero. Thus, the target value zero is particularly interesting for an attacker as it prevents caching and forces re-hashing."
Answer by The Coordinator
Well folks, it keeps 0 because if it is zero length, it will end up as zero anyways.
And it doesn't take long to figure out that the len is zero and so must the hashcode be.
So, for your code-reviewz! Here it is in all its Java 8 glory:
public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        char val[] = value;
        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}
As you can see, this will always return a quick zero if the string is empty:
if (h == 0 && value.length > 0) ...
Answer by Mike Liddell
The "avoid 0" suggestion seems appropriate to recommend as best practice as it helps a genuine problem (seriously unexpected performance degradation in constructible cases that can be attacker supplied) for the meager cost of a branch operation prior to a write. There is some remaining 'unexpected performance degradation' that can be exercised if the only things going into a set hash to the special adjusted value. But this is at worst a 2x degradation rather than unbounded.
Of course, String's implementation can't be changed but there is no need to perpetuate the problem.
Answer by Stephen C
- Why doesn't String's hashCode() cache 0?
The value zero is reserved as meaning "the hash code is not cached".
- What is the probability that a Java string hashes to 0?
According to the Javadoc, the formula for a String's hashcode is:
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
using int arithmetic, where s[i] is the ith character of the string and n is the length of the string. (The hash of the empty String is defined to be zero as a special case.)
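The formula can be evaluated without computing any powers of 31 by folding from the left (Horner's rule), with int overflow simply wrapping around. A small sketch, with PolyHash and polyHash as illustrative names, that compares the result against String.hashCode():

```java
class PolyHash {
    // Evaluates s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] in wrapping int arithmetic.
    static int polyHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i); // multiply the running total by 31, add the next char
        }
        return h;
    }

    public static void main(String[] args) {
        String[] samples = { "", "abc", "pollinating sandboxes" };
        for (String s : samples) {
            // The two columns should agree for every input.
            System.out.println("\"" + s + "\": " + polyHash(s) + " vs " + s.hashCode());
        }
    }
}
```

For "abc" this works out to 97*961 + 98*31 + 99 = 96354.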
My intuition is that the hashcode function as above gives a uniform spread of String hash values across the range of int values. A uniform spread would mean that the probability of a randomly generated String hashing to zero is 1 in 2^32.
- What's the best way to avoid the performance penalty of recomputing the hash value every time for strings that hash to 0?
The best strategy is to ignore the issue. If you are repeatedly hashing the same String value, there is something rather strange about your algorithm.
- Is this the best-practice way of caching values? (i.e. cache all except one?)
This is a space versus time trade-off. AFAIK, the alternatives are:
- Add a cached flag to each String object, making every Java String take an extra word.
- Use the top bit of the hash member as the cached flag. That way you can cache all hash values, but you only have half as many possible String hash values.
- Don't cache hashcodes on Strings at all.
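As an illustration of the first alternative, here is a hedged sketch of a key class with a separate flag. FlaggedKey and its fields are hypothetical names, and the computations counter exists only to make the effect visible (it is not thread-safe). The flag is set only when the computed hash is exactly 0, so each computation writes at most one of the two fields. I believe later JDK releases handle zero hashes in String along similar lines, but treat that as an assumption rather than something this thread establishes.

```java
class FlaggedKey {
    private final String text;
    private int hash;           // cached hash when it is non-zero, otherwise 0
    private boolean hashIsZero; // set only when the computed hash is exactly 0
    int computations;           // demo-only counter of full recomputations

    FlaggedKey(String text) {
        this.text = text;
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0 && !hashIsZero) {
            computations++;
            for (int i = 0; i < text.length(); i++) {
                h = 31 * h + text.charAt(i);
            }
            if (h == 0) {
                hashIsZero = true; // remember that 0 is the real answer
            } else {
                hash = h;
            }
        }
        return h;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof FlaggedKey && ((FlaggedKey) o).text.equals(text);
    }

    public static void main(String[] args) {
        FlaggedKey key = new FlaggedKey("pollinating sandboxes");
        for (int i = 0; i < 3; i++) {
            key.hashCode();
        }
        System.out.println("hash " + key.hashCode() + " computed " + key.computations + " time(s)");
    }
}
```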
I think that the Java designers have made the right call for Strings, and I'm sure that they have done extensive profiling that confirms the soundness of their decision. However, it does not follow that this would always be the best way to deal with caching.
(Note that there are two "common" String values which hash to zero; the empty String, and the String consisting of just a NUL character. However, the cost of calculating the hashcodes for these values is small compared with the cost of calculating the hashcode for a typical String value.)

