Java 在 HashMap 中使用 String 键是个坏主意?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/1516549/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-12 13:44:20  来源:igfitidea点击:

Bad idea to use String key in HashMap?

javastringmaphashcode

提问by Marcus Leon

I understand that the String class' hashCode()method is notguarantied to generate unique hash codes for distinct String-s. I see a lot of usage of putting String keys into HashMap-s (using the default String hashCode() method). A lot of this usage could result in significant application issues if a map putdisplaced a HashMap entry that was previously put onto the map with a truely distinct String key.

我知道 String 类的hashCode()方法不能保证为不同的 String-s 生成唯一的哈希码。我看到很多将 String 键放入 HashMap-s 的用法(使用默认的 String hashCode() 方法)。如果映射put替换了先前使用真正不同的字符串键放在映射上的 HashMap 条目,则很多这种用法可能会导致严重的应用程序问题。

What are the odds that you will run into the scenario where String.hashCode() returns the same value for distinct String-s? How do developers work around this issue when the key is a String?

您遇到 String.hashCode() 为不同的 String-s 返回相同值的情况的几率有多大?当键是字符串时,开发人员如何解决这个问题?

采纳答案by CPerkins

Developers do not have to work around the issue of hash collisions in HashMap in order to achieve program correctness.

开发人员不必为了实现程序正确性而解决 HashMap 中的哈希冲突问题。

There are a couple of key things to understand here:

这里有几个关键的事情需要理解:

  1. Collisions are an inherent feature of hashing, and they have to be. The number of possible values (Strings in your case, but it applies to other types as well) is vastly bigger than the range of integers.

  2. Every usage of hashing has a way to handle collisions, and the Java Collections (including HashMap) is no exception.

  3. Hashing is not involved in equality testing. It is true that equal objects must have equal hashcodes, but the reverse is not true: many values will have the same hashcode. So don't try using a hashcode comparison as a substitute for equality. Collections don't. They use hashing to select a sub-collection (called a bucket in the Java Collections world), but they use .equals() to actually check for equality.

  4. Not only do you not have to worry about collisions causing incorrect results in a collection, but for most applications, you also *usually* don't have to worry about performance - Java hashed Collections do a pretty good job of managing hashcodes.

  5. Better yet, for the case you asked about (Strings as keys), you don't even have to worry about the hashcodes themselves, because Java's String class generates a pretty good hashcode. So do most of the supplied Java classes.
  1. 冲突是散列的固有特征,而且必须如此。可能值的数量(在您的情况下是字符串,但它也适用于其他类型)远远大于整数范围。

  2. 散列的每种用法都有处理冲突的方法,Java 集合(包括 HashMap)也不例外。

  3. 哈希不涉及相等性测试。确实,相等的对象必须具有相同的哈希码,但反之则不然:许多值将具有相同的哈希码。所以不要尝试使用哈希码比较来代替相等。集合没有。他们使用散列来选择一个子集合(在 Java 集合世界中称为存储桶),但他们使用 .equals() 来实际检查相等性。

  4. 您不仅不必担心在集合中导致错误结果的冲突,而且对于大多数应用程序,您*通常*也不必担心性能 - Java 散列集合在管理散列码方面做得非常好。

  5. 更好的是,对于您询问的情况(字符串作为键),您甚至不必担心哈希码本身,因为 Java 的 String 类生成了一个非常好的哈希码。大多数提供的 Java 类也是如此。

Some more detail, if you want it:

一些更多的细节,如果你想要的话:

The way hashing works (in particular, in the case of hashed collections like Java's HashMap, which is what you asked about) is this:

散列的工作方式(特别是在像 Java 的 HashMap 这样的散列集合的情况下,这就是您所问的)是这样的:

  • The HashMap stores the values you give it in a collection of sub-collections, called buckets. These are actually implemented as linked lists. There are a limited number of these: iirc, 16 to start by default, and the number increases as you put more items into the map. There should always be more buckets than values. To provide one example, using the defaults, if you add 100 entries to a HashMap, there will be 256 buckets.

  • Every value which can be used as a key in a map must be able to generate an integer value, called the hashcode.

  • The HashMap uses this hashcode to select a bucket. Ultimately, this means taking the integer value modulothe number of buckets, but before that, Java's HashMap has an internal method (called hash()), which tweaks the hashcode to reduce some known sources of clumping.

  • When looking up a value, the HashMap selects the bucket, and then searches for the individual element by a linear search of the linked list, using .equals().

  • HashMap 将您提供的值存储在子集合的集合中,称为桶。这些实际上是作为链表实现的。这些数量有限:iirc,默认为 16,并且随着您将更多项目放入地图,数量会增加。应该总是有比值更多的桶。举一个例子,使用默认值,如果向 HashMap 添加 100 个条目,将有 256 个桶。

  • 每个可以用作映射中键的值都必须能够生成一个整数值,称为哈希码。

  • HashMap 使用这个哈希码来选择一个桶。最终,这意味着将整数值modulo作为桶的数量,但在此之前,Java 的 HashMap 有一个内部方法(称为hash()),它调整哈希码以减少一些已知的聚集源。

  • 查找值时,HashMap 选择桶,然后通过链表的线性搜索来搜索单个元素,使用.equals().

So: you don't have to work around collisions for correctness, and you usually don't have to worry about them for performance, and if you're using native Java classes (like String), you don't have to worry about generating the hashcode values either.

所以:您不必为了正确性而解决冲突,而且您通常不必担心它们的性能,如果您使用本机 Java 类(如 String),则不必担心生成哈希码值。

In the case where you do have to write your own hashcode method (which means you've written a class with a compound value, like a first name/last name pair), things get slightly more complicated. It's quite possible to get it wrong here, but it's not rocket science. First, know this: the only thing you must do in order to assure correctness is to assure that equal objects yield equal hashcodes. So if you write a hashcode() method for your class, you must also write an equals() method, and you must examine the same values in each.

如果您必须编写自己的 hashcode 方法(这意味着您编写了一个具有复合值的类,例如名字/姓氏对),事情会稍微复杂一些。在这里很可能出错,但这不是火箭科学。首先,知道这一点:为了确保正确性,你唯一必须做的就是确保相等的对象产生相等的哈希码。因此,如果为类编写 hashcode() 方法,则还必须编写 equals() 方法,并且必须检查每个方法中的相同值。

It is possible to write a hashcode() method which is bad but correct, by which I mean that it would satisfy the "equal objects must yield equal hashcodes" constraint, but still perform very poorly, by having a lot of collisions.

有可能编写一个不好但正确的 hashcode() 方法,我的意思是它会满足“相等的对象必须产生相等的哈希码”约束,但由于有很多冲突,性能仍然很差。

The canonical degenerate worst case of this would be to write a method which simply returns a constant value (e.g., 3) for all cases. This would mean that every value would be hashed into the same bucket.

这种规范退化的最坏情况是编写一个方法,该方法只为所有情况返回一个常量值(例如,3)。这意味着每个值都将被散列到同一个桶中。

It would still work, but performance would degrade to that of a linked list.

它仍然可以工作,但性能会下降到链表的性能。

Obviously, you won't write such a terrible hashcode() method. If you're using a decent IDE, it's capable of generating one for you. Since StackOverflow loves code, here's the code for the firstname/lastname class above.

显然,您不会编写如此糟糕的 hashcode() 方法。如果您使用的是不错的 IDE,它可以为您生成一个。由于 StackOverflow 喜欢代码,这里是上面 firstname/lastname 类的代码。


public class SimpleName {
    private String firstName;
    private String lastName;
    public SimpleName(String firstName, String lastName) {
        super();
        this.firstName = firstName;
        this.lastName = lastName;
    }
    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result
                + ((firstName == null) ? 0 : firstName.hashCode());
        result = prime * result
                + ((lastName == null) ? 0 : lastName.hashCode());
        return result;
    }
    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        SimpleName other = (SimpleName) obj;
        if (firstName == null) {
            if (other.firstName != null)
                return false;
        } else if (!firstName.equals(other.firstName))
            return false;
        if (lastName == null) {
            if (other.lastName != null)
                return false;
        } else if (!lastName.equals(other.lastName))
            return false;
        return true;
    }
}

回答by coobird

I strongly suspect that the HashMap.putmethod does not determine whether the key is the same by just looking at String.hashCode.

我强烈怀疑该HashMap.put方法并不能仅通过查看String.hashCode.

There is definitely going to be a chance of a hash collision, so one would expect that the String.equalsmethod will also be called to be sure that the Strings are truly equal, if there is indeed a case where the two Strings have the same value returned from hashCode.

肯定会有散列冲突的可能性,因此人们会期望该String.equals方法也将被调用以确保Strings 真正相等,如果确实存在两个Strings 具有相同返回值的情况hashCode.

Therefore, the new key Stringwould only be judged to be the same key Stringas one that is already in the HashMapif and only if the value returned by hashCodeis equal, and the equalsmethod returns true.

因此,String只有当且仅当 返回的值相等时,新键才会被判断为与String已经存在的键相同,并且该方法返回。HashMaphashCodeequalstrue

Also to add, this thought would also be true for classes other than String, as the Objectclass itself already has the hashCodeand equalsmethods.

另外要补充的是,这种想法也适用于 以外StringObject类,因为类本身已经具有hashCodeequals方法。

Edit

编辑

So, to answer the question, no, it would not be a bad idea to use a Stringfor a key to a HashMap.

因此,要回答这个问题,不,使用 aString作为 a 的键并不是一个坏主意HashMap

回答by Keith Randall

You are talking about hash collisions. Hash collisions are an issue regardless of the type being hashCode'd. All classes that use hashCode (e.g. HashMap) handle hash collisions just fine. For example, HashMap can store multiple objects per bucket.

你在谈论哈希冲突。无论哈希编码的类型如何,哈希冲突都是一个问题。所有使用 hashCode(例如 HashMap)的类都可以很好地处理散列冲突。例如,HashMap 可以在每个桶中存储多个对象。

Don't worry about it unless you are calling hashCode yourself. Hash collisions, though rare, don't break anything.

除非您自己调用 hashCode,否则不要担心。哈希冲突虽然很少见,但不会破坏任何东西。

回答by Michael Borgwardt

This is not an issue, it's just how hashtables work. It's provably impossible to have distinct hashcodes for all distinct strings, because there are far more distinct strings than integers.

这不是问题,这只是哈希表的工作方式。可以证明,所有不同的字符串都有不同的哈希码是不可能的,因为不同的字符串比整数多得多。

As others have written, hash collisions are resolved via the equals() method. The only problem this can cause is degeneration of the hashtable, leading to bad performance. That's why Java's HashMap has a load factor, a ratio between buckets and inserted elements which, when exceeded, will cause rehashing of the table with twice the number of buckets.

正如其他人所写,散列冲突是通过 equals() 方法解决的。这可能导致的唯一问题是哈希表的退化,从而导致性能不佳。这就是为什么 Java 的 HashMap 有一个load factor,即存储桶和插入元素之间的比率,当超过该比率时,将导致以两倍的存储桶数量重新散列表。

This generally works very well, but only if the hash function is good, i.e. does not result in more than the statistically expected number of collisions for your particular input set. String.hashCode()is good in this regard, but this was not always so. Allegedly, prior to Java 1.2 it only inlcuded every n'th character. This was faster, but caused predictable collisions for all String sharing each n'th character - very bad if you're unluck enough to have such regular input, or if someone want to do a DOS attack on your app.

这通常非常有效,但前提是哈希函数良好,即不会导致特定输入集的统计预期冲突次数。String.hashCode()在这方面是好的,但并非总是如此。据称,在 Java 1.2 之前,它只包含每个第 n 个字符。这速度更快,但会导致共享每个第 n 个字符的所有 String 发生可预测的冲突 - 如果您运气不好有这样的常规输入,或者有人想对您的应用程序进行 DOS 攻击,那就太糟糕了。

回答by dberm22

I direct you to the answer here. While it is not a badidea to use strings( @CPerkins explained why, perfectly), storing the values in a hashmap with integer keysis better, since it is generally quicker(although unnoticeably) and has lower chance (actually, no chance) of collisions.

我将您引向这里的答案。虽然使用字符串不是一个主意(@CPerkins 完美地解释了原因),但将值存储在带有整数键的哈希图中更好,因为它通常更快(虽然不明显)并且机会较低(实际上,没有机会)的碰撞。

See this chart of collisions using 216553 keys in each case, (stolen from this post, reformatted for our discussion)

请参阅此在每种情况下使用 216553 个密钥的冲突图表,(从这篇文章中窃取,重新格式化以供我们讨论)

Hash           Lowercase      Random UUID  Numbers 
=============  =============  ===========  ==============
Murmur            145 ns      259 ns          92 ns
                    6 collis    5 collis       0 collis
FNV-1a            152 ns      504 ns          86 ns
                    4 collis    4 collis       0 collis
FNV-1             184 ns      730 ns          92 ns
                    1 collis    5 collis       0 collis*
DBJ2a             158 ns      443 ns          91 ns
                    5 collis    6 collis       0 collis***
DJB2              156 ns      437 ns          93 ns
                    7 collis    6 collis       0 collis***
SDBM              148 ns      484 ns          90 ns
                    4 collis    6 collis       0 collis**
CRC32             250 ns      946 ns         130 ns
                    2 collis    0 collis       0 collis

Avg Time per key    0.8ps       2.5ps         0.44ps
Collisions (%)      0.002%      0.002%         0%
Hash           Lowercase      Random UUID  Numbers 
=============  =============  ===========  ==============
Murmur            145 ns      259 ns          92 ns
                    6 collis    5 collis       0 collis
FNV-1a            152 ns      504 ns          86 ns
                    4 collis    4 collis       0 collis
FNV-1             184 ns      730 ns          92 ns
                    1 collis    5 collis       0 collis*
DBJ2a             158 ns      443 ns          91 ns
                    5 collis    6 collis       0 collis***
DJB2              156 ns      437 ns          93 ns
                    7 collis    6 collis       0 collis***
SDBM              148 ns      484 ns          90 ns
                    4 collis    6 collis       0 collis**
CRC32             250 ns      946 ns         130 ns
                    2 collis    0 collis       0 collis

Avg Time per key    0.8ps       2.5ps         0.44ps
Collisions (%)      0.002%      0.002%         0%

Of course, the number of integers is limited to 2^32, where as there is no limit to the number of strings (and there is no theoretical limit to the amount of keys that can be stored in a HashMap). If you use a long(or even a float), collisions will be inevitable, and therefore no "better" than a string. However, even despite hash collisions, put()and get()will always put/get the correct key-value pair (See edit below).

当然,整数的数量限制为 2^32,因为字符串的数量没有限制(并且可以存储在 a 中的键数量没有理论上的限制HashMap)。如果您使用 a long(甚至 a float),碰撞将是不可避免的,因此没有比字符串“更好”的了。然而,即使哈希冲突,put()并且get()总是会放置/获取正确的键值对(请参阅下面的编辑)。

In the end, it really doesn't matter, so use whatever is more convenient. But if convenience makes no difference, and you do not intend to have more than 2^32 entries, I suggest you use intsas keys.

最后,真的没有关系,所以使用更方便的东西。但是如果方便没有区别,并且您不打算有超过 2^32 个条目,我建议您将其ints用作键。



EDIT

编辑

While the above is definitely true, NEVER use "StringKey".hashCode() to generate a key in place of the original Stringkey for performance reasons- 2 different strings can have the same hashCode, causing overwriting on your put()method. Java's implementation of HashMapis smart enough to handle strings (any type of key, actually) with the same hashcode automatically, so it is wise to let Java handle these things for you.

虽然上述内容绝对正确,但String出于性能原因,切勿使用 "StringKey".hashCode() 生成一个键来代替原始键 - 2 个不同的字符串可以具有相同的 hashCode,从而导致您的put()方法被覆盖。Java 的实现HashMap足够智能,可以自动处理具有相同哈希码的字符串(实际上是任何类型的键),因此让 Java 为您处理这些事情是明智的。