HashMap Java 8 Implementation
Disclaimer: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original author (not me): StackOverflow
Original question: http://stackoverflow.com/questions/43911369/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
HashMap Java 8 implementation
Asked by Hasnain Ali Bohra
As per the following link document: Java HashMap Implementation
I'm confused with the implementation of HashMap (or rather, an enhancement in HashMap). My queries are:
Firstly
static final int TREEIFY_THRESHOLD = 8;
static final int UNTREEIFY_THRESHOLD = 6;
static final int MIN_TREEIFY_CAPACITY = 64;
Why and how are these constants used? I want some clear examples of this. How do they achieve a performance gain with this?
Secondly
If you see the source code of HashMap in the JDK, you will find the following static inner class:
static final class TreeNode<K, V> extends LinkedHashMap.Entry<K, V> {
    TreeNode<K, V> parent;  // red-black tree links
    TreeNode<K, V> left;
    TreeNode<K, V> right;
    TreeNode<K, V> prev;    // needed to unlink next upon deletion
    boolean red;

    TreeNode(int hash, K key, V val, Node<K, V> next) {
        super(hash, key, val, next);
    }

    /** Returns the root of the tree containing this node. */
    final TreeNode<K, V> root() {
        for (TreeNode<K, V> r = this, p;;) {
            if ((p = r.parent) == null)
                return r;
            r = p;
        }
    }
    // ...
}
How is it used? I just want an explanation of the algorithm.
Accepted answer by Michael
HashMap contains a certain number of buckets. It uses hashCode to determine which bucket to put these into. For simplicity's sake, imagine it as a modulus.
If our hashcode is 123456 and we have 4 buckets, 123456 % 4 = 0, so the item goes in the first bucket, Bucket 1.
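A minimal sketch of this simplified picture (the helper name is made up for illustration; the real HashMap uses a bit mask rather than %, as a later answer explains):

```java
public class BucketDemo {
    // Simplified model only: the real HashMap uses (n - 1) & hash rather
    // than %, but for a power-of-two bucket count the result is the same.
    static int bucketFor(int hashCode, int buckets) {
        return Math.floorMod(hashCode, buckets); // floorMod avoids negative indices
    }

    public static void main(String[] args) {
        System.out.println(bucketFor(123456, 4)); // prints 0, i.e. the first bucket
    }
}
```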
If our hashcode function is good, it should provide an even distribution so all the buckets will be used somewhat equally. In this case, the bucket uses a linked list to store the values.
But you can't rely on people to implement good hash functions. People will often write poor hash functions which will result in a non-even distribution. It's also possible that we could just get unlucky with our inputs.
The less even this distribution is, the further we're moving from O(1) operations and the closer we're moving towards O(n) operations.
The implementation of HashMap tries to mitigate this by organising some buckets into trees rather than linked lists if the buckets become too large. This is what TREEIFY_THRESHOLD = 8 is for. If a bucket contains more than eight items, it should become a tree.
This tree is a Red-Black tree. It is first sorted by hash code. If the hash codes are the same, it uses the compareTo method of Comparable if the objects implement that interface, else the identity hash code.
If entries are removed from the map, the number of entries in the bucket might reduce such that this tree structure is no longer necessary. That's what the UNTREEIFY_THRESHOLD = 6 is for. If the number of elements in a bucket drops below six, we might as well go back to using a linked list.
Finally, there is the MIN_TREEIFY_CAPACITY = 64.
When a hash map grows in size, it automatically resizes itself to have more buckets. If we have a small hash map, the likelihood of us getting very full buckets is quite high, because we don't have that many different buckets to put stuff into. It's much better to have a bigger hash map, with more buckets that are less full. This constant basically says not to start making buckets into trees if our hash map is very small - it should resize to be larger first instead.
To answer your question about the performance gain, these optimisations were added to improve the worst case. I'm only speculating, but you would probably only see a noticeable performance improvement from these optimisations if your hashCode function was not very good.
Answer by Eran
TreeNode is an alternative way to store the entries that belong to a single bin of the HashMap. In older implementations the entries of a bin were stored in a linked list. In Java 8, if the number of entries in a bin passes a threshold (TREEIFY_THRESHOLD), they are stored in a tree structure instead of the original linked list. This is an optimization.
From the implementation:
/*
* Implementation notes.
*
* This map usually acts as a binned (bucketed) hash table, but
* when bins get too large, they are transformed into bins of
* TreeNodes, each structured similarly to those in
* java.util.TreeMap. Most methods try to use normal bins, but
* relay to TreeNode methods when applicable (simply by checking
* instanceof a node). Bins of TreeNodes may be traversed and
* used like any others, but additionally support faster lookup
* when overpopulated. However, since the vast majority of bins in
* normal use are not overpopulated, checking for existence of
* tree bins may be delayed in the course of table methods.
*/
Answer by rentedrainbow
You would need to visualize it: say there is a class Key with only the hashCode() function overridden to always return the same value.
public class Key implements Comparable<Key> {
    private String name;

    public Key(String name) {
        this.name = name;
    }

    @Override
    public int hashCode() {
        return 1; // every instance collides on purpose
    }

    public String keyName() {
        return this.name;
    }

    @Override
    public int compareTo(Key key) {
        // returns a +ve or -ve integer; comparing by name is one natural choice
        return this.name.compareTo(key.name);
    }
}
and then somewhere else, I am inserting 9 entries into a HashMap with all keys being instances of this class, e.g.
Map<Key, String> map = new HashMap<>();
Key key1 = new Key("key1");
map.put(key1, "one");
Key key2 = new Key("key2");
map.put(key2, "two");
Key key3 = new Key("key3");
map.put(key3, "three");
Key key4 = new Key("key4");
map.put(key4, "four");
Key key5 = new Key("key5");
map.put(key5, "five");
Key key6 = new Key("key6");
map.put(key6, "six");
Key key7 = new Key("key7");
map.put(key7, "seven");
Key key8 = new Key("key8");
map.put(key8, "eight");
// Since hashCode() is the same, all entries land in the same bucket, say bucket 1.
// Up to here, the entries in bucket 1 are arranged in a linked-list structure,
// e.g. key1 -> key2 -> key3 -> ... and so on. But when I insert one more entry:
Key key9 = new Key("key9");
map.put(key9, "nine");
the threshold value of 8 will be reached, and it will rearrange the bucket 1 entries into a tree (red-black) structure, replacing the old linked list. e.g.
            key1
           /    \
       key2      key3
      /    \    /    \
    ...
Tree traversal is faster {O(log n)} than LinkedList {O(n)} and as n grows, the difference becomes more significant.
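To see that a map full of colliding keys still behaves well, here is a hedged sketch (the Key type below is hypothetical, mirroring the class above but with equals() added so lookups work):

```java
import java.util.HashMap;
import java.util.Map;

public class CollisionDemo {
    // Hypothetical key type: hashCode() is constant, so every entry lands
    // in the same bucket, which gets treeified once it passes 8 entries.
    static final class Key implements Comparable<Key> {
        final String name;
        Key(String name) { this.name = name; }
        @Override public int hashCode() { return 1; }
        @Override public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).name.equals(this.name);
        }
        @Override public int compareTo(Key k) { return this.name.compareTo(k.name); }
    }

    static Map<Key, Integer> buildColliding(int n) {
        Map<Key, Integer> map = new HashMap<>();
        for (int i = 0; i < n; i++) {
            map.put(new Key("key" + i), i);
        }
        return map;
    }

    public static void main(String[] args) {
        Map<Key, Integer> map = buildColliding(10_000);
        // Despite 10,000 colliding keys, each lookup walks O(log n) tree
        // nodes rather than O(n) list nodes.
        System.out.println(map.get(new Key("key9999"))); // prints 9999
    }
}
```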
Answer by Eugene
To put it simpler (as much as I could), plus some more details.
These properties depend on a lot of internal things that would be very cool to understand - before moving to them directly.
TREEIFY_THRESHOLD -> when a single bucket reaches this (and the total number of buckets exceeds MIN_TREEIFY_CAPACITY), it is transformed into a perfectly balanced red/black tree node. Why? Because of search speed. Think about it in a different way:
it would take at most 32 steps to search for an Entry within a bucket/bin with Integer.MAX_VALUE entries.
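The 32-step figure follows from the depth of a balanced binary tree: about log2(n) comparisons for n entries. A quick check:

```java
public class TreeDepthDemo {
    public static void main(String[] args) {
        // A balanced binary tree over n entries is searched in about log2(n) steps.
        // For Integer.MAX_VALUE (2^31 - 1) entries:
        double steps = Math.log(Integer.MAX_VALUE) / Math.log(2);
        System.out.println((int) Math.ceil(steps)); // prints 31, i.e. at most ~32 steps
    }
}
```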
Some intro for the next topic: why is the number of bins/buckets always a power of two? At least two reasons: it is faster than the modulo operation, and modulo on negative numbers is negative. You can't put an Entry into a "negative" bucket:
int arrayIndex = hashCode % buckets; // will be negative
buckets[arrayIndex] = Entry; // obviously will fail
Instead, there is a nice trick used instead of modulo:
(n - 1) & hash // n is the number of bins, hash - is the hash function of the key
That is semantically the same as the modulo operation (when n is a power of two). It will keep the lower bits. This has an interesting consequence when you do:
Map<String, String> map = new HashMap<>();
In the case above, the decision of where an entry goes is taken based on the last 4 bits only of your hashcode (the default capacity is 16 buckets).
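A small check of that claim, using Math.floorMod as the non-negative modulo for comparison:

```java
public class IndexTrickDemo {
    static int byMask(int hash, int n) { return (n - 1) & hash; }

    public static void main(String[] args) {
        int n = 16; // bucket count, a power of two
        for (int hash : new int[] { 123456, -42, Integer.MIN_VALUE, 7 }) {
            // The mask keeps the low bits and matches the non-negative modulo,
            // while never producing a negative index.
            System.out.println(byMask(hash, n) + " == " + Math.floorMod(hash, n));
        }
    }
}
```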
This is where multiplying the buckets comes into play. Under certain conditions (would take a lot of time to explain in exact details), buckets are doubled in size. Why? When buckets are doubled in size, there is one more bit coming into play.
So you have 16 buckets: the last 4 bits of the hashcode decide where an entry goes. You double the buckets: with 32 buckets, the last 5 bits decide where the entry will go.
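For example (0b10110 is an arbitrary hash chosen to show the effect of the extra bit):

```java
public class ResizeBitDemo {
    public static void main(String[] args) {
        int hash = 0b10110; // 22: low 4 bits are 0110, the 5th bit is set
        // With 16 buckets, only the last 4 bits matter:
        System.out.println((16 - 1) & hash); // prints 6
        // After doubling to 32 buckets, one more bit comes into play,
        // so this entry re-hashes into a different bucket:
        System.out.println((32 - 1) & hash); // prints 22
    }
}
```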
As such, this process is called re-hashing. This might get slow. That is (for people who care) why HashMap is "joked" about as: fast, fast, fast, slooow. There are other implementations - search for "pauseless hashmap"...
Now UNTREEIFY_THRESHOLD comes into play after re-hashing. At that point, some entries might move from this bin to others (they add one more bit to the (n - 1) & hash computation - and as such might move to other buckets) and the bin might reach this UNTREEIFY_THRESHOLD. At this point it does not pay off to keep the bin as a red-black tree node, but as a LinkedList instead, like
entry.next.next....
MIN_TREEIFY_CAPACITY is the minimum number of buckets before a certain bucket is transformed into a tree.
Answer by Anton Krosnev
The change in the HashMap implementation was added with JEP 180. The purpose was to:
Improve the performance of java.util.HashMap under high hash-collision conditions by using balanced trees rather than linked lists to store map entries. Implement the same improvement in the LinkedHashMap class
However, pure performance is not the only gain. It will also prevent a HashDoS attack, in case a hash map is used to store user input, because the red-black tree that is used to store data in the bucket has worst-case insertion complexity of O(log n). The tree is used after certain criteria are met - see Eugene's answer.
Answer by Avinash
To understand the internal implementation of HashMap, you need to understand hashing. Hashing, in its simplest form, is a way to assign a unique code to any variable/object after applying a formula/algorithm to its properties.
A true hash function must follow this rule:
“Hash function should return the same hash code each and every time when the function is applied on same or equal objects. In other words, two equal objects must produce the same hash code consistently.”
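A hedged illustration of that rule, with a hypothetical Point class whose equals() and hashCode() are kept consistent:

```java
import java.util.Objects;

public class HashContractDemo {
    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }

        @Override public boolean equals(Object o) {
            return o instanceof Point && ((Point) o).x == x && ((Point) o).y == y;
        }

        // Equal objects must produce the same hash code, consistently.
        @Override public int hashCode() {
            return Objects.hash(x, y);
        }
    }

    public static void main(String[] args) {
        Point a = new Point(1, 2);
        Point b = new Point(1, 2);
        System.out.println(a.equals(b));                  // prints true
        System.out.println(a.hashCode() == b.hashCode()); // prints true
    }
}
```

If hashCode() broke this rule, two equal Points could land in different buckets and a get() after a put() could miss.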