为什么 Sun Java 中的 HashSet 实现使用 HashMap 作为其支持?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/2235546/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-10-29 20:12:53  来源:igfitidea点击:

Why does HashSet implementation in Sun Java use HashMap as its backing?

javahashmaphashset

提问by Randy Sugianto 'Yuku'

Looking at the source of Java 6, HashSet<E>is actually implemented using HashMap<E,Object>, using dummy object instance on every entry of the Set.

查看Java 6的源代码,HashSet<E>实际上是使用 实现的HashMap<E,Object>,在Set的每个条目上使用虚拟对象实例。

I think that wastes 4 byte (on 32-bit machines) for the size of the entry itself.

我认为这浪费了条目本身的大小 4 个字节(在 32 位机器上)。

But, why is it still used? Is there any reason to use it besides making it easier to maintain the codes?

但是,为什么它仍然被使用?除了更容易维护代码之外,还有什么理由使用它吗?

采纳答案by JXG

Actually, it's not just HashSet. Allimplementations of the Setinterface in Java 6 are based on an underlying Map. This is not a requirement; it's just the way the implementation is. You can see for yourself by checking out the documentation for the various implementations of Set.

其实,不仅如此HashSet。 Java 6 中接口的所有实现Set都基于底层的Map. 这不是必需的;这只是实现的方式。您可以通过查看Set.

Your main questions are

您的主要问题是

But, why is it still used? Is there any reason to use it besides making it easier to maintain the codes?

但是,为什么它仍然被使用?除了更容易维护代码之外,还有什么理由使用它吗?

I assume that code maintenance is a big motivating factor. So is preventing duplication and bloat.

我认为代码维护是一个很大的激励因素。防止重复和膨胀也是如此。

Setand Mapare similar interfaces, in that duplicate elements are not allowed. (I think the only Setnotbacked by a Mapis CopyOnWriteArraySet, which is an unusual Collection, because it's immutable.)

SetMap是类似的接口,因为不允许重复的元素。(我认为唯一Set不受 a支持的MapCopyOnWriteArraySet,这是一个不寻常的 Collection,因为它是不可变的。)

Specifically:

具体来说:

From the documentation of Set:

文档Set

A collection that contains no duplicate elements. More formally, sets contain no pair of elements e1 and e2 such that e1.equals(e2), and at most one null element. As implied by its name, this interface models the mathematical set abstraction.

The Set interface places additional stipulations, beyond those inherited from the Collection interface, on the contracts of all constructors and on the contracts of the add, equals and hashCode methods. Declarations for other inherited methods are also included here for convenience. (The specifications accompanying these declarations have been tailored to the Set interface, but they do not contain any additional stipulations.)

The additional stipulation on constructors is, not surprisingly, that all constructors must create a set that contains no duplicate elements (as defined above).

不包含重复元素的集合。更正式地说,集合不包含一对元素 e1 和 e2,使得 e1.equals(e2),并且最多包含一个空元素。正如其名称所暗示的那样,该接口对数学集合抽象进行建模。

除了从 Collection 接口继承的那些之外,Set 接口对所有构造函数的契约以及 add、equals 和 hashCode 方法的契约进行了额外的规定。为方便起见,此处还包括其他继承方法的声明。(这些声明随附的规范已针对 Set 接口量身定制,但不包含任何其他规定。)

对构造函数的额外规定是,所有构造函数都必须创建一个不包含重复元素的集合(如上所定义),这并不奇怪。

And from Map:

来自Map

An object that maps keys to values. A map cannot contain duplicate keys; each key can map to at most one value.

将键映射到值的对象。地图不能包含重复的键;每个键最多可以映射到一个值。

If you can implement your Sets using existing code, any benefit (speed, for example) you can realize from existing code accrues to your Setas well.

如果您可以Set使用现有代码实现您的s,那么您可以从现有代码中获得的任何好处(例如速度)也会对您产生影响Set

If you choose to implement a Setwithout a Mapbacking, you have to duplicate code designed to prevent duplicate elements. Ah, the delicious irony.

如果您选择在Set没有Map支持的情况下实现 a ,则必须复制旨在防止重复元素的代码。啊,美味的讽刺。

That said, there's nothing preventing you from implementing your Sets differently.

也就是说,没有什么能阻止你Set以不同的方式实现你的s。

回答by Craig P. Motlin

My guess is that HashSet was originally implemented in terms of HashMap in order to get it done quickly and easily. In terms of lines of code, HashSet is a fraction of HashMap.

我的猜测是 HashSet 最初是根据 HashMap 实现的,以便快速轻松地完成它。就代码行数而言,HashSet 是 HashMap 的一小部分。

I would guess that the reason it still hasn't been optimized is fear of change.

我猜它仍然没有被优化的原因是害怕改变。

However, the waste is much worse than you think. On both 32-bit and 64-bit, HashSet is 4x larger than necessary, and HashMap is 2x larger than necessary. HashMap could be implemented with an array with keys and values in it (plus chains for collisions). That means two pointers per entry, or 16 bytes on a 64-bit VM. In fact, HashMap contains an Entry object per entry, which adds 8 bytes for the pointer to the Entry and 8 bytes for the Entry object header. HashSet also uses 32 bytes per element, but the waste is 4x instead of 2x since it only requires 8 bytes per element.

然而,浪费比你想象的要糟糕得多。在 32 位和 64 位上,HashSet 比所需大 4 倍,HashMap 比所需大 2 倍。HashMap 可以用一个包含键和值的数组来实现(加上用于冲突的链)。这意味着每个条目有两个指针,或者在 64 位 VM 上有 16 个字节。实际上,HashMap 每个条目包含一个 Entry 对象,它为指向 Entry 的指针增加了 8 个字节,为 Entry 对象标头增加了 8 个字节。HashSet 每个元素也使用 32 个字节,但浪费是 4x 而不是 2x,因为它每个元素只需要 8 个字节。

回答by Tom Hawtin - tackline

I am guessing that it has never turned up as a significant problem for real applications or important benchmarks. Why complicate the code for no real benefit?

我猜测它从未成为实际应用程序或重要基准测试的重大问题。为什么为了没有真正的好处而使代码复杂化?

Also note, that object sizes are rounded up in many JVM implementation, so there may not actually be an increase in size (I don't know for this example). Also the code for HashMapis likely to be compiled and in cache. Other things being equal, more code => more cache misses => lower performance.

还要注意,对象大小在许多 JVM 实现中被四舍五入,所以实际上可能没有增加大小(我不知道这个例子)。此外,代码HashMap很可能会被编译并在缓存中。在其他条件相同的情况下,更多的代码 => 更多的缓存未命中 => 更低的性能。

回答by Suraj Chandran

Yes you are right, a small amount of wastage is definetley there. Small because, for every entry it uses the same object PRESENT(which is declared final). Hence the only wastage is for every entry's value in the HashMap.

是的,您说得对,那里确实存在少量浪费。小是因为,对于每个条目,它使用相同的对象PRESENT(声明为 final)。因此唯一的浪费是 HashMap 中每个条目的值。

Mostly I think, they took this approach for maintainability and reusability. (The JCF developers would have thought, we have tested HashMap anyway, why not reuse it.)

大多数我认为,他们采用这种方法是为了可维护性和可重用性。(JCF 开发人员会想,反正我们已经测试了 HashMap,为什么不重用它。)

But if you are having huge collections, and you are a memory freak, then you may opt out for better alternatives like Troveor Google Collections.

但是,如果您拥有大量收藏,并且您是个记忆狂,那么您可以选择退出更好的替代品,例如TroveGoogle Collections

回答by Lombo

I looked at your question and it took me a while to think about what you said. So here's my opinion regarding the HashSetimplementation.

我看了你的问题,我花了一些时间来思考你说的话。所以这是我对HashSet实施的看法。

It is necessary to have the dummy instance to know if the value is or is not present in the set.

有必要让虚拟实例知道该值是否存在于集合中。

Take a look at the add method

看看add方法

public boolean add(E e) {
return map.put(e, PRESENT)==null;
}

Abd now let's take a look at the put return value

abd 现在我们来看看 put 的返回值

@returns the previous value associated with key, or null if there was no mapping for key. (A null return can also indicate that the map previously associated null with key.)

@returns 与 key 关联的前一个值,如果没有 key 的映射,则返回 null。(空返回也可以指示映射先前将空与键关联。)

So the PRESENTobject is just used to represent that the set contains the e value. I think you asked why not use nullinstead of PRESENT. But the, you would not be able to distinguish if the entry was previously on the map because map.put(key,value)would always return nulland you would not have way to know if the key existed.

所以PRESENT对象只是用来表示集合包含e值。我想你问为什么不使用null而不是PRESENT. 但是,您将无法区分该条目以前是否在地图上,因为它map.put(key,value)总是会返回null并且您无法知道该键是否存在。



That being said you could argue that they could have used an implementation like this

话虽如此,您可能会争辩说他们本可以使用这样的实现

   public boolean add(E e) {

        if( map.containsKey(e) ) {
            return false;
        }

        map.put(e, null);

        return true;

}

I guess they waste 4 bytes to avoid computing the hashCode, as it could be expensive, of the key two times (if the key is going to be added).

我猜他们浪费了 4 个字节来避免计算密钥两次的 hashCode,因为它可能很昂贵(如果要添加密钥)。



If you question referred to why they used a HashMapthat would waste 8 bytes (because of the Map.Entry) instead of some other data structure using a similar Entry of only 4, then yes, I would say they did it for the reasons you mentioned.

如果您质疑为什么他们使用HashMap会浪费 8 个字节的 8 个字节(因为Map.Entry)而不是其他一些使用仅 4 个类似条目的数据结构,那么是的,我会说他们这样做是出于您提到的原因。

回答by clwhisk

After searching through pages like this wondering why the mildly inefficient standard implementation, found com.carrotsearch.hppc.IntOpenHashSet

在搜索这样的页面后,想知道为什么标准实现效率稍低,找到了 com.carrotsearch.hppc.IntOpenHashSet

回答by Srujan Kumar Gulla

Your question: I think that wastes 4 byte (on 32-bit machines) for the size of the entry itself.

您的问题:我认为条目本身的大小浪费了 4 个字节(在 32 位机器上)。

Just one Object variable is created for the entire datastructure of hashset and doing that would save yourself from re-writing the entire hashMap kind of code again.

只为 hashset 的整个数据结构创建一个 Object 变量,这样做可以避免再次重写整个 hashMap 类型的代码。

private static final Object PRESENT = new Object();

private static final Object PRESENT = new Object();

All the keys are having one value i.e PRESENT object.

所有的键都有一个值,即 PRESENT 对象。