C++ std::unordered_map 是如何实现的

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/31112852/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-28 13:51:27  来源:igfitidea点击:

How std::unordered_map is implemented

c++c++11hashmapunordered-map

提问by ralzaul

c++ unordered_map collision handling , resize and rehash

c ++ unordered_map 碰撞处理,调整大小和重新散列

This is a previous question opened by me and I have seen that I am having a lot of confusion about how unordered_map is implemented. I am sure many other people shares that confusion with me. Based on the information I have know without reading the standard:

这是我之前提出的一个问题,我已经看到我对 unordered_map 的实现方式有很多困惑。我相信很多其他人跟我一样有这种困惑。根据我在没有阅读标准的情况下所知道的信息:

Every unordered_map implementation stores a linked list to external nodes in the array of buckets... No, that is not at all the most efficient way to implement a hash map for most common uses. Unfortunately, a small "oversight" in the specification of unordered_map all but requires this behavior. The required behavior is that iterators to elements must stay valid when inserting or deleting other elements

每个 unordered_map 实现都将一个链表存储到桶数组中的外部节点......不,这根本不是实现最常见用途的哈希映射的最有效方法。不幸的是, unordered_map 规范中的一个小“疏忽”几乎需要这种行为。所需的行为是元素的迭代器在插入或删除其他元素时必须保持有效

I was hoping that someone might explain the implementation and how it fits with the C++ standard definition ( in terms of performance requirements ) and if it is really not the most efficient way to implement an hash map data structure how it can be improved ?

我希望有人可以解释实现以及它如何符合 C++ 标准定义(在性能要求方面),如果它真的不是实现哈希映射数据结构的最有效方法,如何改进它?

回答by Tony Delroy

The Standard effectively mandates std::unordered_setand std::unordered_mapimplementations that use open hashing, which means an array of buckets, each of which holds the head of a logical (and typically actual) list. That requirement is subtle: it's a consequence of the default max load factor being 1.0 and the guarantee that the table will not be rehashed unless grown beyond that load factor: that would be impractical without chaining, as the collisions with closed hashing become overwhelming as the load factor approaches 1:

该标准有效地强制要求std::unordered_setstd::unordered_map实现使用开放散列,这意味着一个桶数组,每个桶都包含一个逻辑(通常是实际)列表的头部。这个要求很微妙:这是默认最大负载因子为 1.0 的结果,并且保证除非超过该负载因子,否则表不会被重新散列:如果没有链接,这将是不切实际的,因为与封闭散列的冲突变得势不可挡,因为负载系数接近 1:

23.2.5/15: The insertand emplacemembers shall not affect the validity of iterators if (N+n) < z * B, where Nis the number of elements in the container prior to the insert operation, nis the number of elements inserted, Bis the container's bucket count, and zis the container's maximum load factor.

amongst the Effectsof the constructor at 23.5.4.2/1: max_load_factor()returns 1.0.

23.2.5 / 15:的insertemplace成员不应影响迭代器的有效性,如果(N+n) < z * B,在这里N是在之前的插入操作的容器中的元素的数量,n是插入的元素的数量,B是容器的桶计数,并且z是容器的最大负载系数。

在 23.5.4.2/1 处的构造函数的效果中:max_load_factor()返回1.0

(To allow optimal iteration without passing over any empty buckets, GCC's implementation fills the buckets with iterators into a single singly-linked list holding all the values: the iterators point to the element immediately before that bucket's elements, so the next pointer there can be rewired if erasing the bucket's last value.)

(为了在不传递任何空桶的情况下实现最佳迭代,GCC 的实现将带有迭代器的桶填充到一个包含所有值的单向链表中:迭代器指向该桶元素之前的元素,因此下一个指针可以是如果擦除存储桶的最后一个值,则重新接线。)

Regarding the text you quote:

关于你引用的文字:

No, that is not at all the most efficient way to implement a hash map for most common uses. Unfortunately, a small "oversight" in the specification of unordered_map all but requires this behavior. The required behavior is that iterators to elements must stay valid when inserting or deleting other elements

不,这根本不是为大多数常见用途实现哈希映射的最有效方法。不幸的是, unordered_map 规范中的一个小“疏忽”几乎需要这种行为。所需的行为是元素的迭代器在插入或删除其他元素时必须保持有效

There is no "oversight"... what's done was very deliberate and done with full awareness. It's true that other compromises could have been struck, but the open hashing / chaining approach is a reasonable compromise for general use, that copes reasonably elegantly with collisions from mediocre hash functions, isn't too wasteful with small or large key/value types, and handles arbitrarily-many insert/erasepairs without gradually degrading performance the way many closed hashing implementations do.

没有“疏忽”……所做的一切都是经过深思熟虑的,并且是在充分意识到的情况下完成的。的确,其他的妥协可能已经被击中,但开放散列/链接方法对于一般用途来说是一个合理的妥协,它可以合理优雅地处理来自平庸散列函数的冲突,对于小的或大的键/值类型不会太浪费,并处理任意多个insert/erase对,而不会像许多封闭散列实现那样逐渐降低性能。

As evidence of the awareness, from Matthew Austern's proposal here:

作为意识的证据,来自Matthew Austern 的提议

I'm not aware of any satisfactory implementation of open addressing in a generic framework. Open addressing presents a number of problems:

? It's necessary to distinguish between a vacant position and an occupied one.

? It's necessary either to restrict the hash table to types with a default constructor, and to construct every array element ahead of time, or else to maintain an array some of whose elements are objects and others of which are raw memory.

? Open addressing makes collision management difficult: if you're inserting an element whose hash code maps to an already-occupied location, you need a policy that tells you where to try next. This is a solved problem, but the best known solutions are complicated.

? Collision management is especially complicated when erasing elements is allowed. (See Knuth for a discussion.) A container class for the standard library ought to allow erasure.

? Collision management schemes for open addressing tend to assume a fixed size array that can hold up to N elements. A container class for the standard library ought to be able to grow as necessary when new elements are inserted, up to the limit of available memory.

Solving these problems could be an interesting research project, but, in the absence of implementation experience in the context of C++, it would be inappropriate to standardize an open-addressing container class.

我不知道在通用框架中任何令人满意的开放寻址实现。开放寻址存在许多问题:

? 有必要区分空缺职位和被占用职位。

? 有必要将哈希表限制为具有默认构造函数的类型,并提前构造每个数组元素,或者维护一个数组,其中一些元素是对象,其他元素是原始内存。

? 开放寻址使冲突管理变得困难:如果您要插入一个哈希码映射到已占用位置的元素,则需要一个策略来告诉您下一步尝试的位置。这是一个已解决的问题,但最著名的解决方案是复杂的。

? 当允许擦除元素时,冲突管理尤其复杂。(有关讨论,请参阅 Knuth。)标准库的容器类应该允许擦除。

? 开放寻址的冲突管理方案倾向于假设一个固定大小的数组,最多可以容纳 N 个元素。标准库的容器类应该能够在插入新元素时根据需要增长,直至达到可用内存的限制。

解决这些问题可能是一个有趣的研究项目,但是,在缺乏 C++ 上下文中的实现经验的情况下,将开放寻址容器类标准化是不合适的。

Specifically for insert-only tables with data small enough to store directly in the buckets, a convenient sentinel value for unused buckets, and a good hash function, a closed hashing approach may be roughly an order of magnitude faster and use dramatically less memory, but that's not general purpose.

特别是对于数据小到可以直接存储在桶中的仅插入表、未使用桶的方便标记值和良好的散列函数,封闭散列方法可能大约快一个数量级并使用更少的内存,但是这不是通用目的。

A full comparison and elaboration of hash table design options and their implications is off topic for S.O. as it's way too broad to address properly here.

对哈希表设计选项及其含义的完整比较和阐述对于 SO 来说是题外话,因为它太宽泛而无法在此处正确解决。