Java hashmap 真的是 O(1) 吗?

Disclaimer: this page is a translated mirror of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same CC BY-SA terms, link the original, and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/1055243/

Date: 2020-08-11 22:52:04  Source: igfitidea

Is a Java hashmap really O(1)?

Tags: java, hashmap, big-o, time-complexity

Asked by paxdiablo

I've seen some interesting claims on SO re Java hashmaps and their O(1) lookup time. Can someone explain why this is so? Unless these hashmaps are vastly different from any of the hashing algorithms I was brought up on, there must always exist a dataset that contains collisions.

In which case, the lookup would be O(n) rather than O(1).

Can someone explain whether they are O(1) and, if so, how they achieve this?

Accepted answer by SingleNegationElimination

A particular feature of a HashMap is that, unlike, say, balanced trees, its behavior is probabilistic. In these cases it's usually most helpful to talk about complexity in terms of the probability of a worst-case event occurring. For a hash map, that is of course a collision, with respect to how full the map happens to be. A collision is pretty easy to estimate.

p(collision) = n / capacity

So a hash map with even a modest number of elements is pretty likely to experience at least one collision. Big O notation allows us to do something more compelling. Observe that for any arbitrary, fixed constant k:

O(n) = O(k * n)

We can use this feature to improve the performance of the hash map. We could instead think about the probability of at most 2 collisions.

p(2 collisions) = (n / capacity)^2

This is much lower. Since the cost of handling one extra collision is irrelevant to Big O performance, we've found a way to improve performance without actually changing the algorithm! We can generalize this to

p(k collisions) = (n / capacity)^k

And now we can disregard some arbitrary number of collisions and end up with a vanishingly tiny likelihood of more collisions than we are accounting for. You could get the probability down to an arbitrarily tiny level by choosing the correct k, all without altering the actual implementation of the algorithm.

We talk about this by saying that the hash map has O(1) access with high probability.
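
The falloff described above is easy to see numerically. A minimal sketch, assuming illustrative figures of n = 100 and capacity = 1000 (the class name and numbers are mine, not from the answer):

```java
// Evaluates the answer's approximation p(k collisions) = (n / capacity)^k
// for a few values of k, to show how fast the probability vanishes.
public class CollisionOdds {
    public static double pCollisions(int n, int capacity, int k) {
        return Math.pow((double) n / capacity, k);
    }

    public static void main(String[] args) {
        int n = 100, capacity = 1000; // illustrative: a load factor of 0.1
        for (int k = 1; k <= 4; k++) {
            System.out.printf("p(%d collisions) = %.6f%n", k, pCollisions(n, capacity, k));
        }
        // Each extra collision we tolerate multiplies the probability by
        // n / capacity, so the chance of exceeding any fixed k becomes negligible.
    }
}
```

Choosing a larger k doesn't change the map at all; it only changes how many collisions the analysis is willing to absorb before calling a lookup "slow".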

Answer by FogleBird

In Java, HashMap works by using hashCode to locate a bucket. Each bucket is a list of items residing in that bucket. The items are scanned, using equals for comparison. When adding items, the HashMap is resized once a certain load percentage is reached.

So, sometimes it will have to compare against a few items, but generally it's much closer to O(1) than O(n). For practical purposes, that's all you should need to know.
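
The bucket-plus-list structure FogleBird describes can be sketched in a few lines. This is a toy illustration of the idea only; the names are hypothetical, and it omits what Java's real HashMap also does (resizing at the load threshold and, since Java 8, treeifying long chains):

```java
import java.util.LinkedList;
import java.util.List;

public class TinyChainedMap<K, V> {
    private static class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    @SuppressWarnings("unchecked")
    private final List<Entry<K, V>>[] buckets = new List[16]; // fixed size: no resizing in this toy

    private int indexFor(Object key) {
        return (key.hashCode() & 0x7fffffff) % buckets.length; // hashCode picks the bucket
    }

    public void put(K key, V value) {
        int i = indexFor(key);
        if (buckets[i] == null) buckets[i] = new LinkedList<>();
        for (Entry<K, V> e : buckets[i]) {
            if (e.key.equals(key)) { e.value = value; return; } // replace existing mapping
        }
        buckets[i].add(new Entry<>(key, value));
    }

    public V get(K key) {
        int i = indexFor(key);             // constant-time step: hash to a bucket
        if (buckets[i] == null) return null;
        for (Entry<K, V> e : buckets[i]) { // linear step: scan the chain with equals
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }
}
```

Lookup cost here is one hash plus a scan of a single bucket's list, which is why the average stays near O(1) while the worst case (everything landing in one bucket) degrades to O(n).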

Answer by Grey Panther

Of course the performance of the hashmap will depend on the quality of the hashCode() function for the given object. However, if the function is implemented such that the possibility of collisions is very low, it will have very good performance (this is not strictly O(1) in every possible case, but it is in most cases).

For example, the default implementation in the Oracle JRE uses a random number (which is stored in the object instance so that it doesn't change - but this also disables biased locking, though that's another discussion), so the chance of collisions is very low.

Answer by Tobias Svensson

This basically goes for most hash table implementations in most programming languages, as the algorithm itself doesn't really change.

If there are no collisions present in the table, you only have to do a single look-up, therefore the running time is O(1). If there are collisions present, you have to do more than one look-up, which drags the performance down towards O(n).

Answer by Daniel James

Remember that O(1) does not mean that each lookup only examines a single item - it means that the average number of items checked remains constant w.r.t. the number of items in the container. So if it takes on average 4 comparisons to find an item in a container with 100 items, it should also take an average of 4 comparisons to find an item in a container with 10,000 items, and for any other number of items (there's always a bit of variance, especially around the points at which the hash table rehashes, and when there's a very small number of items).

So collisions don't prevent the container from having O(1) operations, as long as the average number of keys per bucket remains within a fixed bound.

Answer by Konrad Rudolph

You seem to mix up worst-case behaviour with average-case (expected) runtime. The former is indeed O(n) for hash tables in general (i.e. not using perfect hashing), but this is rarely relevant in practice.

Any dependable hash table implementation, coupled with a half decent hash, has a retrieval performance of O(1) with a very small factor (2, in fact) in the expected case, within a very narrow margin of variance.

Answer by Nizar Grira

It depends on the algorithm you choose to avoid collisions. If your implementation uses separate chaining, then the worst-case scenario happens when every data element is hashed to the same value (a poor choice of hash function, for example). In that case, data lookup is no different from a linear search on a linked list, i.e. O(n). However, the probability of that happening is negligible, and the best and average cases for lookup remain constant, i.e. O(1).
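
That worst case is easy to reproduce: give a key type a constant hashCode and every entry lands in one bucket. The class below is a hypothetical illustration; note that since Java 8, HashMap converts long collision chains into red-black trees, so its worst case there is O(log n) rather than O(n).

```java
import java.util.HashMap;
import java.util.Map;

public class BadKey {
    private final int id;

    BadKey(int id) { this.id = id; }

    @Override
    public boolean equals(Object o) {
        return o instanceof BadKey && ((BadKey) o).id == id;
    }

    @Override
    public int hashCode() { return 42; } // every instance collides

    public static void main(String[] args) {
        Map<BadKey, Integer> map = new HashMap<>();
        for (int i = 0; i < 1000; i++) {
            map.put(new BadKey(i), i);
        }
        // All 1000 entries share one bucket, so this lookup scans the
        // colliding entries with equals instead of finishing in one step.
        System.out.println(map.get(new BadKey(999))); // prints 999
    }
}
```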

Answer by jtb

We've established that the standard description of hash table lookups being O(1) refers to the average-case expected time, not the strict worst-case performance. For a hash table resolving collisions with chaining (like Java's HashMap) this is technically O(1+α) with a good hash function, where α is the table's load factor. This is still constant as long as the number of objects you're storing is no more than a constant factor larger than the table size.
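
The load factor α mentioned above is exposed directly in Java's API: HashMap's two-argument constructor takes an initial capacity and a load factor, and the table resizes once the entry count exceeds capacity × load factor. A minimal sketch (the values are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class LoadFactorDemo {
    public static void main(String[] args) {
        // The default is equivalent to new HashMap<>(16, 0.75f); here we trade
        // memory for shorter chains by resizing earlier, at a load factor of 0.5.
        Map<String, Integer> map = new HashMap<>(64, 0.5f);
        map.put("answer", 42);
        System.out.println(map.get("answer")); // prints 42
    }
}
```

Resizing to keep α bounded is exactly what keeps the O(1+α) figure effectively constant.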

It's also been explained that, strictly speaking, it's possible to construct input that requires O(n) lookups for any deterministic hash function. But it's also interesting to consider the worst-case expected time, which is different from the average search time. Using chaining this is O(1 + the length of the longest chain), for example Θ(log n / log log n) when α = 1.

If you're interested in theoretical ways to achieve constant expected worst-case lookup time, you can read about dynamic perfect hashing, which resolves collisions recursively with another hash table!

Answer by I. J. Kennedy

If the number of buckets (call it b) is held constant (the usual case), then lookup is actually O(n).
As n gets large, the number of elements in each bucket averages n/b. If collision resolution is done in one of the usual ways (linked list for example), then lookup is O(n/b) = O(n).

The O notation is about what happens when n gets larger and larger. It can be misleading when applied to certain algorithms, and hash tables are a case in point. We choose the number of buckets based on how many elements we're expecting to deal with. When n is about the same size as b, then lookup is roughly constant-time, but we can't call it O(1) because O is defined in terms of a limit as n → ∞.

Answer by Antti Huima

It is O(1) only if your hashing function is very good. The Java hash table implementation does not protect against bad hash functions.

Whether you need to grow the table when you add items is not relevant to the question, because it is about lookup time.
