java 选择具有预期数量的唯一值和插入的 HashSet 的初始容量

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/564899/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-10-29 12:52:57  来源:igfitidea点击:

Choosing the initial capacity of a HashSet with an expected number of unique values and insertions

javaset

提问by Clarkey

Ok, here's my situation:

好的,这是我的情况:

I have an Array of States, which may contain duplicates. To get rid of the duplicates, I can add them all to a Set.

我有一个状态数组,其中可能包含重复项。为了摆脱重复,我可以将它们全部添加到一个集合中。

However when I create the Set, it wants the initial capacity and load factor to be defined, but what should they be set to?

但是,当我创建 Set 时,它希望定义初始容量和负载因子,但它们应该设置为什么?

From googling, I have come up with:

从谷歌搜索,我想出了:

String[] allStates = getAllStates();
Set<String> uniqueStates = new HashSet<String>(allStates.length, 0.75);

The problem with this, is that allStates can contain anwhere between 1 and 5000 states. So the Set will have capacity of over 5000, but only containing at most 50.

问题在于 allStates 可以包含 1 到 5000 个状态。所以 Set 的容量将超过 5000,但最多只能包含 50。

So alternatively set the max size of the Set could be set to be the max number of states, and the load factor to be 1.

因此,也可以将 Set 的最大大小设置为最大状态数,并将负载因子设置为 1。

I guess my questions really are:

我想我的问题真的是:

  • What should you set the initial capacity to be when you don't know how many items are to be in the Set?
  • Does it really matter what it gets set to when the most it could contain is 50?
  • Should I even be worrying about it?
  • 当您不知道 Set 中有多少项目时,您应该将初始容量设置为多少?
  • 当它最多可以包含 50 时,它被设置为什么真的很重要吗?
  • 我应该担心吗?

采纳答案by Zarkonnen

Assuming that you know there won't be more than 50 states (do you mean US States?), the

假设您知道不会超过 50 个州(您是指美国吗?),则

Set<String> uniqueStates = new HashSet<String>(allStates.length, 0.75);

quoted is definitely wrong. I'd suggest you go for an initial capacity of 50 / 0.75 = 67, or perhaps 68 to be on the safe side.

引用肯定是错误的。为了安全起见,我建议您选择 50 / 0.75 = 67 或 68 的初始容量。

I also feel the need to point out that you're probably overthinking this intensely. Resizing the arraylist twice from 16 up to 64 isn't going to give you a noticeable performance hit unless this is right in the most performance-critical part of the program.

我也觉得有必要指出你可能过度思考了这一点。将数组列表的大小从 16 次调整为 64 次不会给您带来明显的性能损失,除非这在程序中对性能最关键的部分是正确的。

So the best answer is probably to use:

所以最好的答案可能是使用:

new HashSet<String>();

That way, you won't come back a year later and puzzle over why you chose such strange constructor arguments.

这样,您就不会在一年后回来并为为什么选择如此奇怪的构造函数参数而感到困惑。

回答by starblue

Use the constructorwhere you don't need to specify these values, then reasonable defaults are chosen.

在不需要指定这些值的地方使用构造函数,然后选择合理的默认值。

回答by Erick G. Hagstrom

First, I'm going to say that in your case you're definitely overthinking it. However, there are probably situations where one would want to get it right. So here's what I understand:

首先,我要说的是,在您的情况下,您肯定是想多了。然而,在某些情况下,人们可能想要把它做对。所以这是我的理解:

1) The number of items you can hold in your HashSet = initial capacity x load factor. So if you want to be able to hold n items, you need to do what Zarkonnendid and divide n by the load factor.

1) 您可以在 HashSet 中保存的项目数 = 初始容量 x 负载因子。所以如果你想能够容纳n个项目,你需要做Zarkonnen所做的并将n除以负载因子。

2) Under the covers, the initial capacity is rounded up to a power of two per Oracle tutorial.

2) 在幕后,每个 Oracle 教程将初始容量四舍五入为 2 的幂。

3) Load factor should be no more than .80 to prevent excessive collisions, as noted by Tom Hawtin - tackline.

3) 负载系数不应超过 0.80,以防止过度碰撞,如Tom Hawtin-tackline所指出的。

If you just accept the default values (initial capacity = 16, load factor = .75), you'll end up doubling your set in size 3 times. (Initial max size = 12, first increase makes capacity 32 and max size 24 (32 * .75), second increase makes capacity 64 and max size 48 (64 * .75), third increase makes capacity 128 and max size 96 (128 * .75).)

如果您只接受默认值(初始容量 = 16,负载因子 = .75),您最终会将您的集合的大小增加 3 倍。(初始最大大小 = 12,第一次增加使容量为 32,最大大小为 24 (32 * .75),第二次增加使容量为 64,最大大小为 48 (64 * .75),第三次增加使容量为 128,最大大小为 96 (128 * .75))

To get your max size closer to 50, but keep the set as small as possible, consider an initial capacity of 64 (a power of two) and a load factor of .79 or more. 64 * .79 = 50.56, so you can get all 50 states in there. Specifying 32 < initial capacity < 64 will result in initial capacity being rounded up to 64, so that's the same as specifying 64 up front. Specifying initial capacity <= 32 will result in a size increase. Using a load factor < .79 will also result in a size increase unless your initial capacity > 64.

为了让您的最大尺寸接近 50,但保持尽可能小,请考虑初始容量为 64(2 的幂)和 0.79 或更大的负载因子。64 * .79 = 50.56,所以你可以得到所有 50 个状态。指定 32 < 初始容量 < 64 将导致初始容量向上取整为 64,因此这与预先指定 64 相同。指定初始容量 <= 32 将导致大小增加。使用负载因子 < .79 也会导致大小增加,除非您的初始容量 > 64。

So my recommendation is to specify initial capacity = 64 and load factor = .79.

所以我的建议是指定初始容量 = 64 和负载因子 = .79。

回答by Tom Hawtin - tackline

The safe bet is go for a size that is too small.

安全的赌注是选择太小的尺寸。

Because resizing is ameliorated by an exponential growth algorithm (see the stackoverflow podcast from a few weeks back), going small will never cost you that much. If you have lots of sets (lucky you), then it will matter to performance if they are oversize.

由于指数增长算法可以改善调整大小(请参阅几周前的 stackoverflow 播客),因此规模变小永远不会花费太多。如果你有很多套(幸运的是你),那么如果它们过大就会影响性能。

Load factor is a tricky one. I suggest leaving it at the default. I understand: Below about 0.70f you are making the array too large and therefore slower. Above 0.80f and you'll start getting to many key clashes. Presumably probing algorithms will require lower load factors than bucket algorithms.

负载因子是一个棘手的问题。我建议将其保留为默认值。我理解:低于 0.70f 时,您使数组过大,因此速度较慢。高于 0.80f,您将开始遇到许多键冲突。据推测,探测算法需要比桶算法更低的负载因子。

Also note that the "initial capacity" means something slightly different than it appears most people think. It refers to the number of entries in the array. To get the exact capacity for a number of elements, divide by the desired load factor (and round appropriately).

另请注意,“初始容量”的含义与大多数人认为的略有不同。它指的是数组中的条目数。要获得多个元素的确切容量,请除以所需的负载因子(并适当舍入)。

回答by tehvan

Make a good guess. There is no hard rule. If you know there's likely to be say 10-20 states, i'd start off with that number (20).

好好猜一猜。没有硬性规定。如果您知道可能有 10-20 个州,我将从该数字 (20) 开始。

回答by Jeroen van Bergen

I second Zarkonnen. Your last question is the most important one. If this happens to occur in a hotspot of your application it might be worth the effort to look at it and try to optimise, otherwise CPU cycles are cheaper than burning up your own neurons.

我第二个扎科宁。你的最后一个问题是最重要的。如果这恰好发生在您的应用程序的热点中,可能值得努力查看并尝试优化,否则 CPU 周期比烧毁您自己的神经元要便宜。

回答by Paul Draper

If you were to optimize this -- and it may be appropriate to do that -- some of your decision will depend on how many duplicates you expect the array to have.

如果您要对此进行优化——这样做可能是合适的——您的某些决定将取决于您希望数组具有多少重复项。

  • If there are very many duplicates, you will want a smaller initial capacity. Large, sparse hash tables are bad when iterating.

  • If there are not expected to be very many duplicates, you will want an initial capacity such that the entire array could fit without resizing.

  • 如果有很多重复项,您将需要较小的初始容量。迭代时,大的、稀疏的哈希表是不好的。

  • 如果预计不会有很多重复项,您将需要一个初始容量,以便整个阵列可以适应而无需调整大小。

My guess is that you want the latter, but this is something worth considering if you pursue this.

我的猜测是你想要后者,但如果你追求这个,这是值得考虑的。