java 不同初始容量和负载因子下HashMap的性能
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/1324064/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Performance of HashMap with different initial capacity and load factor
提问by jW.
Here is my situation. I am using two java.util.HashMap to store some frequently used data in a Java web app running on Tomcat. I know the exact number of entries into each Hashmap. The keys will be strings, and ints respectively.
这是我的情况。我在 Tomcat 上运行的一个 Java Web 应用中使用两个 java.util.HashMap 来存储一些常用数据。我知道每个 HashMap 的确切条目数。键将分别是字符串和整数。
My question is, what is the best way to set the initial capacity and loadfactor?
我的问题是,设置初始容量和负载因子的最佳方法是什么?
Should I set the capacity equal to the number of elements it will have and the load factor to 1.0? I would like the absolute best performance without using too much memory. I am afraid, however, that the table would not fill optimally. With a table of the exact size needed, won't there be key collisions, causing a (usually short) scan to find the correct element?
我是否应该将容量设置为等于它将拥有的元素数量,并将负载因子设置为 1.0?我希望在不占用过多内存的前提下获得绝对最佳的性能。然而,我担心哈希表不会以最佳方式填充。如果表的大小恰好等于所需条目数,会不会发生键冲突,导致需要一次(通常很短的)扫描才能找到正确的元素?
Assuming (and this is a stretch) that the hash function is a simple mod 5 of the integer keys, wouldn't that mean that keys 5, 10, 15 would hit the same bucket and then cause a seek to fill the buckets next to them? Would a larger initial capacity increase performance?
假设(这有些牵强)散列函数只是对整数键简单取模 5,这是否意味着键 5、10、15 会命中同一个桶,然后导致顺序查找去填充它们旁边的桶?更大的初始容量会提高性能吗?
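To make the questioner's hypothetical concrete, here is a sketch using a made-up mod-style bucket function (not HashMap's real hash): a small table clusters multiples of its size into one bucket, while a larger table spreads the same keys out.
为了把提问者的假设说得更具体,下面是一个示意(使用虚构的取模式桶函数,并非 HashMap 的真实哈希):小表会把其大小的倍数聚集到同一个桶,而更大的表会把同样的键分散开。

```java
// Hypothetical naive bucketing, NOT HashMap's actual hash function:
// with 5 buckets, keys that are multiples of 5 all collide.
public class NaiveModBuckets {
    static int bucket(int key, int numBuckets) {
        return key % numBuckets;
    }

    public static void main(String[] args) {
        // Keys 5, 10, 15 all land in bucket 0 of a 5-bucket table...
        System.out.println(bucket(5, 5));   // 0
        System.out.println(bucket(10, 5));  // 0
        System.out.println(bucket(15, 5));  // 0
        // ...but spread across distinct buckets in a 16-bucket table.
        System.out.println(bucket(5, 16));  // 5
        System.out.println(bucket(10, 16)); // 10
        System.out.println(bucket(15, 16)); // 15
    }
}
```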
Also, if there is a better datastructure than a hashmap for this, I am completely open to that as well.
另外,如果有比 HashMap 更适合这种场景的数据结构,我也完全乐意接受。
采纳答案by Avi
In the absence of a perfect hashing function for your data, and assuming that this is really not a micro-optimization of something that really doesn't matter, I would try the following:
如果您的数据没有完美的散列函数,并且假设这真的不是对真正无关紧要的东西的微优化,我会尝试以下操作:
Assume the default load factor (.75) used by HashMap is a good value in most situations. That being the case, you can use it, and set the initial capacity of your HashMap based on your own knowledge of how many items it will hold - set it so that initial-capacity x .75 = number of items (round up).
假设 HashMap 使用的默认负载因子 (.75) 在大多数情况下都是一个不错的值。既然如此,您可以直接使用它,并根据您自己对将要存放多少项的了解来设置 HashMap 的初始容量:使得 初始容量 × 0.75 = 项目数(向上取整)。
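The round-up described above can be sketched as follows; the helper name `capacityFor` is my own, not a JDK method.
上面描述的向上取整可以用如下代码示意;辅助方法名 `capacityFor` 是我自己起的,并非 JDK 方法。

```java
import java.util.HashMap;
import java.util.Map;

public class PresizedMap {
    // Smallest initial capacity at which `expected` entries fit without a
    // resize at the default 0.75 load factor: ceil(expected / 0.75).
    static int capacityFor(int expected) {
        return (int) Math.ceil(expected / 0.75);
    }

    public static void main(String[] args) {
        int expected = 1000;
        Map<String, Integer> counts = new HashMap<>(capacityFor(expected));
        System.out.println(capacityFor(expected)); // 1334
    }
}
```

(If you already depend on Guava, its `Maps.newHashMapWithExpectedSize` performs a similar computation for you.)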
If it were a larger map, in a situation where high-speed lookup was really critical, I would suggest using some sort of trie rather than a hash map. For long strings, in large maps, you can save space, and some time, by using a more string-oriented data structure, such as a trie.
如果是一个更大的映射,并且高速查找确实至关重要,我建议使用某种 trie(字典树)而不是哈希表。对于大型映射中的长字符串,使用更面向字符串的数据结构(例如 trie)可以节省空间,有时还能节省时间。
回答by Stephen C
Assuming that your hash function is "good", the best thing to do is to set the initial size to the expected number of elements, assuming that you can get a good estimate cheaply. It is a good idea to do this because when a HashMap resizes it has to recalculate the hash values for every key in the table.
假设您的哈希函数足够"好",最好的做法是将初始大小设置为预期的元素数量(前提是您能低成本地得到一个不错的估计)。这样做是值得的,因为 HashMap 在扩容时必须为表中的每个键重新计算哈希值。
Leave the load factor at 0.75. The value of 0.75 has been chosen empirically as a good compromise between hash lookup performance and space usage for the primary hash array. As you push the load factor up, the average lookup time will increase significantly.
将负载因子保留为 0.75。0.75 这个值是根据经验选定的,是哈希查找性能与主哈希数组空间占用之间的良好折衷。当您提高负载因子时,平均查找时间将显著增加。
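For reference, both knobs discussed in this answer are exposed by HashMap's two-argument constructor; a minimal sketch with illustrative values follows.
作为参考,这个回答中讨论的两个参数都可以通过 HashMap 的双参数构造函数指定;下面是一个使用示意值的最小示例。

```java
import java.util.HashMap;
import java.util.Map;

public class TwoArgConstructor {
    // Build a map with an explicit (initialCapacity, loadFactor); 0.75f is
    // already the default, passed explicitly only to make the knob visible.
    static Map<String, Integer> presized(int capacity) {
        return new HashMap<>(capacity, 0.75f);
    }

    public static void main(String[] args) {
        Map<String, Integer> m = presized(2048);
        m.put("hits", 42);
        System.out.println(m.get("hits")); // 42
    }
}
```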
If you want to dig into the mathematics of hash table behaviour: Donald Knuth (1998). The Art of Computer Programming. 3: Sorting and Searching (2nd ed.). Addison-Wesley. pp. 513–558. ISBN 0-201-89685-0.
如果您想深入研究哈希表行为背后的数学,可参阅:Donald Knuth (1998)。The Art of Computer Programming, Vol. 3: Sorting and Searching(第二版)。Addison-Wesley。第 513–558 页。ISBN 0-201-89685-0。
回答by Fortyrunner
I find it best not to fiddle around with default settings unless I really really need to.
我发现最好不要摆弄默认设置,除非我真的需要。
Hotspot does a great job of doing optimizations for you.
Hotspot 在为您进行优化方面做得很好。
In any case, I would use a profiler (say, the NetBeans Profiler) to measure the problem first.
无论如何,我都会先使用分析器(比如 NetBeans Profiler)来定位和度量问题。
We routinely store maps with 10000s of elements and if you have a good equals and hashcode implementation (and strings and Integers do!) this will be better than any load changes you may make.
我们经常存储包含数万个元素的映射,而且如果您有良好的 equals 和 hashCode 实现(String 和 Integer 都有!),这会比您对负载因子做的任何调整都更有效。
回答by Cowan
Assuming (and this is a stretch) that the hash function is a simple mod 5 of the integer keys
假设(这是一个延伸)散列函数是整数键的简单模 5
It's not. From HashMap.java:
不是。来自 HashMap.java:
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
I'm not even going to pretend I understand that, but it looks like that's designed to handle just that situation.
我甚至不会假装自己理解这段代码,但它看起来正是为处理这种情况而设计的。
Note also that the number of buckets is always a power of 2, no matter what size you ask for.
另请注意,无论您请求的容量是多少,桶的数量始终是 2 的幂。
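The power-of-two rounding mentioned above can be sketched like this; it is my own illustration of the behaviour, not the JDK's internal code.
上面提到的"向上取到 2 的幂"可以用如下代码示意;这是我对该行为的说明,并非 JDK 的内部实现。

```java
public class PowerOfTwoCapacity {
    // HashMap rounds the requested capacity up to the next power of two;
    // this loop reproduces that behaviour for illustration.
    static int nextPowerOfTwo(int requested) {
        int capacity = 1;
        while (capacity < requested) {
            capacity <<= 1;
        }
        return capacity;
    }

    public static void main(String[] args) {
        System.out.println(nextPowerOfTwo(17));   // 32
        System.out.println(nextPowerOfTwo(1000)); // 1024
        System.out.println(nextPowerOfTwo(1024)); // 1024
    }
}
```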
回答by Tom Hawtin - tackline
Entries are allocated to buckets in a random-like way. So even if you have as many buckets as entries, some of the buckets will have collisions.
条目以近似随机的方式分配到各个桶中。因此,即使桶的数量与条目一样多,某些桶仍然会发生冲突。
If you have more buckets, you'll have fewer collisions. However, more buckets means spreading out in memory and therefore slower. Generally a load factor in the range 0.7-0.8 is roughly optimal, so it is probably not worth changing.
如果您有更多的桶,冲突就会更少。然而,更多的桶意味着数据在内存中更分散,因此反而更慢。通常,0.7–0.8 范围内的负载因子大致是最佳的,所以可能不值得修改。
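The bucket/collision trade-off described above is easy to see empirically; here is a small simulation with uniformly random keys and a fixed seed, my own sketch rather than real HashMap internals.
上面描述的桶数与冲突之间的权衡很容易通过实验观察;下面是一个使用均匀随机键和固定种子的小型模拟,这是我自己的示意,并非 HashMap 的真实内部实现。

```java
import java.util.Random;

public class CollisionCount {
    // Count how many of `n` uniformly random keys land in an
    // already-occupied bucket of a `buckets`-slot table.
    static int collisions(int n, int buckets) {
        boolean[] occupied = new boolean[buckets];
        Random rnd = new Random(42); // fixed seed for a repeatable run
        int collided = 0;
        for (int i = 0; i < n; i++) {
            int b = (rnd.nextInt() & 0x7fffffff) % buckets;
            if (occupied[b]) {
                collided++;
            }
            occupied[b] = true;
        }
        return collided;
    }

    public static void main(String[] args) {
        // Even with as many buckets as entries, some buckets collide;
        // quadrupling the buckets cuts collisions but costs memory.
        System.out.println(collisions(1024, 1024));
        System.out.println(collisions(1024, 4096));
    }
}
```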
As ever, it's probably worth profiling before you get hung up on microtuning these things.
一如既往,在纠结于这类微调之前,先做性能分析可能更值得。

