.net HashSet 与列表性能

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/150750/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-03 10:09:24  来源:igfitidea点击:

HashSet vs. List performance

.netperformancecollectionslisthash

提问by Michael Damatov

It's clear that a search performance of the generic HashSet<T>class is higher than of the generic List<T>class. Just compare the hash-based key with the linear approach in the List<T>class.

很明显,泛型HashSet<T>类的搜索性能高于泛型List<T>类。只需将基于哈希的密钥与类中的线性方法进行比较List<T>

However calculating a hash key may itself take some CPU cycles, so for a small amount of items the linear search can be a real alternative to the HashSet<T>.

然而,计算散列键本身可能需要一些 CPU 周期,因此对于少量项目,线性搜索可以真正替代HashSet<T>.

My question: where is the break-even?

我的问题:盈亏平衡点在哪里?

To simplify the scenario (and to be fair) let's assume that the List<T>class uses the element's Equals()method to identify an item.

为了简化场景(公平起见),我们假设List<T>该类使用元素的Equals()方法来标识一个项目。

回答by innominate227

A lot of people are saying that once you get to the size where speed is actually a concern that HashSet<T>will always beat List<T>, but that depends on what you are doing.

很多人都说,一旦你达到速度实际上是一个问题的规模,它HashSet<T>就会永远击败List<T>,但这取决于你在做什么。

Let's say you have a List<T>that will only ever have on average 5 items in it. Over a large number of cycles, if a single item is added or removed each cycle, you may well be better off using a List<T>.

假设您有一个List<T>平均只有 5 个项目的产品。在大量循环中,如果每个循环添加或删除单个项目,则最好使用List<T>.

I did a test for this on my machine, and, well, it has to be very very small to get an advantage from List<T>. For a list of short strings, the advantage went away after size 5, for objects after size 20.

我在我的机器上对此进行了测试,而且,它必须非常非常小才能从List<T>. 对于短字符串列表,在大小为 5 之后,对于大小为 20 之后的对象,优势就消失了。

1 item LIST strs time: 617ms
1 item HASHSET strs time: 1332ms

2 item LIST strs time: 781ms
2 item HASHSET strs time: 1354ms

3 item LIST strs time: 950ms
3 item HASHSET strs time: 1405ms

4 item LIST strs time: 1126ms
4 item HASHSET strs time: 1441ms

5 item LIST strs time: 1370ms
5 item HASHSET strs time: 1452ms

6 item LIST strs time: 1481ms
6 item HASHSET strs time: 1418ms

7 item LIST strs time: 1581ms
7 item HASHSET strs time: 1464ms

8 item LIST strs time: 1726ms
8 item HASHSET strs time: 1398ms

9 item LIST strs time: 1901ms
9 item HASHSET strs time: 1433ms

1 item LIST objs time: 614ms
1 item HASHSET objs time: 1993ms

4 item LIST objs time: 837ms
4 item HASHSET objs time: 1914ms

7 item LIST objs time: 1070ms
7 item HASHSET objs time: 1900ms

10 item LIST objs time: 1267ms
10 item HASHSET objs time: 1904ms

13 item LIST objs time: 1494ms
13 item HASHSET objs time: 1893ms

16 item LIST objs time: 1695ms
16 item HASHSET objs time: 1879ms

19 item LIST objs time: 1902ms
19 item HASHSET objs time: 1950ms

22 item LIST objs time: 2136ms
22 item HASHSET objs time: 1893ms

25 item LIST objs time: 2357ms
25 item HASHSET objs time: 1826ms

28 item LIST objs time: 2555ms
28 item HASHSET objs time: 1865ms

31 item LIST objs time: 2755ms
31 item HASHSET objs time: 1963ms

34 item LIST objs time: 3025ms
34 item HASHSET objs time: 1874ms

37 item LIST objs time: 3195ms
37 item HASHSET objs time: 1958ms

40 item LIST objs time: 3401ms
40 item HASHSET objs time: 1855ms

43 item LIST objs time: 3618ms
43 item HASHSET objs time: 1869ms

46 item LIST objs time: 3883ms
46 item HASHSET objs time: 2046ms

49 item LIST objs time: 4218ms
49 item HASHSET objs time: 1873ms

Here is that data displayed as a graph:

这是显示为图表的数据:

enter image description here

在此处输入图片说明

Here's the code:

这是代码:

static void Main(string[] args)
{
    int times = 10000000;


    for (int listSize = 1; listSize < 10; listSize++)
    {
        List<string> list = new List<string>();
        HashSet<string> hashset = new HashSet<string>();

        for (int i = 0; i < listSize; i++)
        {
            list.Add("string" + i.ToString());
            hashset.Add("string" + i.ToString());
        }

        Stopwatch timer = new Stopwatch();
        timer.Start();
        for (int i = 0; i < times; i++)
        {
            list.Remove("string0");
            list.Add("string0");
        }
        timer.Stop();
        Console.WriteLine(listSize.ToString() + " item LIST strs time: " + timer.ElapsedMilliseconds.ToString() + "ms");


        timer = new Stopwatch();
        timer.Start();
        for (int i = 0; i < times; i++)
        {
            hashset.Remove("string0");
            hashset.Add("string0");
        }
        timer.Stop();
        Console.WriteLine(listSize.ToString() + " item HASHSET strs time: " + timer.ElapsedMilliseconds.ToString() + "ms");
        Console.WriteLine();
    }


    for (int listSize = 1; listSize < 50; listSize+=3)
    {
        List<object> list = new List<object>();
        HashSet<object> hashset = new HashSet<object>();

        for (int i = 0; i < listSize; i++)
        {
            list.Add(new object());
            hashset.Add(new object());
        }

        object objToAddRem = list[0];

        Stopwatch timer = new Stopwatch();
        timer.Start();
        for (int i = 0; i < times; i++)
        {
            list.Remove(objToAddRem);
            list.Add(objToAddRem);
        }
        timer.Stop();
        Console.WriteLine(listSize.ToString() + " item LIST objs time: " + timer.ElapsedMilliseconds.ToString() + "ms");



        timer = new Stopwatch();
        timer.Start();
        for (int i = 0; i < times; i++)
        {
            hashset.Remove(objToAddRem);
            hashset.Add(objToAddRem);
        }
        timer.Stop();
        Console.WriteLine(listSize.ToString() + " item HASHSET objs time: " + timer.ElapsedMilliseconds.ToString() + "ms");
        Console.WriteLine();
    }

    Console.ReadLine();
}

回答by Eloff

You're looking at this wrong. Yes a linear search of a List will beat a HashSet for a small number of items. But the performance difference usually doesn't matter for collections that small. It's generally the large collections you have to worry about, and that's where you think in terms of Big-O. However, if you've measured a real bottleneck on HashSet performance, then you can try to create a hybrid List/HashSet, but you'll do that by conducting lots of empirical performance tests - not asking questions on SO.

你看错了。是的,对于少量项目,List 的线性搜索将击败 HashSet。但是对于这么小的集合,性能差异通常无关紧要。通常,您必须担心大型集合,这就是您在 Big-O 方面考虑的地方。但是,如果您已经测量到 HashSet 性能的真正瓶颈,那么您可以尝试创建一个混合 List/HashSet,但您将通过进行大量经验性能测试来实现这一点 - 而不是就 SO 提出问题。

回答by nawfal

It's essentially pointless to compare two structures for performancethat behave differently. Use the structure that conveys the intent. Even if you say your List<T>wouldn't have duplicates and iteration order doesn't matter making it comparable to a HashSet<T>, its still a poor choice to use List<T>because its relatively less fault tolerant.

比较两种行为不同的结构的性能本质上是没有意义的。使用传达意图的结构。即使您说您List<T>不会有重复项并且迭代顺序与 a 相比并不重要,但HashSet<T>它仍然是一个糟糕的选择,List<T>因为它的容错性相对较低。

That said, I will inspect some other aspectsof performance,

也就是说,我将检查性能的其他一些方面

+------------+--------+-------------+-----------+----------+----------+-----------+
| Collection | Random | Containment | Insertion | Addition |  Removal | Memory    |
|            | access |             |           |          |          |           |
+------------+--------+-------------+-----------+----------+----------+-----------+
| List<T>    | O(1)   | O(n)        | O(n)      | O(1)*    | O(n)     | Lesser    |
| HashSet<T> | O(n)   | O(1)        | n/a       | O(1)     | O(1)     | Greater** |
+------------+--------+-------------+-----------+----------+----------+-----------+
  • Even though addition is O(1) in both cases, it will be relatively slower in HashSet since it involves cost of precomputing hash code before storing it.

  • The superior scalability of HashSet has a memory cost. Every entry is stored as a new object along with its hash code. This articlemight give you an idea.

  • 尽管在这两种情况下加法都是 O(1),但在 HashSet 中它会相对较慢,因为它涉及在存储之前预先计算哈希码的成本。

  • HashSet 优越的可扩展性是有内存成本的。每个条目都存储为一个新对象及其哈希码。这篇文章可能会给你一个想法。

回答by core

Whether to use a HashSet<> or List<> comes down to how you need to access your collection. If you need to guarantee the order of items, use a List. If you don't, use a HashSet. Let Microsoft worry about the implementation of their hashing algorithms and objects.

是使用 HashSet<> 还是 List<> 取决于您需要如何访问您的集合。如果您需要保证项目的顺序,请使用 List。如果不这样做,请使用 HashSet。让微软担心他们的散列算法和对象的实现。

A HashSet will access items without having to enumerate the collection (complexity of O(1)or near it), and because a List guarantees order, unlike a HashSet, some items will have to be enumerated (complexity of O(n)).

HashSet 将访问项目而不必枚举集合(O(1)或接近它的复杂度),并且由于 List 保证顺序,与 HashSet 不同,某些项目将必须被枚举(O(n) 的复杂度)。

回答by drzaus

Just thought I'd chime in with some benchmarks for different scenarios to illustrate the previous answers:

只是想我会针对不同场景加入一些基准来说明以前的答案:

  1. A few (12 - 20) small strings (length between 5 and 10 characters)
  2. Many (~10K) small strings
  3. A few long strings (length between 200 and 1000 characters)
  4. Many (~5K) long strings
  5. A few integers
  6. Many (~10K) integers
  1. 一些 (12 - 20) 个小字符串(长度在 5 到 10 个字符之间)
  2. 许多(~10K)小字符串
  3. 几个长字符串(长度在 200 到 1000 个字符之间)
  4. 许多(~5K)长字符串
  5. 几个整数
  6. 许多(~10K)整数

And for each scenario, looking up values which appear:

对于每个场景,查找出现的值:

  1. In the beginning of the list ("start", index 0)
  2. Near the beginning of the list ("early", index 1)
  3. In the middle of the list ("middle", index count/2)
  4. Near the end of the list ("late", index count-2)
  5. At the end of the list ("end", index count-1)
  1. 在列表的开头(“开始”,索引 0)
  2. 靠近列表的开头(“早期”,索引 1)
  3. 在列表的中间(“中间”,索引计数/2)
  4. 接近列表末尾(“迟到”,索引计数为 2)
  5. 在列表的末尾(“结束”,索引计数-1)

Before each scenario I generated randomly sized lists of random strings, and then fed each list to a hashset. Each scenario ran 10,000 times, essentially:

在每个场景之前,我生成随机大小的随机字符串列表,然后将每个列表输入到一个哈希集。每个场景运行 10,000 次,本质上是:

(test pseudocode)

(测试伪代码)

stopwatch.start
for X times
    exists = list.Contains(lookup);
stopwatch.stop

stopwatch.start
for X times
    exists = hashset.Contains(lookup);
stopwatch.stop

Sample Output

样本输出

Tested on Windows 7, 12GB Ram, 64 bit, Xeon 2.8GHz

在 Windows 7、12GB 内存、64 位、至强 2.8GHz 上测试

---------- Testing few small strings ------------
Sample items: (16 total)
vgnwaloqf diwfpxbv tdcdc grfch icsjwk
...

Benchmarks:
1: hashset: late -- 100.00 % -- [Elapsed: 0.0018398 sec]
2: hashset: middle -- 104.19 % -- [Elapsed: 0.0019169 sec]
3: hashset: end -- 108.21 % -- [Elapsed: 0.0019908 sec]
4: list: early -- 144.62 % -- [Elapsed: 0.0026607 sec]
5: hashset: start -- 174.32 % -- [Elapsed: 0.0032071 sec]
6: list: middle -- 187.72 % -- [Elapsed: 0.0034536 sec]
7: list: late -- 192.66 % -- [Elapsed: 0.0035446 sec]
8: list: end -- 215.42 % -- [Elapsed: 0.0039633 sec]
9: hashset: early -- 217.95 % -- [Elapsed: 0.0040098 sec]
10: list: start -- 576.55 % -- [Elapsed: 0.0106073 sec]


---------- Testing many small strings ------------
Sample items: (10346 total)
dmnowa yshtrxorj vthjk okrxegip vwpoltck
...

Benchmarks:
1: hashset: end -- 100.00 % -- [Elapsed: 0.0017443 sec]
2: hashset: late -- 102.91 % -- [Elapsed: 0.0017951 sec]
3: hashset: middle -- 106.23 % -- [Elapsed: 0.0018529 sec]
4: list: early -- 107.49 % -- [Elapsed: 0.0018749 sec]
5: list: start -- 126.23 % -- [Elapsed: 0.0022018 sec]
6: hashset: early -- 134.11 % -- [Elapsed: 0.0023393 sec]
7: hashset: start -- 372.09 % -- [Elapsed: 0.0064903 sec]
8: list: middle -- 48,593.79 % -- [Elapsed: 0.8476214 sec]
9: list: end -- 99,020.73 % -- [Elapsed: 1.7272186 sec]
10: list: late -- 99,089.36 % -- [Elapsed: 1.7284155 sec]


---------- Testing few long strings ------------
Sample items: (19 total)
hidfymjyjtffcjmlcaoivbylakmqgoiowbgxpyhnrreodxyleehkhsofjqenyrrtlphbcnvdrbqdvji...
...

Benchmarks:
1: list: early -- 100.00 % -- [Elapsed: 0.0018266 sec]
2: list: start -- 115.76 % -- [Elapsed: 0.0021144 sec]
3: list: middle -- 143.44 % -- [Elapsed: 0.0026201 sec]
4: list: late -- 190.05 % -- [Elapsed: 0.0034715 sec]
5: list: end -- 193.78 % -- [Elapsed: 0.0035395 sec]
6: hashset: early -- 215.00 % -- [Elapsed: 0.0039271 sec]
7: hashset: end -- 248.47 % -- [Elapsed: 0.0045386 sec]
8: hashset: start -- 298.04 % -- [Elapsed: 0.005444 sec]
9: hashset: middle -- 325.63 % -- [Elapsed: 0.005948 sec]
10: hashset: late -- 431.62 % -- [Elapsed: 0.0078839 sec]


---------- Testing many long strings ------------
Sample items: (5000 total)
yrpjccgxjbketcpmnvyqvghhlnjblhgimybdygumtijtrwaromwrajlsjhxoselbucqualmhbmwnvnpnm
...

Benchmarks:
1: list: early -- 100.00 % -- [Elapsed: 0.0016211 sec]
2: list: start -- 132.73 % -- [Elapsed: 0.0021517 sec]
3: hashset: start -- 231.26 % -- [Elapsed: 0.003749 sec]
4: hashset: end -- 368.74 % -- [Elapsed: 0.0059776 sec]
5: hashset: middle -- 385.50 % -- [Elapsed: 0.0062493 sec]
6: hashset: late -- 406.23 % -- [Elapsed: 0.0065854 sec]
7: hashset: early -- 421.34 % -- [Elapsed: 0.0068304 sec]
8: list: middle -- 18,619.12 % -- [Elapsed: 0.3018345 sec]
9: list: end -- 40,942.82 % -- [Elapsed: 0.663724 sec]
10: list: late -- 41,188.19 % -- [Elapsed: 0.6677017 sec]


---------- Testing few ints ------------
Sample items: (16 total)
7266092 60668895 159021363 216428460 28007724
...

Benchmarks:
1: hashset: early -- 100.00 % -- [Elapsed: 0.0016211 sec]
2: hashset: end -- 100.45 % -- [Elapsed: 0.0016284 sec]
3: list: early -- 101.83 % -- [Elapsed: 0.0016507 sec]
4: hashset: late -- 108.95 % -- [Elapsed: 0.0017662 sec]
5: hashset: middle -- 112.29 % -- [Elapsed: 0.0018204 sec]
6: hashset: start -- 120.33 % -- [Elapsed: 0.0019506 sec]
7: list: late -- 134.45 % -- [Elapsed: 0.0021795 sec]
8: list: start -- 136.43 % -- [Elapsed: 0.0022117 sec]
9: list: end -- 169.77 % -- [Elapsed: 0.0027522 sec]
10: list: middle -- 237.94 % -- [Elapsed: 0.0038573 sec]


---------- Testing many ints ------------
Sample items: (10357 total)
370826556 569127161 101235820 792075135 270823009
...

Benchmarks:
1: list: early -- 100.00 % -- [Elapsed: 0.0015132 sec]
2: hashset: end -- 101.79 % -- [Elapsed: 0.0015403 sec]
3: hashset: early -- 102.08 % -- [Elapsed: 0.0015446 sec]
4: hashset: middle -- 103.21 % -- [Elapsed: 0.0015618 sec]
5: hashset: late -- 104.26 % -- [Elapsed: 0.0015776 sec]
6: list: start -- 126.78 % -- [Elapsed: 0.0019184 sec]
7: hashset: start -- 130.91 % -- [Elapsed: 0.0019809 sec]
8: list: middle -- 16,497.89 % -- [Elapsed: 0.2496461 sec]
9: list: end -- 32,715.52 % -- [Elapsed: 0.4950512 sec]
10: list: late -- 33,698.87 % -- [Elapsed: 0.5099313 sec]

回答by Walden Leverich

The breakeven will depend on the cost of computing the hash. Hash computations can be trivial, or not... :-) There is always the System.Collections.Specialized.HybridDictionary class to help you not have to worry about the breakeven point.

盈亏平衡将取决于计算散列的成本。哈希计算可以是微不足道的,也可以不是... :-) 总有 System.Collections.Specialized.HybridDictionary 类可以帮助您不必担心盈亏平衡点。

回答by Robert P

The answer, as always, is "It depends". I assume from the tags you're talking about C#.

答案一如既往,是“这取决于”。我从标签中假设您正在谈论 C#。

Your best bet is to determine

你最好的办法是确定

  1. A Set of data
  2. Usage requirements
  1. 一组数据
  2. 使用要求

and write some test cases.

并编写一些测试用例。

It also depends on how you sort the list (if it's sorted at all), what kind of comparisons need to be made, how long the "Compare" operation takes for the particular object in the list, or even how you intend to use the collection.

它还取决于您如何对列表进行排序(如果它已排序)、需要进行什么样的比较、对列表中的特定对象执行“比较”操作需要多长时间,甚至您打算如何使用收藏。

Generally, the best one to choose isn't so much based on the size of data you're working with, but rather how you intend to access it. Do you have each piece of data associated with a particular string, or other data? A hash based collection would probably be best. Is the order of the data you're storing important, or are you going to need to access all of the data at the same time? A regular list may be better then.

一般来说,最好的选择并不是基于您正在处理的数据的大小,而是您打算如何访问它。您是否拥有与特定字符串或其他数据相关联的每条数据?基于散列的集合可能是最好的。您存储的数据的顺序是否重要,或者您是否需要同时访问所有数据?一个常规的列表可能会更好。

Additional:

额外的:

Of course, my above comments assume 'performance' means data access. Something else to consider: what are you looking for when you say "performance"? Is performance individual value look up? Is it management of large (10000, 100000 or more) value sets? Is it the performance of filling the data structure with data? Removing data? Accessing individual bits of data? Replacing values? Iterating over the values? Memory usage? Data copying speed? For example, If you access data by a string value, but your main performance requirement is minimal memory usage, you might have conflicting design issues.

当然,我上面的评论假设“性能”意味着数据访问。还有一点需要考虑:当你说“性能”时,你在寻找什么?绩效个人价值查找?是否管理大型(10000、100000 或更多)值集?是用数据填充数据结构的性能吗?删除数据?访问单个数据位?替换值?迭代值?内存使用情况?数据复制速度?例如,如果您通过字符串值访问数据,但您的主要性能要求是最少的内存使用,则您可能会遇到冲突的设计问题。

回答by Muis

You can use a HybridDictionary which automaticly detects the breaking point, and accepts null-values, making it essentialy the same as a HashSet.

您可以使用 HybridDictionary 自动检测断点并接受空值,使其本质上与 HashSet 相同。

回答by Adam Rosenfield

It depends. If the exact answer really matters, do some profiling and find out. If you're sure you'll never have more than a certain number of elements in the set, go with a List. If the number is unbounded, use a HashSet.

这取决于。如果确切的答案真的很重要,请进行一些分析并找出答案。如果您确定集合中的元素永远不会超过一定数量,请使用 List。如果数字无界,则使用 HashSet。

回答by Peter

Depends on what you're hashing. If your keys are integers you probably don't need very many items before the HashSet is faster. If you're keying it on a string then it will be slower, and depends on the input string.

取决于你在散列什么。如果您的键是整数,则在 HashSet 更快之前您可能不需要很多项目。如果您在字符串上键入它,那么它会更慢,并且取决于输入字符串。

Surely you could whip up a benchmark pretty easily?

你当然可以很容易地建立一个基准吗?