C# 为什么我们在 HashTable 中使用 Hash Code 而不是 Index?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/901034/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-05 22:13:05  来源:igfitidea点击:

Why we use Hash Code in HashTable instead of an Index?

c#algorithmhash

提问by Jaywith.7

  • How that integer hash is generated by the GetHashCode() function? Is it a random value which is not unique?

  • In string, it is overridden to make sure that there exists only one hash code for a particular string. How to do that?

  • How searching for specific key in a hash table is speeded up using hash code?

  • What are the advantages of using hash code over using an index directly in the collection (like in arrays)?

  • GetHashCode() 函数如何生成整数哈希?它是不是唯一的随机值吗?

  • 在字符串中,它被覆盖以确保特定字符串只存在一个哈希码。怎么做?

  • 如何使用哈希码加快在哈希表中搜索特定键的速度?

  • 与在集合中直接使用索引(如在数组中)相比,使用哈希码有什么优势?

Can someone help?

有人可以帮忙吗?

回答by BobMcGee

Basically, hash functions use some generic function to digest data and generate a fingerprint (and integer number here) for that data. Unlike an index, this fingerprint depends ONLY on the data, and should be free of any predictable ordering based on the data. Any change to a single bit of the data should also change the fingerprint considerably.

基本上,散列函数使用一些通用函数来消化数据并为该数据生成指纹(此处为整数)。与索引不同,此指纹仅取决于数据,并且不应基于数据进行任何可预测的排序。对单个数据位的任何更改也会显着改变指纹。

Notice that nowhere does this guarantee that different data won't give the same hash. In fact, quite the opposite: this happens very often, and is called a collision. But, with an integer, the probability is roughly 1 in 4 billion against this (1 in 2^32). If a collision happens, you just compare the actual object you are hashing to see if they match.

请注意,这并不能保证不同的数据不会给出相同的哈希值。事实上,恰恰相反:这种情况经常发生,称为碰撞。但是,对于整数,与此相反的概率大约为 40 亿分之一(2^32 中的 1 个)。如果发生碰撞,您只需比较您正在散列的实际对象,看看它们是否匹配。

This fingerprint can then be used as an index to an array (or arraylist) of stored values. Because the fingerprint is dependent only on the data, you can compute a hash for something and just check the array element for that hash value to see if it has been stored already. Otherwise, you'd have to go through the whole array checking if it matches an item.

然后,该指纹可以用作存储值的数组(或数组列表)的索引。因为指纹仅依赖于数据,所以您可以计算某些东西的哈希值,然后检查该哈希值的数组元素以查看它是否已经存储。否则,您必须检查整个数组是否与项目匹配。

You can also VERY quickly do associative arrays by using 2 arrays, one with Key values (indexed by hash), and a second with values mapped to those keys. If you use a hash, you just need to know the key's hash to find the matching value for the key. This is much faster than doing a binary search on a sorted key list, or a scan of the whole array to find matching keys.

您还可以通过使用 2 个数组非常快速地创建关联数组,一个带有键值(由哈希索引),第二个带有映射到这些键的值。如果您使用散列,您只需要知道键的散列即可找到该键的匹配值。这比对排序的键列表进行二分搜索或扫描整个数组以查找匹配键要快得多。

There are MANY ways to generate a hash, and all of them have various merits, but few are simple. I suggest consulting the wikipedia page on hash functions for more info.

生成散列的方法有很多种,它们都有各种优点,但很少有简单的方法。我建议查阅有关哈希函数的维基百科页面以获取更多信息。

回答by Henk Holterman

A HashCode is a pseudo unique key. We would like to have a really unique key but that's not feasible. We settle for a fast and safe (no exceptions) function.

HashCode 是一个伪唯一键。我们想要一个真正唯一的密钥,但这是不可行的。我们满足于快速安全(无例外)的功能。

A HashTableuses the HashCode to do a lookup in O(1) time initially. Any indexing scheme requires O(log(n)) time. But with an inefficient HashCode function the collision handling can make the HashTable a lot slower.

哈希表使用的哈希码最初执行查找在O(1)时间。任何索引方案都需要 O(log(n)) 时间。但是对于一个低效的 HashCode 函数,冲突处理会​​使 HashTable 变慢很多。

In .NET there is a default implementation for GetHashCode, but types can override this.

在 .NET 中有 GetHashCode 的默认实现,但类型可以覆盖它。

the System.String overrides GetHashCode() because it overrides Equals() and then GetHashCode has to be kept consistent.

System.String 覆盖 GetHashCode() 因为它覆盖 Equals() 然后 GetHashCode 必须保持一致。

回答by Andrew Shepherd

Answering each one of your questions directly:

直接回答您的每一个问题:

How that integer hash is generated by the GetHashCode() function? Is it a random value which is not unique?

GetHashCode() 函数如何生成整数哈希?它是不是唯一的随机值吗?

An integer hash is generated by whatever method is appropriate for the object. The generation method is not random but must follow consistent rules, ensuring that a hash generated for one particular object will equal the hash generated for an equivalent object. As an example, a hash function for an integer would be to simply return that integer.

整数散列由任何适合对象的方法生成。生成方法不是随机的,但必须遵循一致的规则,确保为一个特定对象生成的散列将等于为等效对象生成的散列。例如,整数的哈希函数将简单地返回该整数。

In string, it is overridden to make sure that there exists only one hash code for a particular string. How to do that?

在字符串中,它被覆盖以确保特定字符串只存在一个哈希码。怎么做?

There are many ways this can be done. Here's an example I'm thinking of on the spot:

有很多方法可以做到这一点。这是我在现场想到的一个例子:

int hash = 0;
for(int i = 0; i < theString.Length; ++i)
{
    hash ^= theString[i];
}

This is a valid hash algorithm, because the same sequence of characters will always produce the same hash number. It's not a goodhash algorithm (an extreme understatement), because many strings will produce the same hash. A valid hash algorithm doesn't have to guarantee uniqueness. A goodhash algorithm will make a chance of two differing objects producing the same number extremely unlikely.

这是一个有效的散列算法,因为相同的字符序列将始终产生相同的散列数。这不是一个好的散列算法(极端轻描淡写),因为许多字符串会产生相同的散列。有效的哈希算法不必保证唯一性。一个好的散列算法将使两个不同的对象产生相同数字的可能性极小。

How searching for specific key in a hash table is speeded up using hash code? What are the advantages of using hash code over using an index directly in the collection (like in arrays)?

如何使用哈希码加快在哈希表中搜索特定键的速度?与在集合中直接使用索引(如在数组中)相比,使用哈希码有什么优势?

A hash code is typically used in hash tables. A hash table is an array, but each entry in the array is a "bucket" of items, not just one item. If you have an object and you want to know which bucket it belongs in, calculate

哈希码通常用于哈希表中。哈希表是一个数组,但数组中的每个条目都是一个项目的“桶”,而不仅仅是一个项目。如果你有一个对象,你想知道它属于哪个桶,计算

 hash_value MOD hash_table_size. 

Then you simply have to compare the object with every item in the bucket. So a hash table lookup will most likely have a search time of O(1), as opposed to O(log(N)) for a sorted list or O(N) for an unsorted list.

然后您只需将对象与存储桶中的每个项目进行比较。因此,哈希表查找很可能具有 O(1) 的搜索时间,而不是排序列表的 O(log(N)) 或未排序列表的 O(N)。

回答by Andrew Shepherd

A hash code IS an index, and a hash table, at its very lowest level, IS an array. But for a given key value, we determine the index into in a hash table differently, to make for much faster data retrieval.

哈希码是一个索引,而哈希表在其最低层是一个数组。但是对于给定的键值,我们以不同的方式确定哈希表中的索引,以加快数据检索速度。

Example: You have 1,000 words and their definitions. You want to store them so that you can retrieve the definition for a word very, very quickly -- faster than a binary search, which is what you would have to do with an array.

示例:您有 1,000 个单词及其定义。您希望存储它们,以便您可以非常非常快速地检索单词的定义——比二进制搜索更快,而这正是您必须对数组进行的操作。

So you create a hash table. You start with an array substantially bigger than 1,000 entries -- say 5,000 (the bigger, the more time-efficient).

所以你创建了一个哈希表。您从一个比 1,000 个条目大得多的数组开始——比如 5,000 个(越大,时间效率越高)。

The way you'll use your table is, you take the word to look up, and convert it to a number between 0 and 4,999. You choose the algorithm for doing this; that's the hashing algorithm. But you could doubtless write something that would be very fast.

使用表格的方式是,查找单词,然后将其转换为 0 到 4,999 之间的数字。您选择执行此操作的算法;这就是哈希算法。但是你无疑可以写出非常快的东西。

Then you use the converted number as an index into your 5,000-element array, and insert/find your definition at that index. There's no searching at all: you've createdthe index directly from the search word.

然后,您将转换后的数字用作 5,000 元素数组的索引,并在该索引处插入/查找您的定义。根本没有搜索:您直接从搜索词创建了索引。

All of the operations I've described are constant time; none of them takes longer when we increase the number of entries. We just need to make sure that there is sufficient space in the hash to minimize the chance of "collisions", that is, the chance that two different words will convert to the same integer index. Because that can happen with any hashing algorithm, we need to add checks to see if there is a collision, and do something special (if "hello" and "world" both hash to 1,234 and "hello" is already in the table, what will we do with "world"? Simplest is to put it in 1,235, and adjust our lookup logic to allow for this possibility.)

我描述的所有操作都是恒定时间;当我们增加条目数量时,它们都不会花费更长的时间。我们只需要确保散列中有足够的空间来最小化“冲突”的机会,即两个不同的词将转换为相同的整数索引的机会。因为任何散列算法都可能发生这种情况,所以我们需要添加检查以查看是否存在冲突,并做一些特殊的事情(如果“hello”和“world”都散列到 1,234 并且“hello”已经在表中,那么我们会用“world”来做吗?最简单的方法是把它放在 1,235 中,然后调整我们的查找逻辑以允许这种可能性。)

Edit: after re-reading your post: a hashing algorithm is most definitely not random, it must be deterministic. The index generated for "hello" in my example must be 1,234 every single time; that's the only way the lookup can work.

编辑:重新阅读您的帖子后:哈希算法绝对不是随机的,它必须是确定性的。在我的示例中为“hello”生成的索引每次必须为 1,234;这是查找工作的唯一方式。