Java 哈希:它在内部是如何工作的?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/4453476/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Hash : How does it work internally?
提问by Rachel
This might sound as an very vague question upfront but it is not. I have gone through Hash Functiondescription on wiki but it is not very helpful to understand.
这听起来可能是一个非常模糊的问题,但事实并非如此。我已经在维基上浏览了哈希函数描述,但理解起来并不是很有帮助。
I am looking simple answers for rather complex topics like Hashing. Here are my questions:
我正在寻找像哈希这样相当复杂的主题的简单答案。以下是我的问题:
- What do we mean by hashing? How does it work internally?
- What algorithm does it follow ?
- What is the difference between
HashMap
,HashTable
andHashList
? - What do we mean by 'Constant Time Complexity' and why does different implementation of the hash gives constant time operation ?
- Lastly, why in most interview questions
Hash
andLinkedList
are asked, is there any specific logic for it from testing interviewee's knowledge?
- 我们所说的哈希是什么意思?它在内部如何运作?
- 它遵循什么算法?
- 之间有什么区别
HashMap
,HashTable
和HashList
? - 我们所说的“恒定时间复杂度”是什么意思,为什么不同的哈希实现会提供恒定时间操作?
- 最后,为什么在大多数面试问题
Hash
和LinkedList
被问到的问题中,从测试受访者的知识来看有什么具体的逻辑吗?
I know my question list is big but I would really appreciate if I can get some clear answers to these questions as I really want to understand the topic.
我知道我的问题清单很大,但如果我能得到这些问题的明确答案,我将不胜感激,因为我真的很想了解这个话题。
采纳答案by Enrique
Hereis a good explanation about hashing. For example you want to store the string "Rachel" you apply a hash function to that string to get a memory location.
myHashFunction(key: "Rachel" value: "Rachel") --> 10
. The function may return 10 for the input "Rachel" so assuming you have an array of size 100 you store "Rachel" at index 10. If you want to retrieve that element you just callGetmyHashFunction("Rachel")
and it will return 10. Note that for this example the key is "Rachel" and the value is "Rachel" but you could use another value for that key for example birth date or an object. Your hash function may return the same memory location for two different inputs, in this case you will have a collision you if you are implementing your own hash table you have to take care of this maybe using a linked list or other techniques.Hereare some common hash functions used. A good hash function satisfies that: each key is equally likely to hash to any of the n memory slots independently of where any other key has hashed to. One of the methods is called the division method. We map a key k into one of n slots by taking the remainder of k divided by n.
h(k) = k mod n
. For example if your array size isn = 100
and your key is an integerk = 15
thenh(k) = 10
.Hashtable is synchronised and Hashmap is not. Hashmap allows null values as key but Hashtable does not.
The purpose of a hash table is to have O(c) constant time complexity in adding and getting the elements. In a linked list of size N if you want to get the last element you have to traverse all the list until you get it so the complexity is O(N). With a hash table if you want to retrieve an element you just pass the key and the hash function will return you the desired element. If the hash function is well implemented it will be in constant time O(c) This means you dont have to traverse all the elements stored in the hash table. You will get the element "instantly".
Of couse a programer/developer computer scientist needs to know about data structures and complexity =)
这是关于散列的一个很好的解释。例如,您想存储字符串“Rachel”,您可以对该字符串应用哈希函数以获取内存位置。
myHashFunction(key: "Rachel" value: "Rachel") --> 10
. 该函数可能会为输入“Rachel”返回 10,因此假设您有一个大小为 100 的数组,您将“Rachel”存储在索引 10 处。如果您想检索该元素,您只需调用GetmyHashFunction("Rachel")
它,它将返回 10。请注意,对于此示例键是“Rachel”,值是“Rachel”,但您可以为该键使用另一个值,例如生日或对象。您的哈希函数可能会为两个不同的输入返回相同的内存位置,在这种情况下,如果您正在实现自己的哈希表,您将遇到冲突,您必须使用链表或其他技术来处理这个问题。下面是一些常用的哈希函数。一个好的散列函数满足:每个键同样可能散列到 n 个内存槽中的任何一个,而与任何其他键散列到的位置无关。其中一种方法称为除法。我们通过将 k 的余数除以 n 将密钥 k 映射到 n 个槽之一。
h(k) = k mod n
. 例如,如果您的数组大小是n = 100
并且您的键是一个整数,k = 15
那么h(k) = 10
.Hashtable 是同步的,而 Hashmap 不是。Hashmap 允许空值作为键,但 Hashtable 不允许。
哈希表的目的是在添加和获取元素时具有 O(c) 恒定的时间复杂度。在大小为 N 的链表中,如果您想获取最后一个元素,则必须遍历所有列表直到得到它,因此复杂度为 O(N)。使用哈希表,如果你想检索一个元素,你只需传递键,哈希函数就会返回你想要的元素。如果散列函数实现得很好,它将在恒定时间 O(c) 这意味着您不必遍历存储在散列表中的所有元素。您将“立即”获得元素。
当然,程序员/开发人员计算机科学家需要了解数据结构和复杂性 =)
回答by SLaks
- Hashing means generating a (hopefully) unique number that represents a value.
- Different types of values (
Integer
,String
, etc) use different algorithms to compute a hashcode. - HashMap and HashTable are maps; they are a collection of unqiue keys, each of which is associated with a value.
Java doesn't have a HashList class. A HashSetis a set of unique values. - Getting an item from a hashtable is constant-time with regard to the size of the table.
Computing a hash is not necessarily constant-time with regard to the value being hashed.
For example, computing the hash of a string involves iterating the string, and isn't constant-time with regard to the size of the string. - These are things that people ought to know.
- 散列意味着生成一个(希望)代表一个值的唯一数字。
- 不同类型的值(
Integer
、String
等)使用不同的算法来计算哈希码。 - HashMap 和 HashTable 是映射;它们是 unqiue 键的集合,每个键都与一个值相关联。
Java 没有 HashList 类。哈希集是一组唯一值。 - 从哈希表中获取项目的时间与表的大小有关。
就被散列的值而言,计算散列不一定是恒定时间。
例如,计算字符串的哈希值涉及对字符串进行迭代,并且就字符串的大小而言不是常数时间。 - 这些都是人们应该知道的。
回答by Bozho
Hashing is transforming a given entity (in java terms - an object) to some number (or sequence). The hash function is not reversable - i.e. you can't obtain the original object from the hash. Internally it is implemented (for
java.lang.Object
by getting some memory address by the JVM.The JVM address thing is unimportant detail. Each class can override the
hashCode()
method with its own algorithm. Modren Java IDEs allow for generating good hashCode methods.Hashtable and hashmap are the same thing. They key-value pairs, where keys are hashed. Hash lists and hashsets don't store values - only keys.
Constant-time means that no matter how many entries there are in the hashtable (or any other collection), the number of operations needed to find a given object by its key is constant. That is - 1, or close to 1
This is basic computer-science material, and it is supposed that everyone is familiar with it. I think google have specified that the hashtable is the most important data-structure in computer science.
散列是将给定的实体(在 Java 术语中 - 一个对象)转换为某个数字(或序列)。散列函数是不可逆的——即你不能从散列中获得原始对象。它在内部实现(
java.lang.Object
通过 JVM 获取一些内存地址。JVM 地址是不重要的细节。每个类都可以
hashCode()
使用自己的算法覆盖该方法。现代 Java IDE 允许生成良好的 hashCode 方法。Hashtable 和 hashmap 是一回事。它们是键值对,其中键被散列。哈希列表和哈希集不存储值——只存储键。
恒定时间意味着无论哈希表(或任何其他集合)中有多少条目,通过键找到给定对象所需的操作数都是恒定的。即 - 1,或接近于 1
这是基本的计算机科学材料,应该每个人都熟悉。我认为谷歌已经指定哈希表是计算机科学中最重要的数据结构。
回答by Enrique
What do we mean by Hashing, how does it work internally ?
我们所说的哈希是什么意思,它在内部是如何工作的?
Hashing is the transformation of a string shorter fixed-length value or key that represents the original string. It is not indexing. The heart of hashing is the hash table. It contains array of items. Hash tables contain an index from the data item's key and use this index to place the data into the array.
散列是对表示原始字符串的较短的固定长度值或键的字符串的转换。它不是索引。散列的核心是散列表。它包含项目数组。哈希表包含来自数据项键的索引,并使用该索引将数据放入数组中。
What algorithm does it follow ?
它遵循什么算法?
In simple words most of the Hash algorithms work on the logic "index = f(key, arrayLength)"
简而言之,大多数哈希算法都适用于逻辑“index = f(key, arrayLength)”
Lastly, why in most interview questions Hash and LinkedList are asked, is there any specific logic for it from testing interviewee's knowledge ?
最后,为什么在大多数面试问题中都会问Hash和LinkedList,从测试受访者的知识来看,它有什么特定的逻辑吗?
Its about how good you are at logical reasoning. It is most important data-structure that every programmers know it.
它是关于你在逻辑推理方面有多好。这是每个程序员都知道的最重要的数据结构。
回答by ruslik
I'll try to give simple explanations of hashing and of its purpose.
我将尝试对散列及其目的进行简单的解释。
First, consider a simple list. Each operation (insert, find, delete) on such list would have O(n) complexity, meaning that you have to parse the whole list (or half of it, on average) to perform such an operation.
首先,考虑一个简单的列表。此类列表上的每个操作(插入、查找、删除)的复杂度为 O(n),这意味着您必须解析整个列表(或平均一半)才能执行此类操作。
Hashing is a very simple and effective way of speeding it up: consider that we split the whole list in a set of small lists. Items in one such small list would have something in common, and this something can be deduced from the key. For example, by having a list of names, we could use first letter as the quality that will choose in which small list to look. In this way, by partitioning the data by the first letter of the key, we obtained a simple hash, that would be able to split the whole list in ~30 smaller lists, so that each operation would take O(n)/30 time.
散列是一种非常简单有效的加速方法:考虑我们将整个列表拆分为一组小列表。如此小的列表中的项目会有一些共同点,而这可以从密钥中推导出来。例如,通过有一个名字列表,我们可以使用第一个字母作为选择在哪个小列表中查找的质量。这样,通过按键的第一个字母对数据进行分区,我们得到了一个简单的散列,它可以将整个列表分成约 30 个较小的列表,这样每个操作都需要 O(n)/30 次.
However, we could note that the results are not that perfect. First, there are only 30 of them, and we can't change it. Second, some letters are used more often than others, so that the set with Y
or Z
will be much smaller that the set with A
. For better results, it's better to find a way to partition the items in sets of roughly same size. How could we solve that? This is where you use hash functions. It's such a function that is able to create an arbitrary number of partitions with roughly the same number of items in each. In our example with names, we could use something like
但是,我们可以注意到结果并不是那么完美。首先,只有 30 个,我们无法更改。其次,有些字母比其他字母更频繁地使用,因此带有Y
或的集合Z
将比带有 的集合小得多A
。为了获得更好的结果,最好找到一种方法将项目划分为大致相同大小的集合。我们怎么能解决这个问题?这是您使用哈希函数的地方。它是这样一个函数,它能够创建任意数量的分区,每个分区的项目数量大致相同。在我们的名称示例中,我们可以使用类似
int hash(const char* str){
int rez = 0;
for (int i = 0; i < strlen(str); i++)
rez = rez * 37 + str[i];
return rez % NUMBER_OF_PARTITIONS;
};
This would assure a quite even distribution and configurable number of sets (also called buckets).
这将确保相当均匀的分布和可配置数量的集合(也称为桶)。