Java 哈希表的基本原理?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/282712/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-11 12:27:50  来源:igfitidea点击:

The fundamentals of Hash tables?

javahashtable

提问by kylex

I'm quite confused about the basic concepts of a Hash table. If I were to code a hash how would I even begin? What is the difference between a Hash table and just a normal array?

我对哈希表的基本概念很困惑。如果我要编码一个散列,我将如何开始?哈希表和普通数组有什么区别?

Basically if someone answered this question I think all my questions would be answered: If I had 100 randomly generated numbers (as keys), how would I implement a hash table and why would that be advantageous over an array?

基本上,如果有人回答了这个问题,我想我所有的问题都会得到解答:如果我有 100 个随机生成的数字(作为键),我将如何实现哈希表,为什么这比数组更有利?

Psuedo-code or Java would be appreciated as a learning tool...

伪代码或 Java 将作为学习工具受到赞赏......

采纳答案by Adam Liss

The answers so far have helped to define hash tables and explain some theory, but I think an example may help you get a better feeling for them.

到目前为止的答案有助于定义哈希表并解释一些理论,但我认为一个例子可能会帮助您更好地了解它们。

What is the difference between a hash table and just a normal array?

哈希表和普通数组有什么区别?

A hash table and an array are both structures that allow you to store and retrieve data. Both allow you to specify an indexand retrieve a value associated with it. The difference, as Daniel Spiewak noted, is that the indices of an array are sequential, while those of a hash table are based on the value of the dataassociated with them.

哈希表和数组都是允许您存储和检索数据的结构。两者都允许您指定索引并检索与其关联的值。正如 Daniel Spiewak 所指出的,区别在于数组的索引是连续的,而哈希表的索引基于与其关联的数据

Why would I use a hash table?

为什么要使用哈希表?

A hash table can provide a very efficient way to search for items in large amounts of data, particularly data that is not otherwise easily searchable. ("Large" here means ginormous, in the sense that it would take a long time to perform a sequential search).

哈希表可以提供一种非常有效的方式来搜索大量数据中的项目,尤其是那些不容易搜索的数据。(这里的“大”意味着极大,从某种意义上说,执行顺序搜索需要很长时间)。

If I were to code a hash how would I even begin?

如果我要编码一个散列,我将如何开始?

No problem. The simplest way is to invent an arbitrary mathematical operation that you can perform on the data, that returns a number N(usually an integer). Then use that number as the index into an array of "buckets" and store your data in bucket #N. The trick is in selecting an operation that tends to place values in different buckets in a way that makes it easy for your to find them later.

没问题。最简单的方法是发明一种可以对数据执行的任意数学运算,它返回一个数字N(通常是一个整数)。然后使用该数字作为“存储桶”数组的索引,并将您的数据存储在存储桶 # 中N。诀窍在于选择一个倾向于将值放在不同存储桶中的操作,以便您以后轻松找到它们。

Example:A large mall keeps a database of its patrons' cars and parking locations, to help shoppers remember where they parked. The database stores make, color, license plate, and parking location. On leaving the store a shopper finds his car by entering the its make and color. The database returns a (relatively short) list of license plates and parking spaces. A quick scan locates the shopper's car.

示例:一家大型购物中心保留了顾客汽车和停车位置的数据库,以帮助购物者记住他们的停车位置。数据库存储makecolorlicense plate,和parking location。在离开商店时,购物者通过输入其品牌和颜色来找到他的汽车。数据库返回一个(相对较短的)车牌和停车位列表。快速扫描即可找到购物者的汽车。

You could implement this with an SQL query:

您可以使用 SQL 查询来实现这一点:

SELECT license, location FROM cars WHERE make="$(make)" AND color="$(color)"

If the data were stored in an array, which is essentially just a list, you can imagine implementing the query by scanning an array for all matching entries.

如果数据存储在一个数组中,它本质上只是一个列表,您可以想象通过扫描数组中所有匹配条目来实现查询。

On the other hand, imagine a hash rule:

另一方面,想象一个哈希规则:

Add the ASCII character codes of all the letters in the make and color, divide by 100, and use the remainder as the hash value.

将make和color中所有字母的ASCII字符码相加,除以100,余数作为hash值。

This rule will convert each item to a number between 0 and 99, essentially sortingthe data into 100 buckets. Each time a customer needs to locate a car, you can hash the make and color to find the onebucket out of 100 that contains the information. You've immediately reduced the search by a factor of 100!

此规则将每个项目转换为数字0到99之间,基本上排序的数据为100桶。每次客户需要定位汽车时,您都可以对品牌和颜色进行哈希处理,以从 100桶中找到包含该信息的那个桶。您立即将搜索减少了 100 倍!

Now scale the example to huge amounts of data, say a database with millions of entries that is searched based on tens of criteria. A "good" hash function will distribute the data into buckets in a way that minimizes any additional searching, saving a significant amount of time.

现在将示例扩展到大量数据,比如一个包含数百万个条目的数据库,该数据库基于数十个条件进行搜索。“好的”散列函数将以最小化任何额外搜索的方式将数据分布到桶中,从而节省大量时间。

回答by Daniel Spiewak

Hashtables are associative. This is a huge difference from arrays, which are just linear data structures. With an array, you might do something like this:

哈希表是关联的。这与数组有很大的不同,数组只是线性数据结构。使用数组,您可能会执行以下操作:

int[] arr = ...
for (int i = 0; i < arr.length; i++) {
    System.out.println(arr[i] + 1);
}

Notice how you are getting an element out of the array by specifying an exact memory offset (i). This contrasts with hashtables, which allow you to store key/value pairs, later retrieving the value based on the key:

请注意如何通过指定精确的内存偏移量 ( i)从数组中获取元素。这与哈希表形成对比,哈希表允许您存储键/值对,稍后根据键检索值:

Hashtable<String, Integer> table = new Hashtable<String, Integer>();
table.put("Daniel", 20);
table.put("Chris", 18);
table.put("Joseph", 16);

With the above table, we can make the following call:

有了上表,我们可以进行如下调用:

int n = table.get("Chris");

...and be assured that nwill be valued at 18.

...并放心,n将被重视18

I think this will probably answer most of your questions. The implementation of a hashtable is a fairly interesting topic, one which Wikipedia addresses passably well.

我想这可能会回答你的大部分问题。哈希表的实现是一个相当有趣的话题,维基百科很好地解决了这个话题。

回答by SoapBox

You wouldn't want to use a hash table for 100 randomly generated numbers.

您不会希望对 100 个随机生成的数字使用哈希表。

A good way to think about hash tables is to think about value pairs. Let's use students, and say everyone has a student ID number. In your program you store information on students (names, phone numbers, bills, etc). You want to find all of the information about a student using only basic information (name or student ID, for example).

考虑哈希表的一个好方法是考虑值对。让我们以学生为例,假设每个人都有一个学生证号。在您的程序中,您存储有关学生的信息(姓名、电话号码、账单等)。您想仅使用基本信息(例如姓名或学生 ID)查找有关学生的所有信息。

Let's say you have 10,000 students. If you store them all in an array, then you have to loop through the entire array comparing each entry's student ID with the one you are looking for.

假设您有 10,000 名学生。如果您将它们全部存储在一个数组中,那么您必须遍历整个数组,将每个条目的学生 ID 与您要查找的学生 ID 进行比较。

If, instead, you "hash" (see below) their student ID number to a position in the array, then you only have to search student's who's numbers have the same hash. Much less work to find what you wanted.

相反,如果您将他们的学生 ID 号“散列”(见下文)到数组中的某个位置,那么您只需搜索学生编号具有相同散列值的学生。找到你想要的东西的工作要少得多。

In this example, let's say student IDs are just 6 digit numbers. Our hash function could be use only the bottom 3 digits of the number as the "hash key". Thus 232145 is hashed to array location 145. So then you only need an array of 999 element (each element being a list of students).

在此示例中,假设学生 ID 只是 6 位数字。我们的哈希函数只能使用数字的后 3 位数字作为“哈希键”。因此 232145 被散列到数组位置 145。那么你只需要一个包含 999 个元素的数组(每个元素都是一个学生列表)。

That should be a good start for you. You should, of course, read a text book or wikipedia for this kind of info. But I assume you've already done that and are tired of reading.

这对你来说应该是一个好的开始。当然,您应该阅读教科书或维基百科以获取此类信息。但我假设你已经这样做了并且厌倦了阅读。

回答by Jason Coco

I'll answer that part about the difference between a hash table and an array... but since I've never implemented a hashing algorithm of any import before, I'll leave that to somebody more knowledgeable :)

我将回答关于哈希表和数组之间区别的那部分...但由于我以前从未实现过任何导入的哈希算法,我将把它留给更有知识的人:)

An array is just an ordered list of objects. The object itself doesn't really matter... what's important is that if you want to list the objects in order of insertion, it is always the same (meaning that the first element alwayshas an index of 0).

数组只是对象的有序列表。对象本身并不重要……重要的是,如果您想按插入顺序列出对象,它总是相同的(意味着第一个元素的索引总是为 0)。

As for a hashtable, that's indexed by keys, not order... I think that a basic search on hashing algorithms will give you a lot more insight than I can... Wikipedia has a very decent one... that determines "bucket" that the keys go into for quick retrieval on arbitrary objects used as keys.

至于哈希表,它是由键索引的,而不是顺序……我认为对哈希算法的基本搜索会给你比我能提供的更多的洞察力……维基百科有一个非常不错的……它决定了“桶" 键用于快速检索用作键的任意对象。

As for advantages: If order of insertion is important, an array or some kind of ordered list is necessary. If fast look-up by arbitrary key (keyed by various hash functions) is important, then a hash table makes sense.

至于优点:如果插入顺序很重要,则需要一个数组或某种有序列表。如果通过任意键(由各种散列函数键控)进行快速查找很重要,那么散列表是有意义的。

回答by ashokgelal

[This is the reply to a comment made by me.yahoo.com/a above]

[这是对上面me.yahoo.com/a评论的回复]

That depends on your hash function. Lets suppose that your hash function hashes a word as per the length of your word, the key for chris will be 5. Similarly, key for yahoo will also be 5. Now, both values (chris and yahoo) will go under 5 (i.e. in a 'bucket' keyed by 5). This way you don't have to make an array equal to the size of your data.

这取决于您的哈希函数。假设您的哈希函数根据单词的长度对单词进行哈希处理,chris 的 key 为 5。同样,yahoo 的 key 也为 5。现在,两个值(chris 和 yahoo)都将低于 5(即在由 5) 键控的“桶”中。这样您就不必使数组等于数据的大小。

回答by S.Lott

"I'm more interested in the way Hash Tables look up the key and how the key is generated."

“我对哈希表查找密钥的方式以及密钥的生成方式更感兴趣。”

  1. Hashing transforms a key object to a number. This is called "hashing" -- it makes a hash out of the object. See Hash Function. Summing the bytes of a string, for example, is a standard hash technique. You compute the sum modulo 232to keep the hash to a manageable size. Hash always gives the same answer. This is O(1).

  2. The number gives you a "slot" in the HashTable. Given an arbitrary key object, the hash value computes a hash value. The hash value then gives you the slot in table. Usually mod( hash, table size ). This is O(1), also.

  1. 散列将键对象转换为数字。这称为“散列”——它从对象中生成散列。请参阅哈希函数。例如,对字符串的字节求和是一种标准的散列技术。您计算总和模 2 32以将散列保持在可管理的大小。哈希总是给出相同的答案。这是O(1)。

  2. 该数字在 HashTable 中为您提供了一个“槽”。给定任意键对象,散列值计算散列值。哈希值然后为您提供表中的插槽。通常mod( hash, table size )。这也是O(1)。

That's the general solution. Two numeric calculations and you've gone from arbitrary object as key to arbitrary object as value. Few things can be as fast.

这是通用的解决方案。两个数字计算,你已经从任意对象作为键到任意对象作为值。很少有事情可以这么快。

The transformation from object to hash value happens in one of these common ways.

从对象到哈希值的转换以这些常见方式之一发生。

  1. If it's a "primitive" object of 4 bytes, then the object's native value is a number.

  2. The object's address is 4 bytes, then the object's address can be used as a hash value.

  3. A simple hash function(MD5, SHA1, whatever) accumulates the bytes of the object to create a 4-byte number. The advanced hashes aren't simple sums of bytes, a simple sum doesn't reflect all the original input bits fairly enough.

  1. 如果它是一个 4 字节的“原始”对象,则该对象的本机值是一个数字。

  2. 对象的地址是4个字节,那么对象的地址就可以作为一个哈希值。

  3. 一个简单的散列函数(MD5、SHA1 等)累积对象的字节以创建一个 4 字节的数字。高级散列不是简单的字节总和,简单的总和不能充分反映所有原始输入位。

The slot in the hash table is mod( number, size of table ).

哈希表中的槽位是 mod( number, size of table )。

If that slot has the desired value, you're done. If that's not the desired value, you need to look somewhere else. There are several popular probing algorithms to look for a free spot in the table. Linear is a simple search for the next free spot. Quadratic is a non-linear hopping around looking for a free slot. A random number generator (with a fixed seed) can be used to generate a series of probes that will spread data evenly but arbitrarily.

如果该插槽具有所需的值,您就完成了。如果这不是所需的值,则需要查看其他地方。有几种流行的探测算法可以在表中寻找空闲位置。线性是对下一个空闲位置的简单搜索。Quadratic 是一种寻找空闲时隙的非线性跳跃。随机数生成器(具有固定种子)可用于生成一系列探针,这些探针将均匀但任意地传播数据。

The probing algorithms are not O(1). If the table's big enough, the odds of collision are low, and probes don't matter. If the table's too small, then collisions happen and probing happens. At that point, it becomes a matter of "tuning and tweaking" to balance probing and table size to optimize performance. Usually we just make the table bigger.

探测算法不是O(1)。如果桌子足够大,碰撞的几率就会很低,探针也无关紧要。如果表太小,则会发生冲突并进行探测。在这一点上,平衡探测和表大小以优化性能就变成了“调整和调整”的问题。通常我们只是把桌子变大。

See Hash Table.

请参阅哈希表

回答by gnud

First, you have to understand a what a hash function is. A hash function is a function that takes a key (for example, an string of arbritrary length) and returns a number as unique as possible. The same key must always return the same hash. A really simple string hashing function in java might look like

首先,您必须了解哈希函数是什么。散列函数是一个函数,它接受一个键(例如,任意长度的字符串)并返回一个尽可能唯一的数字。相同的键必须始终返回相同的哈希值。java中一个非常简单的字符串散列函数可能看起来像

public int stringHash(String s) {
    int h = s.length();
    for(char c : s.toCharArray()) {
        h ^= c;
    }
    return h;
}

You can study a good hash function at http://www.azillionmonkeys.com/qed/hash.html

你可以在http://www.azillionmonkeys.com/qed/hash.html学习一个很好的哈希函数

Now, the hash map uses this hash value to place the value into an array. Simplistic java method:

现在,哈希映射使用此哈希值将该值放入一个数组中。简单的java方法:

public void put(String key, Object val) {
    int hash = stringHash(s) % array.length;
    if(array[hash] == null) {
        array[hash] = new LinkedList<Entry<String, Object> >();
    }
    for(Entry e : array[hash]) {
        if(e.key.equals(key)){
            e.value = val;
            return;
        }
    }
    array[hash].add(new Entry<String, Object>(key, val));
}

(This map enforces unique keys. Not all maps do.)

(此映射强制执行唯一键。并非所有映射都如此。)

It is possible for two different keys to hash to the same value, or two different hashes to map to the same array index. There exists many techniques for dealing with this. The simplest is to use a linked list (or binary tree) for each array index. If the hash function is good enough, you will never need a linear search.

两个不同的键可能散列到相同的值,或者两个不同的散列映射到相同的数组索引。有许多技术可以解决这个问题。最简单的方法是为每个数组索引使用一个链表(或二叉树)。如果哈希函数足够好,您将永远不需要线性搜索。

Now to look up a key:

现在查找一个键:

public Object get(String key) {
    int hash = stringHash(key) % array.length;
    if(array[hash] != null) {
        for(Entry e : array[hash]) {
            if(e.key.equals(key))
                return e.value;
        }
    }

    return null;
}

回答by SquareCog

Here is, in short, how a hash table works.

简而言之,这是哈希表的工作原理。

Imagine you have a library, full of books. If you were to store the books in an array, you would put each book on a spot on a shelf, and then when someone asked you to find a book, you'd look through all the shelves -- pretty slow. If someone said "book #12345", you could find it pretty easily, though.

想象一下,你有一个图书馆,里面满是书。如果你把书放在一个数组里,你会把每本书放在书架上的一个位置,然后当有人让你找一本书时,你会翻遍所有书架——非常慢。不过,如果有人说“book #12345”,你可以很容易地找到它。

Let's say instead you say, if the book title starts with 'A', it goes in row 1. If the second letter is 'B', it goes in row 1, rack 2. If the third letter is 'C', it goes in row 1, rack 2, shelf 3... and so on until you identify the book position. Then, based on the title of the book, you could know exactly where it should be.

假设您说,如果书名以“A”开头,则放在第 1 行。如果第二个字母是“B”,则放在第 1 行,第 2 行。如果第三个字母是“C”,则它进入第 1 行、第 2 架、第 3 架……依此类推,直到您确定书的位置。然后,根据书名,您可以确切地知道它应该在哪里。

Now, there are some problems in the simplistic "hashing" algorithm I described -- some shelves are going to be way overloaded while others stand empty, some books will be assigned to the same slot.. so the real hash functions are carefully constructed to try to avoid such problems.

现在,我所描述的简单的“散列”算法存在一些问题——一些书架会超载而另一些书架是空的,一些书将被分配到同一个插槽......所以真正的散列函数被仔细构造为尽量避免此类问题。

But this is the basic idea.

但这是基本思想。

回答by CodingWithSpike

Something I didn't see specifically noted yet:

我还没有看到特别指出的东西:

The point of using a hash table over an array is performance.

在数组上使用哈希表的重点是性能。

Iterating through an array would typically take anywhere from O(1) to O(x) where x is the number of items in the array. However the time to find your item will be extremely variable, expecially if we are talking about hundreds of thousands of items in the array.

遍历数组通常需要从 O(1) 到 O(x) 的任何地方,其中 x 是数组中的项目数。然而,找到您的项目的时间将是非常可变的,特别是如果我们正在谈论数组中的数十万个项目。

A properly weighted hash table typically has an almost constantaccess time of just over O(1), no matter how many items are in the hash table.

无论哈希表中有多少项,适当加权的哈希表通常几乎恒定的访问时间刚好超过 O(1)。

回答by Omer Akhter

The question, I believe, is answered quite clearly and in many different ways by now.

我相信,这个问题现在已经以许多不同的方式得到了非常清楚的回答。

I would just like to add another perspective (which may confuse a new reader as well)

我只想添加另一个观点(这也可能会使新读者感到困惑)

At a level of least abstraction, arrays are just contiguous block of memory. Given the starting address (startAddress), size (sizeOfElement) and the indexof a single element, the address of element is computed as:

在最低抽象级别,数组只是连续的内存块。给定单个元素的起始地址 ( startAddress)、大小 ( sizeOfElement) 和 ,index元素的地址计算如下:

elementAddress = startAddress + sizeOfElement * index

The interesting thing to note here is that arrays can be abstracted/viewed as hash tables with indexas key and the above function as a hash function which calculates the location of a value in O(1)

这里要注意的有趣的事情是,可以将数组抽象/查看为具有indexas 键的哈希表,上面的函数作为哈希函数计算值在O(1) 中的位置