java 一个很好的散列函数用于整数、字符串的面试?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/6082915/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-10-30 14:15:19  来源:igfitidea点击:

A good hash function to use in interviews for integer numbers, strings?

javahash-function

提问by phoenix


I have come across situations in an interview where I needed to use a hash function for integer numbers or for strings. In such situations which ones should we choose ? I've been wrong in these situations because I end up choosing the ones which have generate lot of collisions but then hash functions tend to be mathematical that you cannot recollect them in an interview. Are there any general recommendations so atleast the interviewer is satisfied with your approach for integer numbers or string inputs? Which functions would be adequate for both inputs in an "interview situation"


我在面试中遇到过需要对整数或字符串使用哈希函数的情况。在这种情况下我们应该选择哪些?在这些情况下我错了,因为我最终选择了那些会产生大量冲突的函数,但是哈希函数往往是数学的,你在面试中无法回忆它们。是否有任何一般性建议,至少面试官对您的整数或字符串输入方法感到满意?在“面试情况”中,哪些功能适用于两种输入

回答by Premraj

Here is a simple recipe from Effective java page 33:

这是Effective java 第 33 页的一个简单配方:

  1. Store some constant nonzero value, say, 17, in an int variable called result.
  2. For each significant field f in your object (each field taken into account by the equals method, that is), do the following:
    1. Compute an int hash code c for the field:
      • If the field is a boolean, compute (f ? 1 : 0).
      • If the field is a byte, char, short, or int, compute (int) f.
      • If the field is a long, compute (int) (f ^ (f >>> 32)).
      • If the field is a float, compute Float.floatToIntBits(f).
      • If the field is a double, compute Double.doubleToLongBits(f), and then hash the resulting long as in step 2.1.iii.
      • If the field is an object reference and this class's equals method compares the field by recursively invoking equals, recursively invoke hashCode on the field. If a more complex comparison is required, compute a “canonical representation” for this field and invoke hashCode on the canonical representation. If the value of the field is null, return 0 (or some other constant, but 0 is traditional). 48 CHAPTER 3 METHODS COMMON TO ALL OBJECTS
      • If the field is an array, treat it as if each element were a separate field. That is, compute a hash code for each significant element by applying these rules recursively, and combine these values per step 2.b. If every element in an array field is significant, you can use one of the Arrays.hashCode methods added in release 1.5.
    2. Combine the hash code c computed in step 2.1 into result as follows: result = 31 * result + c;
  3. Return result.
  4. When you are finished writing the hashCode method, ask yourself whether equal instances have equal hash codes. Write unit tests to verify your intuition! If equal instances have unequal hash codes, figure out why and fix the problem.
  1. 将一些恒定的非零值(例如 17)存储在名为 result 的 int 变量中。
  2. 对于对象中的每个重要字段 f(即 equals 方法考虑的每个字段),执行以下操作:
    1. 计算该字段的 int 哈希码 c:
      • 如果该字段是布尔值,则计算 (f ? 1 : 0)。
      • 如果字段是字节、字符、短整型或整数,则计算 (int) f。
      • 如果该字段是 long,则计算 (int) (f ^ (f >>> 32))。
      • 如果该字段是浮点数,则计算 Float.floatToIntBits(f)。
      • 如果该字段是双精度型,则计算 Double.doubleToLongBits(f),然后按照步骤 2.1.iii 中的方法对结果长整型进行散列。
      • 如果该字段是一个对象引用并且该类的equals方法通过递归调用equals来比较该字段,则对该字段递归调用hashCode。如果需要更复杂的比较,请计算此字段的“规范表示”并在规范表示上调用 hashCode。如果该字段的值为空,则返回 0(或其他一些常量,但 0 是传统的)。48 第 3 章所有对象通用的方法
      • 如果该字段是一个数组,则将其视为每个元素都是一个单独的字段。也就是说,通过递归地应用这些规则来计算每个重要元素的哈希码,并在每个步骤 2.b 中组合这些值。如果数组字段中的每个元素都很重要,您可以使用 1.5 版中添加的 Arrays.hashCode 方法之一。
    2. 将步骤 2.1 中计算出的哈希码 c 合并为 result 如下: result = 31 * result + c;
  3. 返回结果。
  4. 完成 hashCode 方法的编写后,问问自己相等的实例是否具有相等的哈希码。编写单元测试来验证您的直觉!如果相等的实例具有不相等的哈希码,找出原因并解决问题。

回答by mikera

You should ask the interviewer what the hash function is for - the answer to this question will determine what kind of hash function is appropriate.

你应该问面试官散列函数是做什么用的——这个问题的答案将决定什么样的散列函数是合适的。

  • If it's for use in hashed data structureslike hashmaps, you want it to be a simple as possible (fast to execute) and avoid collisions (most common values map to different hash values). A good example is an integer hashing to the same integer - this is the standard hashCode() implementation in java.lang.Integer

  • If it's for security purposes, you will want to use a cryptographic hash function. These are primarily designed so that it is hard to reverse the hash function or find collisions.

  • If you want fast pseudo-random-ishhash values (e.g. for a simulation) then you can usually modify a pseudo-random number generator to create these. My personal favourite is:

  • 如果它用于哈希映射等哈希数据结构,您希望它尽可能简单(执行速度快)并避免冲突(最常见的值映射到不同的哈希值)。一个很好的例子是一个整数散列到同一个整数 - 这是 java.lang.Integer 中的标准 hashCode() 实现

  • 如果出于安全目的,您将需要使用加密散列函数。这些主要是为了很难逆转散列函数或发现冲突。

  • 如果您想要快速的伪随机哈希值(例如用于模拟),那么您通常可以修改伪随机数生成器来创建这些值。我个人最喜欢的是:

public static final int hash(int a) {         
      a ^= (a << 13);
      a ^= (a >>> 17);        
      a ^= (a << 5);
      return a;   
}
public static final int hash(int a) {         
      a ^= (a << 13);
      a ^= (a >>> 17);        
      a ^= (a << 5);
      return a;   
}

If you are computing a hash for some form of composite structure (e.g. a string with multiple characters, or an array, or an object with multiple fields), then there are various techniques you can use to create a combined hash function. I'd suggest something that XORs the rotated hash values of the constituent parts, e.g.:

如果您正在为某种形式的复合结构(例如具有多个字符的字符串、或数组或具有多个字段的对象)计算散列,那么您可以使用多种技术来创建组合散列函数。我建议对组成部分的旋转哈希值进行异或运算,例如:

public static <T> int hashCode(T[] data) {
    int result=0;
    for(int i=0; i<data.length; i++) {
        result^=data[i].hashCode();
        result=Integer.rotateRight(result, 1);
    }
    return result;
}

Note the above is not cryptographically secure, but will do for most other purposes. You will obviously get collisions but that's unavoidable when hashing a large structure to a integer :-)

请注意,上述内容在加密方面并不安全,但可以用于大多数其他目的。您显然会遇到冲突,但在将大型结构散列为整数时这是不可避免的:-)

回答by Tony Delroy

For integers, I usually go with k % p where p = size of the hash table and is a prime number and for strings I choose hashcode from String class. Is this sufficient enough for an interview with a major tech company? – phoenix 2 days ago

对于整数,我通常使用 k % p ,其中 p = 哈希表的大小并且是素数,对于字符串,我从 String 类中选择哈希码。这足以面试一家大型科技公司吗?– 凤凰 2 天前

Maybe not. It's not uncommon to need to provide a hash function to a hash table whose implementation is unknown to you. Further, if you hash in a way that depends on the implementation using a prime number of buckets, then your performance may degrade if the implementation changes due to a new library, compiler, OS port etc..

也许不会。需要为您不知道其实现的哈希表提供哈希函数的情况并不少见。此外,如果您以依赖于使用质数桶的实​​现的方式进行散列,那么如果实现由于新库、编译器、操作系统端口等而发生变化,您的性能可能会降低。

Personally, I think the important thing at interview is a clear understanding of the ideal characteristics of a general-purpose hash algorithm, which is basically that for any two input keys with values varying by as little as one bit, each and every bit in the output has about 50/50 chance of flipping. I found that quite counter-intuitive because a lot of the hashing functions I first saw used bit-shifts and XOR and a flipped input bit usually flipped one output bit (usually in another bit position, so 1-input-bit-affects-many-output-bits was a little revelation moment when I read it in one of Knuth's books. With this knowledge you're at least capable of testing and assessing specific implementations regardless of how they're implemented.

就我个人而言,我认为面试中重要的事情是清楚地了解通用哈希算法的理想特性,这基本上是对于任何两个输入键,其值仅变化一位,每一位在输出有大约 50/50 的机会翻转。我发现这很违反直觉,因为我第一次看到的很多散列函数都使用位移和异或,而翻转的输入位通常翻转一个输出位(通常在另一个位位置,所以 1-input-bit-affects-many -output-bits 当我在 Knuth 的一本书中读到它时,是一个小小的启示时刻。有了这些知识,你至少能够测试和评估特定的实现,而不管它们是如何实现的。

One approach I'll mention because it achieves this ideal and is easy to remember, though the memory usage may make it slower than mathematical approaches (could be faster too depending on hardware), is to simply use each byte in the input to look up a table of random ints. For example, given a 24-bit RGB value and int table[3][256], table[0][r] ^ table[1][g] ^ table[2][b]is a great sizeof inthash value - indeed "perfect" if inputs are randomly scattered through the intvalues (rather than say incrementing - see below). This approach isn't ideal for long or arbitrary-length keys, though you can start revisiting tables and bit-shift the values etc..

我要提到的一种方法是因为它实现了这个理想并且很容易记住,尽管内存使用可能使它比数学方法慢(也可能更快,取决于硬件),是简单地使用输入中的每个字节来查找随机整数表。例如,给定一个 24 位的 RGB 值int table[3][256]table[0][r] ^ table[1][g] ^ table[2][b]是一个很好的sizeof int哈希值 - 如果输入随机分散在int值中(而不是说递增 - 见下文),则确实是“完美的” 。这种方法对于长键或任意长度的键并不理想,但您可以开始重新访问表并对值进行位移等。

All that said, you can sometimesdo better than this randomising approach for specific cases where you are aware of the patterns in the input keys and/or the number of buckets involved (for example, you may know the input keys are contiguous from 1 to 100 and there are 128 buckets, so you can pass the keys through without anycollisions). If, however, the input ceases to meet your expectations, you can get horrible collision problems, while a "randomising" approach should never get much worse than load (size() / buckets) implies. Another interesting insight is that when you want a quick-and-mediocre hash, you don't necessarily have to incorporate all the input data when generating the hash: e.g. last time I looked at Visual C++'s string hashing code it picked ten letters evenly spaced along the text to use as inputs....

说了这么多,你能有时做得比对,你是知道的输入键和/或参与桶(例如数量模式的特定情况下,这种随机化的方法,你可以知道输入的键连续从1到100 并且有 128 个存储桶,因此您可以将密钥通过而无需任何碰撞)。但是,如果输入不再满足您的期望,您可能会遇到可怕的碰撞问题,而“随机化”方法永远不会比负载(大小()/桶)所暗示的更糟糕。另一个有趣的见解是,当您想要一个快速而平庸的散列时,您不必在生成散列时合并所有输入数据:例如,上次我查看 Visual C++ 的字符串散列代码时,它选择了十个均匀间隔的字母沿着文本用作输入....