Java: How should I map long to int in hashCode()?

Notice: this page reproduces a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/4045063/

Date: 2020-08-14 10:54:40. Source: igfitidea

How should I map long to int in hashCode()?

Tags: java, algorithm, hash

Asked by Hanno Fietz

I have a range of objects that have a long field whose value uniquely identifies a particular object across my entire system, much like a GUID. I have overridden Object.equals() to use this id for comparison, because I want it to work with copies of the object. Now I want to override Object.hashCode(), too, which basically means mapping my long to some int return value.

If I understood the purpose of hashCode correctly, it is mainly used in hash tables, so a uniform distribution would be desirable. This would mean that simply returning id % 2^32 would suffice. Is that all, or should I be aware of something else?

Accepted answer by TofuBeer

Since Java 8 you can use

Long.hashCode(guid);

For older versions of Java you can use the following:

Long.valueOf(guid).hashCode();

Note that this version boxes the value into a Long object (a heap allocation, not a stack one), while the first doesn't, although the JIT compiler will likely optimize the allocation away.

Looking at the docs, both ways just use the following algorithm:

(int)(this.longValue()^(this.longValue()>>>32))

These are decent solutions since they rely on the Java library; it is always better to build on something that has already been tested.
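
To make the two variants above concrete, here is a minimal sketch of an id-based class. The class name Entity and the helper foldManually are made up for illustration; only Long.hashCode and the documented XOR fold come from the answer.

```java
// Hypothetical entity: the class and member names are illustrative only.
final class Entity {
    private final long id;

    Entity(long id) {
        this.id = id;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Entity)) return false;
        return id == ((Entity) o).id;
    }

    @Override
    public int hashCode() {
        // Java 8+: documented as (int) (id ^ (id >>> 32)).
        return Long.hashCode(id);
    }

    // The same fold written out by hand; works before Java 8 and never boxes.
    static int foldManually(long value) {
        return (int) (value ^ (value >>> 32));
    }
}
```

Copies with the same id compare equal and hash alike, which is what the question asks for.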

Answer by Grodriguez

You have understood the purpose of hashCode correctly. Yes, a uniform distribution is desirable (although not an actual requirement).

I would suggest ((id >> 32) ^ id).

The above expression:

  • Uses all bits of the original value, does not discard any information upfront. For example, depending on how you are generating the IDs, the upper bits could change more frequently (or the opposite).
  • Does not introduce any bias towards values with more ones (zeros), as would be the case if the two halves were combined with an OR (AND) operation.
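
A small sketch of the points above. The class and method names are hypothetical. Note that the expression has type long, so hashCode() needs an explicit (int) cast, and that after the cast the signed shift (>>) and the unsigned shift (>>>) give the same result, because only the low 32 bits survive.

```java
final class FoldVariants {
    // Signed shift, as written in the answer, plus the cast hashCode() requires.
    static int foldSigned(long id) {
        return (int) ((id >> 32) ^ id);
    }

    // Unsigned shift, the form documented for Long.hashCode.
    static int foldUnsigned(long id) {
        return (int) (id ^ (id >>> 32));
    }

    // OR-combination, shown only to illustrate the bias towards one bits.
    static int foldWithOr(long id) {
        return (int) ((id >>> 32) | id);
    }

    public static void main(String[] args) {
        long[] samples = {0L, 1L, -1L, Long.MIN_VALUE, 0x0123456789ABCDEFL};
        for (long id : samples) {
            System.out.printf("%016x xor=%08x or=%08x%n",
                    id, foldUnsigned(id), foldWithOr(id));
        }
    }
}
```

The OR variant can only keep or add one bits relative to the XOR variant, which is the bias the second bullet warns about.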

Answer by codymanix

int result = (int)((longVal >> 32) ^ longVal);

This will be better distributed, because modulo will not return a different value if only the upper bits of your long value have changed.
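
A quick sketch of the difference, with hypothetical names. Here truncate keeps only the low 32 bits, which is roughly what id % 2^32 is aiming for, and fold is the XOR expression above.

```java
final class ModuloVersusFold {
    // Keeps only the low 32 bits; the high 32 bits are ignored entirely.
    static int truncate(long id) {
        return (int) id;
    }

    // Folds the high 32 bits into the low 32 bits.
    static int fold(long id) {
        return (int) ((id >> 32) ^ id);
    }

    public static void main(String[] args) {
        long a = 5L;
        long b = (1L << 32) | 5L; // same low bits as a, different high bits
        System.out.println(truncate(a) == truncate(b)); // the high bits are lost
        System.out.println(fold(a) == fold(b));         // the high bits still matter
    }
}
```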

Answer by ColinD

It's a bit of a minor thing if you're not using Guava already, but Guava can do this for you nicely:

public int hashCode() {
  return Longs.hashCode(id);
}

That gives you the equivalent of Long.valueOf(id).hashCode():

return (int) (value ^ (value >>> 32));

Additionally, if you were to have other values or objects that were part of the hashcode, you could just write

return Objects.hashCode(longValue, somethingElse, ...);

The long would be autoboxed into a Long, so you'd get the correct hashcode for it as part of the overall hashcode.
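
If Guava is not on the classpath, the JDK's own java.util.Objects.hash (Java 7+) behaves similarly for this purpose. The sketch below swaps it in for Guava's Objects.hashCode; the Account class and its region field are invented for the example.

```java
import java.util.Objects;

// Hypothetical class with a long id plus one more field taking part in the hash.
final class Account {
    private final long id;
    private final String region;

    Account(long id, String region) {
        this.id = id;
        this.region = region;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Account)) return false;
        Account other = (Account) o;
        return id == other.id && Objects.equals(region, other.region);
    }

    @Override
    public int hashCode() {
        // The long is autoboxed to Long, so Long's hashCode feeds the combined hash.
        return Objects.hash(id, region);
    }
}
```

The varargs call allocates an array and boxes the long, so for a single long field Long.hashCode(id) is the lighter choice.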

Answer by Mark Peters

(l >> 32) ^ l is a good hashcode in most cases, particularly when the long has a uniform distribution.

Since it was the accepted answer, I'm posting this to clarify some of my comments about when it's NOT a good hashcode for a long.

The example I gave was a Point class like this:

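public class Point {
    private final long coords; // x in the high 32 bits, y in the low 32 bits

    // Constructor added so the class compiles; the packing follows the comment above.
    public Point(int x, int y) {
        this.coords = ((long) x << 32) | (y & 0xFFFFFFFFL);
    }

    public int getX() {
        return (int) (coords >> 32);
    }

    public int getY() {
        return (int) coords;
    }

    @Override
    public int hashCode() {
        return (int) ((coords >> 32) ^ coords);
    }
}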

It may seem contrived, but occasionally you have multiple "fields" packed into a long.

So the coords field represents 32 bits of x and 32 bits of y. So why is this a problem? Well, it isn't if x and y are each evenly distributed over their respective 32 bits. But that's unlikely in practice. More likely, x and y are bounded by some number, say 1024, since it's 2^10. This means that at most the lower 10 bits of each of x and y are set:

00000000 00000000 000000XX XXXXXXXX 00000000 00000000 000000YY YYYYYYYY

There are 2^20 (1024*1024) possible combinations. But what is hashCode actually doing?

  00000000 00000000 000000XX XXXXXXXX 
^ 00000000 00000000 000000YY YYYYYYYY
-------------------------------------
= 00000000 00000000 000000?? ????????

There are at most 2^10 (1024) possible hashCode values, since only the lower 10 bits can ever be anything other than zero. The ratio of hash values to real values is 1024:(1024*1024), or 1:1024. So right off the bat there is a 1/1024 probability that two numbers have the same hash.
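
The count is easy to check by brute force. The sketch below (class and method names are hypothetical) packs every (x, y) with 0 <= x, y < 1024 the same way as the Point class and counts the distinct hash values.

```java
import java.util.BitSet;

final class PackedPointCollisions {
    // Packs x into the high 32 bits and y into the low 32 bits, as in Point.
    static long pack(int x, int y) {
        return ((long) x << 32) | (y & 0xFFFFFFFFL);
    }

    static int hash(long coords) {
        return (int) ((coords >> 32) ^ coords);
    }

    // Counts distinct hash values over all points with 0 <= x, y < bound.
    // Assumes non-negative coordinates, so every hash is a valid BitSet index.
    static int distinctHashes(int bound) {
        BitSet seen = new BitSet();
        for (int x = 0; x < bound; x++) {
            for (int y = 0; y < bound; y++) {
                seen.set(hash(pack(x, y)));
            }
        }
        return seen.cardinality();
    }

    public static void main(String[] args) {
        System.out.println(distinctHashes(1024) + " distinct hashes for "
                + (1024 * 1024) + " points");
    }
}
```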

Now let's calculate the probability of a collision by applying math from the birthday problem. Let p(n) be the probability that with n values there will be at least one collision. We know that p(1025+) = 1 since there are only 1024 values.

p(n) = 1 - (n! * (1024 choose n))/1024^n

This works out to the following:

n: p(n)
1: 0.00000
2: 0.00098
3: 0.00293
4: 0.00585
5: 0.00973
6: 0.01457
...
38: 0.50096
...
79: 0.95444
...
148: 0.99999
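
For anyone who wants to reproduce the table, here is a minimal sketch. It uses the fact that n! * (1024 choose n) / 1024^n equals the product of (1024 - k) / 1024 for k = 0 .. n-1, which avoids huge factorials. The class and method names are made up.

```java
final class BirthdayBound {
    // p(n) = 1 - product over k = 0 .. n-1 of (buckets - k) / buckets
    static double collisionProbability(int n, int buckets) {
        if (n > buckets) {
            return 1.0; // pigeonhole: more items than hash values
        }
        double noCollision = 1.0;
        for (int k = 0; k < n; k++) {
            noCollision *= (double) (buckets - k) / buckets;
        }
        return 1.0 - noCollision;
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 2, 3, 4, 5, 6, 38, 79, 148}) {
            System.out.printf("%d: %.5f%n", n, collisionProbability(n, 1024));
        }
    }
}
```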

With just 38 items, a collision is more likely than not. With 148 items, there is a 99.999% chance of at least one collision. With 148 items, each item has a 7% chance of colliding with another item. With a proper hash function that takes the domain into account, these numbers could easily go down to 0.

In other words, knowing your domain and how your data behaves in practice is key to making a performant hash. Library functions try to do as good a job as possible while knowing nothing about your domain, and to be performant they typically rely on a distribution of data that won't occur in practice.
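
As a sketch of what a domain-aware hash could look like for the Point example, assume both coordinates really are bounded to 0..1023. That bound, the class name BoundedPoint and the constants are assumptions for illustration, not part of the original class. The 20 significant bits then fit into an int without any collisions.

```java
final class BoundedPoint {
    private static final int BITS = 10;              // assumed bound: 0 <= x, y < 1024
    private static final int MASK = (1 << BITS) - 1;

    private final long coords; // x in the high 32 bits, y in the low 32 bits

    BoundedPoint(int x, int y) {
        if ((x & ~MASK) != 0 || (y & ~MASK) != 0) {
            throw new IllegalArgumentException("coordinate out of range");
        }
        this.coords = ((long) x << 32) | y;
    }

    int getX() {
        return (int) (coords >> 32);
    }

    int getY() {
        return (int) coords;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof BoundedPoint && coords == ((BoundedPoint) o).coords;
    }

    @Override
    public int hashCode() {
        // Places x next to y instead of on top of it, so distinct points never collide.
        return (getX() << BITS) | getY();
    }
}
```

A hash table still reduces this value to a bucket index, so bucket collisions can remain, but equal hash codes for different points no longer occur.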

Answer by Nathan

Java 8 adds Long.hashCode(long) to the JDK.

The following code could yield higher performance. It reduces the calculation to 32-bit int arithmetic instead of computing with 64-bit long, which can make a difference on 32-bit and smaller architectures. A 32-bit process on an x86 machine could optimize this into a single instruction that simply XORs two registers.

return (int)(value ^ (value >>> 32));

As noted in other answers, this does not have a good avalanche effect and hence could lead to collisions. One could use a cryptographic hash function to ensure a strong avalanche effect. However, other algorithms such as MurmurHash have a very good avalanche effect without consuming as much CPU time.
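
For illustration, here is a sketch of a stronger mixing step applied before folding to an int. The shifts and constants are the 64-bit finalizer of MurmurHash3 as I remember it, so treat them as an assumption and check them against the reference implementation before relying on them. The class and method names are hypothetical.

```java
final class MixedLongHash {
    // 64-bit finalizer in the style of MurmurHash3's fmix64 (constants from memory).
    static long mix64(long h) {
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }

    // Mixes first so every input bit can influence the result, then folds to 32 bits.
    static int hash(long value) {
        long mixed = mix64(value);
        return (int) (mixed ^ (mixed >>> 32));
    }
}
```

This costs two multiplications per call, so it is worth it mainly when the ids are known to be poorly distributed, as in the packed Point example.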
