什么是 Java 中用于文本字符串的良好 64 位哈希函数?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/1660501/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
What is a good 64bit hash function in Java for textual strings?
提问by ripper234
I'm looking for a hash function that:
我正在寻找一个哈希函数:
- Hashes textual stringswell (e.g. few collisions)
- Is written in Java, and widely used
- Bonus: works on several fields (instead of me concatenating them and applying the hash on the concatenated string)
- Bonus: Has a 128-bit variant.
- Bonus: Not CPU intensive.
- 很好地散列文本字符串(例如很少的冲突)
- 用Java编写,并被广泛使用
- 奖励:适用于多个字段(而不是我将它们连接起来并在连接的字符串上应用哈希)
- 奖励:具有 128 位变体。
- 奖励:不是 CPU 密集型。
采纳答案by sfussenegger
Why don't you use a long
variant of the default String.hashCode()
(where some really smart guys certainly put effort into making it efficient - not mentioning the thousands of developer eyes that already looked at this code)?
为什么不使用long
默认的变体String.hashCode()
(一些非常聪明的人肯定会努力使其高效 - 更不用说已经看过此代码的数千名开发人员的眼睛)?
// adapted from String.hashCode()
public static long hash(String string) {
long h = 1125899906842597L; // prime
int len = string.length();
for (int i = 0; i < len; i++) {
h = 31*h + string.charAt(i);
}
return h;
}
If you're looking for even more bits, you could probably use a Edit: BigInteger
如果您正在寻找更多位,您可能可以使用编辑:BigInteger
As I mentioned in a comment to the answer of @brianegge, there are not much usecases for hashes with more than 32 bits and most likely not a single one for hashes with more than 64 bits:
正如我在对@brianegge 的回答的评论中提到的,超过 32 位的散列没有太多用例,而且很可能没有一个用于超过 64 位的散列:
I could imagine a huge hashtable distributed across dozens of servers, maybe storing tens of billions of mappings. For such a scenario, @brianegge still has a valid point here: 32 bit allow for 2^32 (ca. 4.3 billion) different hash keys. Assuming a strong algorithm, you should still have quite few collisions. With 64 bit (18,446,744,073 billion different keys) your certainly save, regardless of whatever crazy scenario you need it for. Thinking of usecases for 128 bit keys (340,282,366,920,938,463,463,374,607,431 billion possible keys) is pretty much impossible though.
我可以想象一个巨大的哈希表分布在数十台服务器上,可能存储数百亿个映射。对于这种情况,@brianegge 在这里仍然有一个有效的观点:32 位允许 2^32(约 43 亿)个不同的哈希键。假设一个强大的算法,你应该仍然有相当多的碰撞。使用 64 位(18,446,744,073 亿个不同的密钥)当然可以节省您的费用,无论您需要什么疯狂的场景。考虑 128 位密钥(340,282,366,920,938,463,463,374,607,4310 亿可能的密钥)的用例几乎是不可能的。
To combine the hash for several fields, simply do an XORmultiply one with a prime and add them:
要结合哈希几场,只是做一个异或乘一个与原并将它们添加:
long hash = MyHash.hash(string1) * 31 + MyHash.hash(string2);
The small prime is in there to avoid equal hash code for switched values, i.e. {'foo','bar'} and {'bar','foo'} aren't equal and should have a different hash code. XOR is bad as it returns 0 if both values are equal. Therefore, {'foo','foo'} and {'bar','bar'} would have the same hash code.
小素数在那里是为了避免切换值的哈希码相等,即 {'foo','bar'} 和 {'bar','foo'} 不相等,应该有不同的哈希码。XOR 很糟糕,因为如果两个值相等,它会返回 0。因此,{'foo','foo'} 和 {'bar','bar'} 将具有相同的哈希码。
回答by Adamski
DISCLAIMER: This solution is applicable if you wish to efficiently hash individual natural language words. It is inefficient for hashing longer text, or text containing non-alphabetic characters.
免责声明:如果您希望有效地散列单个自然语言单词,则此解决方案适用。散列较长的文本或包含非字母字符的文本效率低下。
I'm not aware of a function but here's an idea that might help:
我不知道一个函数,但这里有一个可能有帮助的想法:
- Dedicate 52 of the 64 bits to representing which letters are present in the String. For example, if 'a' were present you'd set bit[0], for 'b' set bit 1, for 'A' set bit[26]. That way, only text containing exactly the same set of letters would have the same "signature".
- 将 64 位中的 52 位用于表示字符串中存在哪些字母。例如,如果存在 'a',则设置位 [0],为 'b' 设置位1,为 'A' 设置位 [26]。这样,只有包含完全相同字母集的文本才会具有相同的“签名”。
You could then use the remaining 12 bits to encode the string length (or a modulo value of it) to further reduce collisions, or generate a 12 bit hashCode using a traditional hashing function.
然后,您可以使用剩余的 12 位对字符串长度(或其模值)进行编码以进一步减少冲突,或使用传统的散列函数生成 12 位 hashCode。
Assuming your input is text-only I can imagine this would result in very few collisions and would be inexpensive to compute (O(n)). Unlike other solutions so far this approach takes the problem domain into account to reduce collisions- It is based off the Anagram Detector described in Programming Pearls (see here).
假设您的输入是纯文本,我可以想象这会导致很少的冲突并且计算成本(O(n))。 与迄今为止的其他解决方案不同,这种方法考虑了问题域以减少冲突- 它基于 Programming Pearls 中描述的 Anagram Detector(请参阅此处)。
回答by Aaron Digulla
Create an SHA-1 hashand then mask out the lowest 64bits.
创建一个 SHA-1 哈希,然后屏蔽掉最低的 64 位。
回答by St.Shadow
Do you look at Apache commons lang?
But for 64 bit (and 128) you need some tricks: the rules laid out in the book Effective Java by Joshua Bloch help you create 64 bit hash easy (just use long instead of int). For 128 bit you need additional hacks...
但是对于 64 位(和 128 位),您需要一些技巧:Joshua Bloch 在 Effective Java 一书中列出的规则可帮助您轻松创建 64 位哈希(只需使用 long 而不是 int)。对于 128 位,您需要额外的技巧...
回答by brianegge
long hash = string.hashCode();
Yes, the top 32 bits will be 0, but you will probably run out of hardware resources before you run into problems with hash collisions. The hashCode in String is quite efficient and well tested.
是的,前 32 位将为 0,但在遇到散列冲突问题之前,您可能会耗尽硬件资源。String 中的 hashCode 非常有效且经过良好测试。
UpdateI think the above satisfies the simplest thing which could possibly work, however, I agree with @sfussenegger idea of extending the existing String hashCode.
更新我认为以上满足了可能工作的最简单的事情,但是,我同意@sfussenegger 扩展现有 String hashCode 的想法。
In addition to having a good hashCode for your String, you may want to consider rehashing the hash code in your implementation. If your storage is used by other developers, or used with other types, this can help distributed your keys. For example, Java's HashMap is based on power-of-two length hash tables, so it adds this function to ensure the lower bits are sufficiently distributed.
除了为您的 String 设置一个好的 hashCode 之外,您可能还需要考虑在您的实现中重新散列哈希码。如果您的存储被其他开发人员使用,或与其他类型一起使用,这有助于分发您的密钥。比如Java的HashMap是基于长度为2的幂的哈希表,所以增加了这个函数来保证低位的充分分布。
h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);
回答by Peter Tillemans
Why not use a CRC64 polynomial. These are reasonably efficient and optimized to make sure all bits are counted and spread over the result space.
为什么不使用 CRC64 多项式。这些是相当有效和优化的,以确保所有位都被计数并分布在结果空间中。
There are plenty of implementations available on the net if you google "CRC64 Java"
如果你谷歌“CRC64 Java”,网上有很多可用的实现
回答by jasonmp85
Do something like this:
做这样的事情:
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
public class Test {
public static void main(String[] args) throws NoSuchAlgorithmException,
IOException {
ByteArrayOutputStream baos = new ByteArrayOutputStream();
DataOutputStream dos = new DataOutputStream(baos);
try {
MessageDigest md = MessageDigest.getInstance("MD5");
SomeObject testObject = new SomeObject();
dos.writeInt(testObject.count);
dos.writeLong(testObject.product);
dos.writeDouble(testObject.stdDev);
dos.writeUTF(testObject.name);
dos.writeChar(testObject.delimiter);
dos.flush();
byte[] hashBytes = md.digest(baos.toByteArray());
BigInteger testObjectHash = new BigInteger(hashBytes);
System.out.println("Hash " + testObjectHash);
} finally {
dos.close();
}
}
private static class SomeObject {
private int count = 200;
private long product = 1235134123l;
private double stdDev = 12343521.456d;
private String name = "Test Name";
private char delimiter = '\n';
}
}
DataOutputStreamlets you write primitives and Strings and have them output as bytes. Wrapping a ByteArrayOutputStreamin it will let you write to a byte array, which integrates nicely with MessageDigest. You can pick from any algorithm listed here.
DataOutputStream允许您编写原语和字符串并将它们作为字节输出。在其中包装一个ByteArrayOutputStream将允许您写入一个字节数组,该数组与MessageDigest很好地集成。您可以从此处列出的任何算法中进行选择。
Finally BigIntegerwill let you turn the output bytes into an easier-to-use number. The MD5 and SHA1 algorithms both produce 128-bit hashes, so if you need 64 you can just truncate.
最后BigInteger会让你把输出字节变成一个更容易使用的数字。MD5 和 SHA1 算法都产生 128 位哈希值,因此如果您需要 64 位哈希值,则只需截断即可。
SHA1 should hash almost anything well, and with infrequent collisions (it's 128-bit). This works from Java, but I'm not sure how it's implemented. It may actually be fairly fast. It works on several fields in my implementation: just push them all onto the DataOutputStream
and you're good to go. You could even do it with reflection and annotations (maybe @HashComponent(order=1)
to show which fields go into a hash and in what order). It's got a 128-bit variant and I think you'll find it doesn't use as much CPU as you think it will.
SHA1 几乎可以很好地散列任何东西,并且很少发生冲突(它是 128 位)。这适用于 Java,但我不确定它是如何实现的。它实际上可能相当快。它适用于我的实现中的几个领域:只需将它们全部推到 上DataOutputStream
就可以了。你甚至可以用反射和注释来做(也许是@HashComponent(order=1)
为了显示哪些字段进入散列以及以什么顺序)。它有一个 128 位的变体,我想你会发现它使用的 CPU 没有你想象的那么多。
I've used code like this to get hashes for huge data sets (by now probably billions of objects) to be able to shard them across many backend stores. It should work for whatever you need it for. Note that I think you may want to only call MessageDigest.getInstance()
once and then clone()
from then on: IIRC the cloning is a lot faster.
我已经使用这样的代码来获取庞大数据集(现在可能有数十亿个对象)的哈希值,以便能够将它们分片到许多后端存储中。它应该适用于您需要的任何用途。请注意,我认为您可能只想调用MessageDigest.getInstance()
一次,然后clone()
从那时起:IIRC 克隆要快得多。
回答by user2020240
Reverse the string to get another 32-bit hashcode and then combine the two:
反转字符串以获得另一个 32 位哈希码,然后将两者结合起来:
String s = "astring";
long upper = ( (long) s.hashCode() ) << 32;
long lower = ( (long) s.reverse().hashCode() ) - ( (long) Integer.MIN_VALUE );
long hash64 = upper + lower;
This is pseudocode; the String.reverse()
method doesn't exist and will need to be implemented some other way.
这是伪代码;该String.reverse()
方法不存在,需要以其他方式实现。
回答by Scott Carey
An answer for today (2018). SipHash.
今天的答案(2018)。哈希。
It will be much faster than most of the answers here, and significantly higher quality than all of them.
它将比这里的大多数答案快得多,并且质量明显高于所有答案。
The Guava library has one: https://google.github.io/guava/releases/23.0/api/docs/com/google/common/hash/Hashing.html#sipHash24--
Guava 库有一个:https: //google.github.io/guava/releases/23.0/api/docs/com/google/common/hash/Hashing.html#sipHash24--