Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/2730865/

Date: 2020-09-03 14:13:26  Source: igfitidea

How do I calculate a good hash code for a list of strings?

.net, database-design, hashcode

Asked by Ian Ringrose

Background:

  • I have a short list of strings.
  • The number of strings is not always the same, but it is nearly always of the order of a “handful”.
  • In our database, we will store these strings in a second normalised table.
  • These strings are never changed once they are written to the database.

We wish to be able to match on these strings quickly in a query without the performance hit of doing lots of joins.

So I am thinking of storing a hash code of all these strings in the main table and including it in our index, so the joins are only processed by the database when the hash code matches.

So how do I get a good hashcode? I could:

  • XOR the hash codes of all the strings together
  • XOR, multiplying the result after each string (say by 31)
  • Cat all the strings together then get the hashcode
  • Some other way

So what do people think?


In the end I just concatenated the strings and computed the hash code for the concatenation, as it was simple and worked well enough.

(If you care, we are using .NET and SQL Server.)



Bug! Bug!

Quoting from "Guidelines and rules for GetHashCode" by Eric Lippert:

The documentation for System.String.GetHashCode notes specifically that two identical strings can have different hash codes in different versions of the CLR, and in fact they do. Don't store string hashes in databases and expect them to be the same forever, because they won't be.

So String.GetHashCode() should not be used for this.

Answered by Geoff

Standard Java practice is to simply write:

final int prime = 31;
int result = 1;
for( String s : strings )
{
    result = result * prime + s.hashCode();
}
// result is the hashcode.

Answered by leonbloy

The only inconvenience of your first option is that (String1, String2) produces the same hash code as (String2, String1). If that's not a problem (e.g. because you have a fixed order), it's fine.
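
A small sketch of that order-insensitivity, written in Java to match the other code in this thread. The class and method names are made up for illustration, and Java's String.hashCode() is used only to demonstrate the combining step (for the asker's .NET case, String.GetHashCode() is not stable across CLR versions, as noted in the question):

```java
import java.util.Arrays;
import java.util.List;

final class XorCombineDemo {
    // Option 1 from the question: XOR the per-string hash codes together.
    static int xorHash(List<String> strings) {
        int result = 0;
        for (String s : strings) {
            result ^= s.hashCode();
        }
        return result;
    }

    public static void main(String[] args) {
        // The same strings in a different order give the same value.
        System.out.println(xorHash(Arrays.asList("String1", "String2"))
                == xorHash(Arrays.asList("String2", "String1")));
        // A string that appears twice cancels itself out, so both sides are 0.
        System.out.println(xorHash(Arrays.asList("a", "a"))
                == xorHash(Arrays.asList("b", "b")));
    }
}
```

Both comparisons hold because XOR is commutative and any value XORed with itself is zero, so duplicates are a second weakness besides ordering.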

"Cat all the string together then get the hashcode" seems the more natural and secure to me.

Update: As a comment points out, this has the drawback that the lists ("x", "yz") and ("xy", "z") would give the same hash. To avoid this, you could join the strings with a delimiter that cannot appear inside the strings.
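
To make the boundary problem concrete, here is a minimal Java sketch. The delimiter choice (U+001F) is an assumption about the data, and the length-prefixed variant is an alternative not mentioned in the answer, shown for the case where no character can be ruled out. All names are hypothetical:

```java
import java.util.Arrays;
import java.util.List;

final class JoinForHashing {
    // Assumption: the unit separator U+001F never occurs inside the strings.
    static final String DELIMITER = "\u001F";

    // Plain concatenation: element boundaries are lost.
    static String concat(List<String> strings) {
        return String.join("", strings);
    }

    // Join with a delimiter that is assumed not to appear in the data.
    static String joinWithDelimiter(List<String> strings) {
        return String.join(DELIMITER, strings);
    }

    // Alternative: write each element as "<length>:<text>", so boundaries
    // can be recovered without reserving any character.
    static String lengthPrefixed(List<String> strings) {
        StringBuilder sb = new StringBuilder();
        for (String s : strings) {
            sb.append(s.length()).append(':').append(s);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("x", "yz");
        List<String> b = Arrays.asList("xy", "z");
        System.out.println(concat(a).equals(concat(b)));                       // same text
        System.out.println(joinWithDelimiter(a).equals(joinWithDelimiter(b))); // different text
        System.out.println(lengthPrefixed(a).equals(lengthPrefixed(b)));       // different text
    }
}
```

Whichever joined form is chosen is then what gets fed to the hash function.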

If the strings are big, you might prefer to hash each one, concatenate the hash codes and rehash the result. More CPU, less memory.
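
One possible reading of "hash each one, concatenate the hash codes and rehash", sketched in Java. Arrays.hashCode(byte[]) stands in for the second hash purely for brevity; any hash over the byte buffer would do, and the class name is hypothetical:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;

final class HashOfHashes {
    static int hashOfHashes(List<String> strings) {
        // Write the 4-byte hash code of each string into one buffer...
        ByteBuffer buf = ByteBuffer.allocate(4 * strings.size());
        for (String s : strings) {
            buf.putInt(s.hashCode());
        }
        // ...then hash the buffer contents as a whole.
        return Arrays.hashCode(buf.array());
    }
}
```

Only four bytes per string are kept in memory at the combining step, which is the trade-off the answer describes.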

Answered by Andreas Brinck

I see no reason not to concatenate the strings and compute the hashcode for the concatenation.

As an analogy, say that I wanted to compute an MD5 checksum for a memory block: I wouldn't split the block up into smaller pieces, compute individual MD5 checksums for them and then combine them with some ad hoc method.

Answered by fortran

Another way that pops into my head: chain XORs with hashes rotated according to their index:

int shift = 0;
int result = 1;
for (String s : strings)
{
    // Rotate each hash left by its index (mod 32) before XORing it in.
    result ^= Integer.rotateLeft(s.hashCode(), shift);
    shift = (shift + 1) % 32;
}

Edit: reading the explanation given in Effective Java, I think Geoff's code would be much more efficient.

Answered by spoulson

Using GetHashCode() is not ideal for combining multiple values. The problem is that for strings, the hash code is just a checksum. This leaves little entropy for similar values, e.g. adding the hash codes for ("abc", "bbc") will give the same result as for ("abd", "abc"), causing a collision.

In cases where you need to be absolutely sure, you'd use a real hash algorithm, like SHA1, MD5, etc. The only problem is that they are block functions, which makes it difficult to quickly compare hashes for equality. Instead, try a CRC or FNV1 hash. FNV1 32-bit is super simple:

public static class Fnv1 {
    public const uint OffsetBasis32 = 2166136261;
    public const uint FnvPrime32 = 16777619;

    public static int ComputeHash32(byte[] buffer) {
        uint hash = OffsetBasis32;

        foreach (byte b in buffer) {
            hash *= FnvPrime32;
            hash ^= b;
        }

        return (int)hash;
    }
}
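
For comparison with the other answers, the same FNV-1 loop might be applied to a list of strings in Java roughly as follows. This is a sketch under two assumptions that are not part of the answer: the strings are encoded as UTF-8, and a zero byte is fed in after each string as a separator (so no string may contain U+0000). The names are hypothetical:

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

final class Fnv1Strings {
    static final int OFFSET_BASIS_32 = 0x811C9DC5; // 2166136261 as an unsigned value
    static final int FNV_PRIME_32 = 16777619;
    private static final byte[] SEPARATOR = {0};   // assumption: never occurs in the data

    // Continue an FNV-1 hash over more bytes; int arithmetic wraps mod 2^32.
    static int fnv1(int hash, byte[] bytes) {
        for (byte b : bytes) {
            hash *= FNV_PRIME_32; // FNV-1: multiply first...
            hash ^= (b & 0xff);   // ...then XOR in the next byte
        }
        return hash;
    }

    static int hashStrings(List<String> strings) {
        int hash = OFFSET_BASIS_32;
        for (String s : strings) {
            hash = fnv1(hash, s.getBytes(StandardCharsets.UTF_8));
            hash = fnv1(hash, SEPARATOR); // mark the end of each string
        }
        return hash;
    }
}
```

Because the result depends only on the bytes fed in, it does not change between runtime versions, which is the property the stored hash needs.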

Answered by Philip Kelley

A SQL-based solution could be based on the checksum and checksum_agg functions. If I'm following it right, you have something like:

MyTable
  MyTableId
  HashCode

MyChildTable
  MyTableId  (foreign key into MyTable)
  String

with the various strings for a given item (MyTableId) stored in MyChildTable. To calculate and store a checksum reflecting these (never-to-be-changed) strings, something like this should work:

UPDATE MyTable
 set HashCode = (select checksum_agg(checksum(ct.String))  -- an aggregate cannot appear directly in the SET list
                 from MyChildTable ct
                 where ct.MyTableId = MyTable.MyTableId)
 where MyTableId = @OnlyForThisOne

I believe this is order-independent, so the strings "The", "quick", "brown" would produce the same checksum as "brown", "The", "quick".

Answered by CPerkins

I hope this is unnecessary, but since you don't mention anything which sounds like you're only using the hashcodes for a first check and then later verifying that the strings are actually equal, I feel the need to warn you:

Hashcode equality != value equality

There will be many different sets of strings that yield an identical hash code without being equal.
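
A concrete Java illustration: "Aa" and "BB" are a well-known pair with the same String.hashCode(), so any scheme built on those hash codes collides for them. The helper below (a hypothetical name) sketches the "hash as pre-filter, then verify" pattern the warning implies:

```java
import java.util.Arrays;
import java.util.List;

final class HashThenVerify {
    // Use the stored hash only as a cheap first check, then compare the real values.
    static boolean sameStrings(List<String> a, int hashA, List<String> b, int hashB) {
        return hashA == hashB && a.equals(b);
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("Aa");
        List<String> b = Arrays.asList("BB");
        System.out.println(a.hashCode() == b.hashCode());                   // hashes match
        System.out.println(sameStrings(a, a.hashCode(), b, b.hashCode()));  // values do not
    }
}
```

In the database scenario from the question, the second comparison corresponds to the join that still runs when the hash codes match.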

Answered by Neil Coffey

So I understand, you effectively have some set of strings that you need to identify by hash code, and that set of strings that you need to identify among will never change?

If that's the case, it doesn't particularly matter, so long as the scheme you use gives you unique numbers for the different strings/combinations of strings. I would start by just concatenating the strings and calculating the String.hashCode() and seeing if you end up with unique numbers. If you don't, then you could try:

  • instead of concatenating strings, combine the hash codes of the component strings, and try different multipliers (e.g. if you want to identify combinations of two-string sequences, try HC1 + 17 * HC2; if that doesn't give unique numbers, try HC1 + 31 * HC2, then try 19, then try 37 etc. -- essentially any small-ish odd number will do fine).
  • if you don't get unique numbers in this way -- or if you need to cope with the set of possibilities expanding -- then consider a stronger hash code. A 64-bit hash code is a good compromise between ease of comparison and likelihood of hashes being unique.

A possible scheme for a 64-bit hash code is as follows:

  • generate an array of 256 64-bit random numbers using a fairly strong scheme (you could use SecureRandom, though the XORShift scheme would work fine)
  • pick "m", another "random" 64-bit odd number with more or less half of its bits set
  • to generate a hash code, go through each byte value, b, making up the string, and take the bth number from your array of random numbers; then XOR or add that with the current hash value, multiplied by "m"

So an implementation based on values suggested in Numerical Recipes would be:

  private static final long[] byteTable;
  private static final long HSTART = 0xBB40E64DA205B064L;
  private static final long HMULT = 7664345821815920749L;

  static {
    byteTable = new long[256];
    long h = 0x544B2FBACAAF1684L;
    for (int i = 0; i < 256; i++) {
      for (int j = 0; j < 31; j++) {
        h = (h >>> 7) ^ h;
        h = (h << 11) ^ h;
        h = (h >>> 10) ^ h;
      }
      byteTable[i] = h;
    }
  }

The above is initialising our array of random numbers. We use an XORShift generator, but we could really use any fairly good-quality random number generator (creating a SecureRandom() with a particular seed then calling nextLong() would be fine). Then, to generate a hash code:

  public static long hashCode(String cs) {
    if (cs == null) return 1L;
    long h = HSTART;
    final long hmult = HMULT;
    final long[] ht = byteTable;
    for (int i = cs.length()-1; i >= 0; i--) {
      char ch = cs.charAt(i);
      h = (h * hmult) ^ ht[ch & 0xff];
      h = (h * hmult) ^ ht[(ch >>> 8) & 0xff];
    }
    return h;
  }

A guide to consider is that, given a hash code of n bits, on average you'd expect to have to generate hashes of on the order of 2^(n/2) strings before you get a collision. Put another way, with a 64-bit hash you'd expect a collision after around 4 billion strings (so if you're dealing with up to, say, a couple of million strings, the chances of a collision are pretty negligible).
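
The rule of thumb can be checked with the usual birthday approximation, p ~= 1 - exp(-k^2 / 2^(n+1)), for k random n-bit hashes. The sketch below is only that approximation (it assumes the hashes behave like uniformly random values), with a hypothetical class name:

```java
final class BirthdayBound {
    // Approximate probability of at least one collision among k random n-bit hashes:
    // p ~= 1 - exp(-k^2 / 2^(n+1)).
    static double collisionProbability(double k, int bits) {
        double exponent = -(k * k) / Math.pow(2.0, bits + 1);
        return -Math.expm1(exponent); // 1 - e^x, accurate for tiny probabilities
    }

    public static void main(String[] args) {
        System.out.println(collisionProbability(2_000_000, 64));       // a couple of million strings
        System.out.println(collisionProbability(Math.pow(2, 32), 64)); // around 4 billion strings
    }
}
```

With a 64-bit hash, two million strings give a probability on the order of one in ten million, while 2^32 strings give roughly 39%, which is where collisions start to be expected.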

Another option would be MD5, which is a very strong hash (practically secure), but it is a 128-bit hash, so you have the slight disadvantage of having to deal with 128-bit values. I would say MD5 is overkill for these purposes-- as I say, with a 64-bit hash, you can deal fairly safely with in the order of a few million strings.

(Sorry, I should clarify -- MD5 was designed as a secure hash, it's just that it has since been found not to be secure. A "secure" hash is one where, given a particular hash, it's not feasible to deliberately construct input that would lead to that hash. In some circumstances -- but not, as I understand it, in yours -- you would need this property. You might need it, on the other hand, if the strings you're dealing with are user-input data, i.e. a malicious user could deliberately try to confuse your system.) You might also be interested in the following I've written in the past:

Answered by Eran Betzalel

Answered by Toby

If you happen to use Java, you can create an array of strings (or convert a collection to an array), and then use Arrays.hashCode() as described in its documentation.
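
As a quick check, Arrays.hashCode(Object[]) computes the same 31-based combination as the hand-written loop in Geoff's answer (treating null elements as 0). A minimal sketch, with a hypothetical class name:

```java
import java.util.Arrays;

final class ArraysHashCodeDemo {
    // The 31-based loop written out by hand for comparison.
    static int manualHash(String[] strings) {
        int result = 1;
        for (String s : strings) {
            result = 31 * result + (s == null ? 0 : s.hashCode());
        }
        return result;
    }

    public static void main(String[] args) {
        String[] strings = {"alpha", "beta", "gamma"};
        // Both approaches agree on the same input.
        System.out.println(Arrays.hashCode(strings) == manualHash(strings));
    }
}
```

Note that this relies on Java's String.hashCode(), whose algorithm is fixed by its specification; the same shortcut in .NET would run into the String.GetHashCode() stability problem quoted in the question.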