C#: Getting the hash of a list of strings regardless of order

Notice: this page is adapted from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/670063/

Time: 2020-08-04 12:38:42  Source: igfitidea

Getting hash of a list of strings regardless of order

Tags: c#, .net, vb.net, string, hash

Asked by MaxK

I would like to write a function GetHashCodeOfList() which returns a hash code of a list of strings regardless of order. Two lists containing the same strings should return the same hash code.

ArrayList list1 = new ArrayList();
list1.Add("String1");
list1.Add("String2");
list1.Add("String3");

ArrayList list2 = new ArrayList();
list2.Add("String3");
list2.Add("String2");
list2.Add("String1");

// The two hash codes should be equal, so this should be true.
bool same = GetHashCodeOfList(list1) == GetHashCodeOfList(list2);

I had a few thoughts:

  1. I can first sort the list, then combine the sorted list into 1 long string and then call GetHashCode(). However, sorting is a slow operation.

  2. I can get the hash of each individual string in the list (by calling string.GetHashCode()), then multiply all the hashes and take the result Mod UInt32.MaxValue. For example: "String1".GetHashCode() * "String2".GetHashCode() * … MOD UInt32.MaxValue. But this results in a number overflow.
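
For reference, the first idea could be sketched roughly as follows. This is only an illustration: the class name SortedJoinHash, the use of ordinal ordering and the "\0" separator are assumptions, not part of the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SortedJoinHash
{
    // Hypothetical helper: sorts a copy of the strings, joins them, hashes the result.
    public static int GetHashCodeOfList(IEnumerable<string> source)
    {
        // Ordinal ordering keeps the sort independent of the current culture.
        IEnumerable<string> sorted = source.OrderBy(s => s, StringComparer.Ordinal);

        // "\0" is an assumed separator so that {"ab", "c"} and {"a", "bc"}
        // do not collapse into the same joined string.
        return string.Join("\0", sorted).GetHashCode();
    }
}
```

The original list is not modified, but the sort and the joined string both allocate, which is the cost the question is worried about.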

Does anyone have any thoughts?

Thanks in advance for your help.

Accepted answer by Jon Skeet

There are various approaches here, falling under two main categories, each typically with its own benefits and disadvantages in terms of effectiveness and performance. It is probably best to choose the simplest algorithm that works for your application and only use the more complex variants if your situation requires them.

Note that these examples use EqualityComparer<T>.Default since that will deal with null elements cleanly. You could do better than zero for null if desired. If T is constrained to struct it is also unnecessary. You can hoist the EqualityComparer<T>.Default lookup out of the function if so desired.
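
As a rough sketch of the note above, the comparer can be looked up once before the loop. XOR is used here only as a placeholder combining step, and the class name HoistedComparerExample is made up for illustration.

```csharp
using System.Collections.Generic;

public static class HoistedComparerExample
{
    public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
    {
        // Look the comparer up once instead of on every iteration.
        EqualityComparer<T> comparer = EqualityComparer<T>.Default;

        int hash = 0;
        foreach (T element in source)
        {
            // For a null element the default comparer returns 0 rather than throwing.
            hash ^= comparer.GetHashCode(element);
        }
        return hash;
    }
}
```

Calling element.GetHashCode() directly would instead throw a NullReferenceException on a null element.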

Commutative Operations

If you combine the hash codes of the individual entries using operations that are commutative, then this will lead to the same end result regardless of order.

There are several obvious options on numbers:

XOR

public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
{
    int hash = 0;
    foreach (T element in source)
    {
        hash = hash ^ EqualityComparer<T>.Default.GetHashCode(element);
    }
    return hash;
}

One downside of that is that the hash for { "x", "x" } is the same as the hash for { "y", "y" }. If that's not a problem for your situation though, it's probably the simplest solution.

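To make that downside concrete: any value XORed with itself is 0, so every list consisting of one element repeated twice hashes to 0. A small sketch (the class and method names are hypothetical):

```csharp
using System.Collections.Generic;

public static class XorCollisionDemo
{
    // Same XOR combination as above, under a different (made-up) name.
    public static int XorHash<T>(IEnumerable<T> source)
    {
        int hash = 0;
        foreach (T element in source)
        {
            hash ^= EqualityComparer<T>.Default.GetHashCode(element);
        }
        return hash;
    }
}
```

With this, XorHash(new[] { "x", "x" }) and XorHash(new[] { "y", "y" }) both return 0, whatever the individual string hashes happen to be.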

Addition

public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
{
    int hash = 0;
    foreach (T element in source)
    {
        hash = unchecked (hash + 
            EqualityComparer<T>.Default.GetHashCode(element));
    }
    return hash;
}

Overflow is fine here, hence the explicit unchecked context.
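
A small sketch of why overflow is harmless: in an unchecked context int addition simply wraps around, and wrapping addition is still commutative and associative, so the order of the elements does not affect the total. The class name and the use of raw int values as stand-ins for element hashes are assumptions made for the illustration.

```csharp
using System.Collections.Generic;

public static class UncheckedAdditionDemo
{
    // Sums already-computed hash codes, wrapping around on overflow.
    public static int AddHash(IEnumerable<int> hashes)
    {
        int hash = 0;
        foreach (int h in hashes)
        {
            hash = unchecked(hash + h); // wraps instead of throwing
        }
        return hash;
    }
}
```

For example, AddHash(new[] { int.MaxValue, 1 }) wraps to int.MinValue, whereas the same addition in a checked context would throw an OverflowException.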

There are still some nasty cases (e.g. {1, -1} and {2, -2}, which both sum to zero), but it's more likely to be okay, particularly with strings. In the case of lists that may contain such integers, you could always implement a custom hashing function (perhaps one that takes the index of recurrence of the specific value as a parameter and returns a unique hash code accordingly).

Here is an example of such an algorithm that gets around the aforementioned problem in a fairly efficient manner. It also has the benefit of greatly increasing the distribution of the hash codes generated (see the article linked at the end for some explanation). A mathematical/statistical analysis of exactly how this algorithm produces "better" hash codes would be quite advanced, but testing it across a large range of input values and plotting the results should verify it well enough.

public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
{
    int hash = 0;
    int bitOffset;
    // Stores the number of occurrences so far of each value.
    // Note: Dictionary<T, int> throws on null keys, so this variant
    // does not accept null elements.
    var valueCounts = new Dictionary<T, int>();

    foreach (T element in source)
    {
        int curHash = EqualityComparer<T>.Default.GetHashCode(element);
        // bitOffset becomes the number of earlier occurrences of this
        // value (TryGetValue sets it to 0 when the value is new).
        valueCounts.TryGetValue(element, out bitOffset);
        valueCounts[element] = bitOffset + 1;

        // The current hash code is rotated one bit further left on
        // each successive recurrence of a certain value to widen the
        // distribution. Rotating as uint avoids sign extension on the
        // right shift.
        uint bits = unchecked((uint)curHash);
        uint rotated = (bits << bitOffset) | (bits >> (32 - bitOffset));
        // 37 is an arbitrary low prime number that helps the
        // algorithm to smooth out the distribution.
        hash = unchecked(hash + (int)rotated * 37);
    }

    return hash;
}

Multiplication

Multiplication has few if any benefits over addition: with small numbers and a mix of positive and negative numbers it may lead to a better distribution of hash bits. On the downside, an element whose hash is 1 becomes a useless entry contributing nothing, and any element whose hash is zero makes the whole result zero. You can special-case zero so that it does not cause this major flaw.

public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
{
    int hash = 17;
    foreach (T element in source)
    {
        int h = EqualityComparer<T>.Default.GetHashCode(element);
        if (h != 0)
            hash = unchecked (hash * h);
    }
    return hash;
}

Order first

The other core approach is to enforce some ordering first, then use any hash combination function you like. The ordering itself is immaterial so long as it is consistent.

public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
{
    int hash = 0;
    foreach (T element in source.OrderBy(x => x, Comparer<T>.Default))
    {
        // f is any function/code you like returning int
        hash = f(hash, element);
    }
    return hash;
}

This has some significant benefits in that the combining operations possible in f can have significantly better hashing properties (distribution of bits, for example), but this comes at significantly higher cost. The sort is O(n log n), and the required copy of the collection is a memory allocation you can't avoid given the desire to avoid modifying the original. GetHashCode implementations should normally avoid allocations entirely. One possible implementation of f would be similar to that given in the last example under the Addition section (e.g. any constant number of bit shifts left followed by a multiplication by a prime; you could even use successive primes on each iteration at no extra cost, since they only need be generated once).
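
One possible concrete choice for f, given purely as an illustration: multiply the running hash by a prime and add the element's hash. The helper name Combine, the class name and the constants 17 and 31 are assumptions, not something the answer prescribes.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class OrderFirstExample
{
    // Hypothetical combining step f: order-sensitive on its own,
    // which is fine because the elements are sorted first.
    private static int Combine<T>(int hash, T element)
    {
        return unchecked(hash * 31 + EqualityComparer<T>.Default.GetHashCode(element));
    }

    public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
    {
        int hash = 17;
        // OrderBy sorts a copy, so the caller's collection is left untouched.
        foreach (T element in source.OrderBy(x => x, Comparer<T>.Default))
        {
            hash = Combine(hash, element);
        }
        return hash;
    }
}
```

Because the combining step depends on position, {1, -1} and {2, -2} no longer collide the way they do with plain addition.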

That said, if you were dealing with cases where you could calculate and cache the hash and amortize the cost over many calls to GetHashCode, this approach may yield superior behaviour. The latter approach is also even more flexible, since it can avoid the need to use GetHashCode on the elements if it knows their type and instead use per-byte operations on them to yield even better hash distribution. Such an approach would likely be of use only in cases where the performance was identified as being a significant bottleneck.
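
A minimal sketch of the caching idea, assuming the collection is immutable once wrapped. The type HashedStringBag and its members are hypothetical names invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical immutable wrapper: sorts and hashes once in the constructor,
// so every later GetHashCode call is a plain field read.
public sealed class HashedStringBag
{
    private readonly string[] _sorted;
    private readonly int _hash;

    public HashedStringBag(IEnumerable<string> items)
    {
        _sorted = items.OrderBy(s => s, StringComparer.Ordinal).ToArray();

        int hash = 17;
        foreach (string s in _sorted)
        {
            hash = unchecked(hash * 31 + (s == null ? 0 : s.GetHashCode()));
        }
        _hash = hash;
    }

    public override int GetHashCode()
    {
        return _hash;
    }

    public override bool Equals(object obj)
    {
        var other = obj as HashedStringBag;
        return other != null && _sorted.SequenceEqual(other._sorted, StringComparer.Ordinal);
    }
}
```

The O(n log n) sort and the array copy are paid once per instance, which only makes sense if the hash is requested many times, for example when instances are used as dictionary keys.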

Finally, if you want a reasonably comprehensive and fairly non-mathematical overview of the subject of hash codes and their effectiveness in general, these blog posts would be worthwhile reads, in particular the "Implementing a simple hashing algorithm (pt II)" post.

Answer by Guffa

An alternative to sorting the string lists would be to get the hash codes of the strings and then sort the hash codes. (Comparing ints is less expensive than comparing strings.) You can then use an algorithm to merge the hash codes that (hopefully) gives a better distribution.

Example:

public static int GetHashCodeOfList<T>(IEnumerable<T> list) {
   List<int> codes = new List<int>();
   foreach (T item in list) {
      codes.Add(item.GetHashCode());
   }
   codes.Sort();
   int hash = 0;
   foreach (int code in codes) {
      unchecked {
         hash *= 251; // multiply by a prime number
         hash += code; // add next hash code
      }
   }
   return hash;
}

Answer by dbasnett

    Dim list1 As ArrayList = New ArrayList()
    list1.Add("0")
    list1.Add("String1")
    list1.Add("String2")
    list1.Add("String3")
    list1.Add("abcdefghijklmnopqrstuvwxyz")

    Dim list2 As ArrayList = New ArrayList()
    list2.Add("0")
    list2.Add("String3")
    list2.Add("abcdefghijklmnopqrstuvwxyz")
    list2.Add("String2")
    list2.Add("String1")
    If GetHashCodeOfList(list1) = GetHashCodeOfList(list2) Then
        Stop
    Else
        Stop
    End If
    For x As Integer = list1.Count - 1 To 0 Step -1
        list1.RemoveAt(list1.Count - 1)
        list2.RemoveAt(list2.Count - 1)
        Debug.WriteLine(GetHashCodeOfList(list1).ToString)
        Debug.WriteLine(GetHashCodeOfList(list2).ToString)
        If list1.Count = 2 Then Stop
    Next


Private Function GetHashCodeOfList(ByVal aList As ArrayList) As UInt32
    Const mask As UInt16 = 32767, hashPrime As Integer = Integer.MaxValue
    Dim retval As UInt32
    Dim ch() As Char = New Char() {}
    For idx As Integer = 0 To aList.Count - 1
        ch = DirectCast(aList(idx), String).ToCharArray
        For idCH As Integer = 0 To ch.Length - 1
            retval = (retval And mask) + (Convert.ToUInt16(ch(idCH)) And mask)
        Next
    Next
    If retval > 0 Then retval = Convert.ToUInt32(hashPrime \ retval) 'Else ????
    Return retval
End Function

Answer by Matthew Kane

A lot less code but maybe the performance isn't as good as the other answers:

public static int GetOrderIndependentHashCode<T>(this IEnumerable<T> source)    
    => source == null ? 0 : HashSet<T>.CreateSetComparer().GetHashCode(new HashSet<T>(source));
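
A brief usage sketch of this approach. The wrapper class name SetComparerExample is an assumption, added because an extension method has to live in a static class.

```csharp
using System.Collections.Generic;

public static class SetComparerExample
{
    // Same one-liner as above, wrapped in a static class so that it compiles on its own.
    public static int GetOrderIndependentHashCode<T>(this IEnumerable<T> source)
        => source == null ? 0 : HashSet<T>.CreateSetComparer().GetHashCode(new HashSet<T>(source));
}
```

One thing to keep in mind: because the elements pass through a HashSet<T>, duplicates are dropped, so { "x", "x" } and { "x" } produce the same hash code. That may or may not be what you want for a list.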

Answer by Theodor Zoulias

Here is a hybrid approach. It combines the three commutative operations (XOR, addition and multiplication), applying each one to a different range of bits of the 32-bit number. The bit range of each operation is adjustable.

public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
{
    var comparer = EqualityComparer<T>.Default;
    const int XOR_BITS = 10;
    const int ADD_BITS = 11;
    const int MUL_BITS = 11;
    Debug.Assert(XOR_BITS + ADD_BITS + MUL_BITS == 32);
    int xor_total = 0;
    int add_total = 0;
    int mul_total = 17;
    unchecked
    {
        foreach (T element in source)
        {
            var hashcode = comparer.GetHashCode(element);
            int xor_part = hashcode >> (32 - XOR_BITS);
            int add_part = hashcode << XOR_BITS >> (32 - ADD_BITS);
            int mul_part = hashcode << (32 - MUL_BITS) >> (32 - MUL_BITS);
            xor_total = xor_total ^ xor_part;
            add_total = add_total + add_part;
            if (mul_part != 0) mul_total = mul_total * mul_part;
        }
        xor_total = xor_total % (1 << XOR_BITS); // Compact
        add_total = add_total % (1 << ADD_BITS); // Compact
        mul_total = mul_total - 17; // Subtract initial value
        mul_total = mul_total % (1 << MUL_BITS); // Compact
        int result = (xor_total << (32 - XOR_BITS)) + (add_total << XOR_BITS) + mul_total;
        return result;
    }
}

The performance is almost identical to that of the simple XOR method, because the call to GetHashCode of each element dominates the CPU demand.