How do I generate a hashcode from a byte array in C#?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/16340/

Date: 2020-08-01 08:52:49 · Source: igfitidea

How do I generate a hashcode from a byte array in C#?

Asked by Andrew

Say I have an object that stores a byte array and I want to be able to efficiently generate a hashcode for it. I've used the cryptographic hash functions for this in the past because they are easy to implement, but they do a lot more work than necessary to be cryptographically one-way, and I don't care about that (I'm just using the hashcode as a key into a hashtable).


Here's what I have today:


struct SomeData : IEquatable<SomeData>
{
    private readonly byte[] data;
    public SomeData(byte[] data)
    {
        if (null == data || data.Length == 0)
        {
            throw new ArgumentException("data must not be null or empty", "data");
        }
        this.data = new byte[data.Length];
        Array.Copy(data, this.data, data.Length);
    }

    public override bool Equals(object obj)
    {
        return obj is SomeData && Equals((SomeData)obj);
    }

    public bool Equals(SomeData other)
    {
        if (other.data.Length != data.Length)
        {
            return false;
        }
        for (int i = 0; i < data.Length; ++i)
        {
            if (data[i] != other.data[i])
            {
                return false;
            }
        }
        return true;
    }
    public override int GetHashCode()
    {
        // MD5CryptoServiceProvider is IDisposable, so dispose it after use.
        using (var md5 = new MD5CryptoServiceProvider())
        {
            return BitConverter.ToInt32(md5.ComputeHash(data), 0);
        }
    }
}

Any thoughts?




dp: You are right that I missed a check in Equals; I have updated it. Using the existing hashcode from the byte array will result in reference equality (or at least the same concept translated to hashcodes). For example:


byte[] b1 = new byte[] { 1 };
byte[] b2 = new byte[] { 1 };
int h1 = b1.GetHashCode();
int h2 = b2.GetHashCode();

With that code, despite the two byte arrays having the same values within them, they are referring to different parts of memory and will result in (probably) different hash codes. I need the hash codes for two byte arrays with the same contents to be equal.


Answered by fulmicoton

If you are looking for performance, I tested a few hash keys, and I recommend Bob Jenkins's hash function. It is both crazy fast to compute and will give as few collisions as the cryptographic hash you have used until now.


I don't know C# at all, and I don't know if it can link with C, but here is its implementation in C.


Answered by Keith

The hash code of an object does not need to be unique.


The checking rule is:


  • Are the hash codes equal? Then call the full (slow) Equals method.
  • Are the hash codes not equal? Then the two items are definitely not equal.
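The rule above can be sketched in a short program. This is an illustrative example (the Key struct below is not from the question); it uses a deliberately coarse GetHashCode (the array length) so you can watch when Equals is and isn't invoked:

```csharp
using System;
using System.Collections.Generic;

struct Key : IEquatable<Key>
{
    private readonly byte[] data;
    public Key(byte[] data) { this.data = data; }

    public bool Equals(Key other)
    {
        // Only runs when two keys land in the same hash bucket.
        Console.WriteLine("Equals called");
        if (other.data.Length != data.Length) return false;
        for (int i = 0; i < data.Length; i++)
            if (data[i] != other.data[i]) return false;
        return true;
    }

    public override bool Equals(object obj) => obj is Key && Equals((Key)obj);

    // Deliberately coarse: same length => same hash code.
    public override int GetHashCode() => data.Length;
}

class Demo
{
    static void Main()
    {
        var dict = new Dictionary<Key, string>();
        dict[new Key(new byte[] { 1, 2 })] = "a";

        // Same length => equal hash codes => Equals runs and finds a match.
        Console.WriteLine(dict.ContainsKey(new Key(new byte[] { 1, 2 })));

        // Different length => different hash codes => Equals is never called.
        Console.WriteLine(dict.ContainsKey(new Key(new byte[] { 1, 2, 3 })));
    }
}
```

The first lookup prints "Equals called" then True; the second prints only False, because the hash codes already ruled out a match.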

All you want is a GetHashCode algorithm that splits up your collection into roughly even groups - it shouldn't form the key, as the Hashtable or Dictionary<> will need to use the hash to optimise retrieval.


How long do you expect the data to be? How random? If lengths vary greatly (say, for files) then just return the length. If lengths are likely to be similar, look at a subset of the bytes that varies.


GetHashCode should be a lot quicker than Equals, but doesn't need to be unique.


Two identical things must never have different hash codes. Two different objects should not have the same hash code, but some collisions are to be expected (after all, there are more permutations than possible 32-bit integers).


Answered by denis phillips

Is using the existing hashcode from the byte array field not good enough? Also note that in the Equals method you should check that the arrays are the same size before doing the compare.


Answered by Lee

Generating a good hash is easier said than done. Remember, you're basically representing n bytes of data with m bits of information. The larger your data set and the smaller m is, the more likely you'll get a collision ... two pieces of data resolving to the same hash.


The simplest hash I ever learned was simply XORing all the bytes together. It's easy, faster than most complicated hash algorithms, and a halfway decent general-purpose hash algorithm for small data sets. It's the Bubble Sort of hash algorithms, really. Since the simple implementation would leave you with 8 bits, that's only 256 hashes ... not so hot. You could XOR chunks instead of individual bytes, but then the algorithm gets much more complicated.

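Both variants described above can be sketched in a few lines. This is an illustration (the XorHash names are mine, not from the answer): XOR of individual bytes only ever produces 256 distinct values, while folding 4-byte chunks uses all 32 bits:

```csharp
using System;

static class XorHash
{
    // XOR of individual bytes: only the low 8 bits ever vary (256 possible values).
    public static int XorBytes(byte[] data)
    {
        byte h = 0;
        foreach (byte b in data) h ^= b;
        return h;
    }

    // XOR of 4-byte chunks: uses all 32 bits, at the cost of a little complexity.
    public static int XorChunks(byte[] data)
    {
        int h = 0;
        int i = 0;
        for (; i + 4 <= data.Length; i += 4)
            h ^= BitConverter.ToInt32(data, i);
        for (int shift = 0; i < data.Length; i++, shift += 8)
            h ^= data[i] << shift;   // fold any trailing bytes into the low bits
        return h;
    }

    static void Main()
    {
        byte[] sample = { 1, 2, 3, 4, 5 };
        Console.WriteLine(XorBytes(sample));    // always falls in 0..255
        Console.WriteLine(XorChunks(sample));
    }
}
```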

So certainly, the cryptographic algorithms are maybe doing some stuff you don't need ... but they're also a huge step up in general-purpose hash quality. The MD5 hash you're using has 128 bits, with billions and billions of possible hashes. The only way you're likely to get something better is to take some representative samples of the data you expect to be going through your application and try various algorithms on it to see how many collisions you get.


So until I see some reason to not use a canned hash algorithm (performance, perhaps?), I'm going to have to recommend you stick with what you've got.


Answered by Jon Galloway

Have you compared with the SHA1CryptoServiceProvider.ComputeHash method? It takes a byte array and returns a SHA1 hash, and I believe it's pretty well optimized. I used it in an Identicon Handler that performed pretty well under load.


Answered by jfs

RuntimeHelpers.GetHashCode might help:


From MSDN:

Serves as a hash function for a particular type, suitable for use in hashing algorithms and data structures such as a hash table.

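One caveat worth checking, though: RuntimeHelpers.GetHashCode hashes by reference identity, so two arrays with identical contents still get different codes, which is the opposite of what the question needs. A minimal sketch:

```csharp
using System;
using System.Runtime.CompilerServices;

class Demo
{
    static void Main()
    {
        byte[] b1 = { 1 };
        byte[] b2 = { 1 };

        // Reference-identity hashes: almost certainly different for two
        // distinct array objects, even though their contents are equal.
        Console.WriteLine(RuntimeHelpers.GetHashCode(b1) == RuntimeHelpers.GetHashCode(b2));
    }
}
```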

Answered by Oskar

Whether you want a perfect hash function (a different value for each pair of objects that compare unequal) or just a pretty good one is always a performance tradeoff. It normally takes time to compute a good hash function, and if your dataset is smallish you're better off with a fast function. The most important thing (as your second post points out) is correctness, and to achieve that all you need is to return the Length of the array. Depending on your dataset, that might even be OK. If it isn't (say all your arrays are equally long), you can go with something cheap like looking at the first and last values and XORing them, then add more complexity as you see fit for your data.

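The cheap fallback described above might look like this (CheapHash is an illustrative name, not from the answer). Length alone is always correct, since equal arrays have equal lengths; mixing in the end bytes spreads same-length arrays across more buckets:

```csharp
using System;

class CheapHashSketch
{
    // Length plus the first and last bytes, shifted so they occupy
    // different bit positions and don't cancel each other out.
    static int CheapHash(byte[] data)
    {
        if (data.Length == 0) return 0;
        return data.Length ^ (data[0] << 8) ^ (data[data.Length - 1] << 16);
    }

    static void Main()
    {
        Console.WriteLine(CheapHash(new byte[] { 10, 20, 30 }));
        Console.WriteLine(CheapHash(new byte[] { 11, 20, 31 }));
    }
}
```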

A quick way to see how your hash function performs on your data is to add all the data to a hashtable and count the number of times the Equals function gets called; if it is called too often, you have more work to do on the function. If you do this, keep in mind that the hashtable's size needs to be set bigger than your dataset when you start; otherwise you are going to rehash the data, which will trigger reinserts and more Equals evaluations (though possibly more realistic?)

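One way to instrument this, sketched below with illustrative names (ByteKey and EqualsCalls are mine, not from the answer): count Equals calls while filling a pre-sized Dictionary, using a deliberately weak hash so the collisions are visible.

```csharp
using System;
using System.Collections.Generic;

struct ByteKey : IEquatable<ByteKey>
{
    public static int EqualsCalls;   // incremented on every content comparison
    private readonly byte[] data;
    public ByteKey(byte[] data) { this.data = data; }

    public bool Equals(ByteKey other)
    {
        EqualsCalls++;
        if (other.data.Length != data.Length) return false;
        for (int i = 0; i < data.Length; i++)
            if (data[i] != other.data[i]) return false;
        return true;
    }

    public override bool Equals(object obj) => obj is ByteKey && Equals((ByteKey)obj);

    // Deliberately weak (length only), to make collisions frequent.
    public override int GetHashCode() => data.Length;
}

class Program
{
    static void Main()
    {
        var rand = new Random(42);
        // Pre-size the dictionary so rehashing doesn't inflate the count.
        var dict = new Dictionary<ByteKey, bool>(20000);
        for (int i = 0; i < 10000; i++)
        {
            var bytes = new byte[rand.Next(1, 64)];
            rand.NextBytes(bytes);
            dict[new ByteKey(bytes)] = true;
        }
        // A high number here means the hash function needs more work.
        Console.WriteLine("Equals calls: " + ByteKey.EqualsCalls);
    }
}
```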

For some objects (not this one) a quick HashCode can be generated by ToString().GetHashCode(). It's certainly not optimal, but useful, as people tend to return something close to the identity of the object from ToString(), and that is exactly what GetHashCode is looking for.


Trivia: the worst performance I have ever seen was when someone by mistake returned a constant from GetHashCode. It's easy to spot with a debugger, though, especially if you do lots of lookups in your hashtable.


Answered by Oskar

Borrowing from the code generated by JetBrains software, I have settled on this function:


    public override int GetHashCode()
    {
        unchecked
        {
            var result = 0;
            foreach (byte b in _key)
                result = (result*31) ^ b;
            return result;
        }
    }

The problem with just XOring the bytes is that 3/4 (3 bytes) of the returned value has only 2 possible values (all on or all off). This spreads the bits around a little more.


Setting a breakpoint in Equals was a good suggestion. Adding about 200,000 entries of my data to a Dictionary sees about 10 Equals calls (or 1 in 20,000).


Answered by Oskar

Don't use cryptographic hashes for a hashtable, that's ridiculous/overkill.


Here ya go... Modified FNV Hash in C#


http://bretm.home.comcast.net/hash/6.html


    public static int ComputeHash(params byte[] data)
    {
        unchecked
        {
            const int p = 16777619;
            int hash = (int)2166136261;

            for (int i = 0; i < data.Length; i++)
                hash = (hash ^ data[i]) * p;

            hash += hash << 13;
            hash ^= hash >> 7;
            hash += hash << 3;
            hash ^= hash >> 17;
            hash += hash << 5;
            return hash;
        }
    }

Answered by Tono Nam

I found interesting results:


I have the class:


public class MyHash : IEquatable<MyHash>
{        
    public byte[] Val { get; private set; }

    public MyHash(byte[] val)
    {
        Val = val;
    }

    /// <summary>
    /// Test if this Class is equal to another class
    /// </summary>
    /// <param name="other"></param>
    /// <returns></returns>
    public bool Equals(MyHash other)
    {
        if (other.Val.Length == this.Val.Length)
        {
            for (var i = 0; i < this.Val.Length; i++)
            {
                if (other.Val[i] != this.Val[i])
                {
                    return false;
                }
            }

            return true;
        }
        else
        {
            return false;
        }            
    }

    public override int GetHashCode()
    {            
        var str = Convert.ToBase64String(Val);
        return str.GetHashCode();          
    }
}

Then I created a dictionary with keys of type MyHash in order to test how fast I can insert, and also to find out how many collisions there are. I did the following:


// dictionary we use to check for collisions
Dictionary<MyHash, bool> checkForDuplicatesDic = new Dictionary<MyHash, bool>();

// used to generate random arrays
Random rand = new Random();

var now = DateTime.Now;

for (var j = 0; j < 100; j++)
{
    for (var i = 0; i < 5000; i++)
    {
        // create new array and populate it with random bytes
        byte[] randBytes = new byte[byte.MaxValue];
        rand.NextBytes(randBytes);

        MyHash h = new MyHash(randBytes);

        if (checkForDuplicatesDic.ContainsKey(h))
        {
            Console.WriteLine("Duplicate");
        }
        else
        {
            checkForDuplicatesDic[h] = true;
        }
    }
    Console.WriteLine(j);
    checkForDuplicatesDic.Clear(); // clear dictionary every 5000 iterations
}

var elapsed = DateTime.Now - now;

Console.Read();

Every time I insert a new item into the dictionary, the dictionary calculates the hash of that object. So you can tell which method is most efficient by placing each of the answers found here into the method public override int GetHashCode(). The method that was by far the fastest and had the fewest collisions was:


    public override int GetHashCode()
    {            
        var str = Convert.ToBase64String(Val);
        return str.GetHashCode();          
    }

That took 2 seconds to execute. The method


    public override int GetHashCode()
    {
        // 7.1 seconds
        unchecked
        {
            const int p = 16777619;
            int hash = (int)2166136261;

            for (int i = 0; i < Val.Length; i++)
                hash = (hash ^ Val[i]) * p;

            hash += hash << 13;
            hash ^= hash >> 7;
            hash += hash << 3;
            hash ^= hash >> 17;
            hash += hash << 5;
            return hash;
        }
    }

also had no collisions, but it took 7 seconds to execute!
