Faster MD5 alternative in C#?

Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/288732/

Tags: c#, md5, hash

Asked by Paul Beesley

I'm working on a program that searches entire drives for a given file. At the moment, I calculate an MD5 hash for the known file and then scan all files recursively, looking for a match.

The only problem is that MD5 is painfully slow on large files. Is there a faster alternative that I can use while retaining a very small probability of false positives?

All code is in C#.

Thank you.

Update

I've read that even MD5 can be pretty quick and that disk I/O should be the limiting factor. That leads me to believe that my code might not be optimal. Are there any problems with this approach?

        // Compute the MD5 of a file and return it as an uppercase hex string.
        using (MD5 md5 = MD5.Create())
        {
            StringBuilder sb = new StringBuilder();
            try
            {
                using (FileStream fs = File.Open(fileName, FileMode.Open, FileAccess.Read))
                {
                    // Format each hash byte as two hex digits.
                    foreach (byte b in md5.ComputeHash(fs))
                        sb.Append(b.ToString("X2"));
                }
                return sb.ToString();
            }
            catch (Exception)
            {
                // Unreadable files (locked, access denied, ...) yield an empty string.
                return "";
            }
        }

Accepted answer by Michael Burr

I hope you're checking for an MD5 match only if the file size already matches.

Another optimization is to do a quick checksum of the first 1K (or some other arbitrary, but reasonably small, number of bytes) and make sure those match before working through the whole file.

Of course, all this assumes that you're just looking for a match/nomatch decision for a particular file.

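A minimal sketch of those two cheap pre-checks combined (the size test, then a hash of just the first 1K); the class and method names here are illustrative, not part of the original answer:

        using System.IO;
        using System.Linq;
        using System.Security.Cryptography;

        static class QuickFilter
        {
            // Hash only the first kilobyte of a file - a cheap pre-check.
            static byte[] HashPrefix(string path, int prefixLength = 1024)
            {
                using (var md5 = MD5.Create())
                using (var fs = File.OpenRead(path))
                {
                    var buffer = new byte[prefixLength];
                    int read = fs.Read(buffer, 0, buffer.Length);
                    return md5.ComputeHash(buffer, 0, read);
                }
            }

            // Returns false as soon as a cheap check rules the candidate out;
            // only candidates that pass both checks deserve a full-file hash.
            public static bool WorthFullHash(string knownFile, string candidate)
            {
                if (new FileInfo(knownFile).Length != new FileInfo(candidate).Length)
                    return false; // different sizes can never match
                return HashPrefix(knownFile).SequenceEqual(HashPrefix(candidate));
            }
        }

In a real scan you would, of course, compute the known file's length and prefix hash once, outside the per-file loop.
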
Answer by jalf

Just read the files linearly? It seems pretty pointless to read the entire file, compute an MD5 hash, and then compare the hashes.

Reading the file sequentially, a few bytes at a time, would allow you to discard the vast majority of files after reading, say, 4 bytes. And you'd save all the processing overhead of computing a hash function, which doesn't buy you anything in this case.

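A minimal sketch of that approach for a pairwise check: read both files chunk by chunk and bail out at the first mismatch (the names and chunk size are illustrative):

        using System.IO;

        static class ByteCompare
        {
            public static bool StreamsEqual(string pathA, string pathB)
            {
                const int chunkSize = 4096;
                using (var a = File.OpenRead(pathA))
                using (var b = File.OpenRead(pathB))
                {
                    if (a.Length != b.Length)
                        return false; // different lengths can never be equal

                    var bufA = new byte[chunkSize];
                    var bufB = new byte[chunkSize];
                    int readA;
                    while ((readA = a.Read(bufA, 0, chunkSize)) > 0)
                    {
                        // Fill the second buffer with exactly as many bytes.
                        int readB = 0;
                        while (readB < readA)
                        {
                            int n = b.Read(bufB, readB, readA - readB);
                            if (n == 0) return false; // unexpected end of stream
                            readB += n;
                        }
                        for (int i = 0; i < readA; i++)
                            if (bufA[i] != bufB[i])
                                return false; // stop at the first differing byte
                    }
                    return true;
                }
            }
        }
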
If you already had the hashes for all the files in the drive, it'd make sense to compare them, but if you have to compute them on the fly, there just doesn't seem to be any advantage to the hashing.

Am I missing something here? What does hashing buy you in this case?

Answer by Adam Byrtek

First consider what your bottleneck really is: the hash function itself, or the disk access speed? If you are disk-bound, changing the hashing algorithm won't gain you much. From your description I infer that you are always scanning the whole disk to find a match - consider building an index first and then matching a given hash against that index; this will be much faster.

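A minimal sketch of such an index, assuming a one-off pass over the drive that maps each MD5 (as a hex string) to the files that produced it; later lookups are then dictionary probes rather than full rescans:

        using System;
        using System.Collections.Generic;
        using System.IO;
        using System.Security.Cryptography;

        static class HashIndex
        {
            public static Dictionary<string, List<string>> Build(string root)
            {
                var index = new Dictionary<string, List<string>>();
                using (var md5 = MD5.Create())
                {
                    var files = Directory.EnumerateFiles(
                        root, "*", SearchOption.AllDirectories);
                    foreach (var path in files)
                    {
                        try
                        {
                            using (var fs = File.OpenRead(path))
                            {
                                string key = BitConverter.ToString(md5.ComputeHash(fs));
                                if (!index.TryGetValue(key, out var paths))
                                    index[key] = paths = new List<string>();
                                paths.Add(path);
                            }
                        }
                        catch (IOException)
                        {
                            // Skip files that are locked or unreadable.
                        }
                    }
                }
                return index;
            }
        }
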
Answer by CesarB

There is one small problem with using MD5 to compare files: there are known pairs of files which are different but have the same MD5.

This means you can use MD5 to tell if the files are different (if the MD5 is different, the files must be different), but you cannot use MD5 to tell if the files are equal (if the files are equal, the MD5 must be the same, but if the MD5 is equal, the files might or might not be equal).

You should either use a hash function which has not been broken yet (like SHA-1), or (as @SoapBox mentioned) use MD5 only as a fast way to find candidates for a deeper comparison.

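Since the .NET hash providers all derive from the common HashAlgorithm base class, swapping algorithms is a one-line change at the call site; a minimal sketch, with an illustrative helper name:

        using System;
        using System.IO;
        using System.Security.Cryptography;

        static class FileHasher
        {
            // Works with MD5, SHA1, SHA256, ... - anything derived from HashAlgorithm.
            public static string HexHash(string path, HashAlgorithm algorithm)
            {
                using (var fs = File.OpenRead(path))
                    return BitConverter.ToString(algorithm.ComputeHash(fs)).Replace("-", "");
            }
        }

        // Usage: the caller picks the algorithm, e.g.
        //     using (var sha1 = SHA1.Create())
        //     {
        //         string hex = FileHasher.HexHash(path, sha1);
        //     }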

Answer by Rich.

Regardless of cryptographic requirements, the possibility of a hash collision exists, so no hashing function can be used to guarantee that two files are identical.

I wrote similar code a while back, which I got to run pretty fast by indexing all the files first and discarding any with a different size. A fast hash comparison (on part of each file) was then performed on the remaining entries (comparing bytes for this step proved less useful - many file types have common headers, with identical bytes at the start of the file). Any files left after this stage were then checked using MD5, and finally, if the MD5 matched, with a byte comparison of the whole file, just to ensure the contents were the same.

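A sketch of that staged pipeline, under the assumption that each stage runs only when every cheaper stage has already matched (all helper names are illustrative):

        using System.IO;
        using System.Linq;
        using System.Security.Cryptography;

        static class StagedCompare
        {
            // MD5 of at most `limit` bytes from the start of the file.
            static byte[] Md5Prefix(string path, int limit)
            {
                using (var md5 = MD5.Create())
                using (var fs = File.OpenRead(path))
                {
                    var buf = new byte[limit];
                    int read = fs.Read(buf, 0, limit);
                    return md5.ComputeHash(buf, 0, read);
                }
            }

            // MD5 of the whole file.
            static byte[] Md5Full(string path)
            {
                using (var md5 = MD5.Create())
                using (var fs = File.OpenRead(path))
                    return md5.ComputeHash(fs);
            }

            // Byte-for-byte comparison; only reached when every hash matched.
            static bool BytesEqual(string a, string b)
            {
                using (var fa = new BufferedStream(File.OpenRead(a), 1 << 16))
                using (var fb = new BufferedStream(File.OpenRead(b), 1 << 16))
                {
                    int x, y;
                    do
                    {
                        x = fa.ReadByte();
                        y = fb.ReadByte();
                        if (x != y) return false;
                    } while (x != -1);
                    return true;
                }
            }

            public static bool AreIdentical(string a, string b)
            {
                // Stage 1: size - free, straight from metadata.
                if (new FileInfo(a).Length != new FileInfo(b).Length) return false;
                // Stage 2: hash of part of each file - a cheap partial read.
                if (!Md5Prefix(a, 4096).SequenceEqual(Md5Prefix(b, 4096))) return false;
                // Stage 3: full MD5 - a full read, but rarely reached.
                if (!Md5Full(a).SequenceEqual(Md5Full(b))) return false;
                // Stage 4: whole-file byte comparison to rule out MD5 collisions.
                return BytesEqual(a, b);
            }
        }
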
Answer by Sergey Grischenko

Use MD5CryptoServiceProvider and BufferedStream

        using (FileStream stream = File.OpenRead(filePath))
        using (var bufferedStream = new BufferedStream(stream, 1024 * 32))
        using (var md5 = new MD5CryptoServiceProvider())
        {
            // Hashing through a 32K buffered wrapper cuts down on small reads.
            byte[] checksum = md5.ComputeHash(bufferedStream);
            return BitConverter.ToString(checksum).Replace("-", String.Empty);
        }
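
(A side note on the buffering: FileStream already buffers its reads, 4 KB by default, so an alternative is to pass a larger buffer size directly to the FileStream constructor instead of wrapping the stream in a BufferedStream.)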