java 检查两个图像文件是否相同:校验和还是哈希?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/6382116/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
To check if two image files are the same: checksum or hash?
提问by Abhishek
I am writing some image-processing code in which I download some images (as BufferedImage) from URLs and pass them on to an image processor.
我正在做一些图像处理代码,其中我从 URL 下载一些图像(作为 BufferedImage)并将其传递给图像处理器。
I want to avoid passing the same image more than once to the image processor (as the image-processing operation is of high cost). The URL endpoints of the images (even when they are the same image) may vary, and hence I cannot prevent this by comparing URLs alone. So I was planning to compute a checksum or hash to identify whether the code is encountering the same image again.
我想避免将同一图像多次传递给图像处理器(因为图像处理操作的成本很高)。图像的 URL 端点(即使是相同的图像)也可能不同,因此我无法仅通过比较 URL 来防止这种情况。所以我打算计算一个校验和或哈希,来判断代码是否再次遇到了相同的图像。
For MD5 I tried Fast MD5, and it generated a 20K+ character hex checksum value for the image (for some samples). Obviously, storing this 20K+ character hash would be an issue when it comes to database storage. Hence I tried CRC32 (from java.util.zip.CRC32), and it did generate a much shorter checksum than that hash.
对于 MD5,我尝试了 Fast MD5,它为图像(某些样本)生成了一个 20K+ 字符长度的十六进制校验和值。显然,存储这个 20K+ 字符的哈希在数据库存储方面会是个问题。因此我尝试了 CRC32(来自 java.util.zip.CRC32),它生成的校验和确实比那个哈希短得多。
I do understand that checksums and hashes serve different purposes. For the purpose explained above, can I just use CRC32? Would it do the job, or do I have to try something beyond these two?
我确实理解校验和与哈希的用途不同。出于上述目的,我可以只使用 CRC32 吗?它能满足需求吗,还是我必须尝试这两者之外的其他方法?
Thanks, Abi
谢谢,阿比
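For reference, a minimal sketch of computing both digests with only the standard JDK (`java.security.MessageDigest` and `java.util.zip.CRC32`). Note that a standard MD5 digest is always 16 bytes, i.e. 32 hex characters, so a 20K+ character result usually means the hex string was built from something other than the raw digest:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.CRC32;

public class DigestDemo {
    // Hex-encode a byte array, two lowercase hex characters per byte
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        byte[] data = "example image bytes".getBytes(); // stand-in for the downloaded image

        // MD5: a 16-byte digest, i.e. exactly 32 hex characters
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        String md5Hex = toHex(md5.digest(data));
        System.out.println("MD5 (" + md5Hex.length() + " hex chars): " + md5Hex);

        // CRC32: a 32-bit value, i.e. at most 8 hex characters
        CRC32 crc = new CRC32();
        crc.update(data);
        System.out.println("CRC32: " + Long.toHexString(crc.getValue()));
    }
}
```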
回答by SJuan76
The difference between CRC and, say, MD5, is that it is more difficult to tamper with a file to match a "target" MD5 than to tamper with it to match a "target" checksum. Since this does not seem to be a problem for your program, it should not matter which method you use. Maybe MD5 is a little more CPU-intensive, but I do not know if that difference will matter.
CRC 和 MD5 之间的区别在于,篡改文件以匹配"目标"MD5 比篡改文件以匹配"目标"校验和更困难。由于这对您的程序来说似乎不是问题,因此您使用哪种方法应该无关紧要。也许 MD5 会多占用一点 CPU,但我不知道这种差异是否重要。
The main question should be the number of bytes of the digest.
主要问题应该是摘要的字节数。
If you compute the checksum into a 32-bit integer, it will mean that, for a file of 2048 bits, you are mapping 2^2048 combinations onto 2^32 values --> for every CRC value, there are 2^2016 possible files that match it. If you have a 128-bit MD5, there are still 2^1920 possible files per digest, but far fewer than with a 32-bit CRC.
如果您把校验和算成一个 32 位整数,那就意味着,对于一个 2048 位的文件,您是把 2^2048 种组合映射到 2^32 个值上 --> 对于每个 CRC 值,将有 2^2016 个可能的文件与之匹配。如果您使用 128 位的 MD5,每个摘要仍对应 2^1920 个可能的文件,但远少于 32 位 CRC 的情况。
The bigger the code that you compute, the fewer possible collisions (given that the computed codes are distributed evenly), and so the safer the comparison.
您计算的代码越大,冲突的可能性就越小(假设计算的代码分布均匀),因此比较越安全。
Anyway, in order to minimize possible errors, I think the first classification should be by file size... first compare file sizes, and if they match then compare checksums/hashes.
无论如何,为了尽量减少可能的错误,我认为第一个分类应该使用文件大小......首先比较文件大小,如果它们匹配然后比较校验和/哈希。
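That size-then-hash ordering could be sketched as follows (a hypothetical `isDuplicate` helper, assuming the downloaded images are available as byte arrays):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SeenImages {
    // Digests already seen, bucketed by file size: size -> set of MD5 hex digests
    private final Map<Integer, Set<String>> seenBySize = new HashMap<>();

    /** Returns true if an image with the same size and MD5 digest was seen before. */
    public boolean isDuplicate(byte[] imageBytes) throws NoSuchAlgorithmException {
        // Size check first: only images in the same size bucket can be duplicates
        Set<String> hashes = seenBySize.computeIfAbsent(imageBytes.length, k -> new HashSet<>());
        byte[] digest = MessageDigest.getInstance("MD5").digest(imageBytes);
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return !hashes.add(hex.toString()); // Set.add() returns false if already present
    }
}
```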
回答by GolezTrol
A checksum and a hash serve basically the same purpose here. You should be able to calculate any kind of hash. A regular MD5 would normally suffice. If you like, you could store the size and the MD5 hash (which is 16 bytes, I think).
在这里,校验和与哈希的作用基本相同。您应该能够计算任何类型的哈希。普通的 MD5 通常就足够了。如果您愿意,可以同时存储文件大小和 MD5 哈希值(我认为是 16 字节)。
If two files have different sizes, they are different files. You will not even need to calculate a hash over the data. If it is unlikely that you have many duplicate files, and the files are of the larger kind (like JPG pictures taken with a camera), this optimization may spare you a lot of time.
如果两个文件的大小不同,则它们是不同的文件。您甚至不需要计算数据的散列。如果您不太可能有很多重复文件,并且文件类型较大(例如,用相机拍摄的 JPG 图片),则此优化可能会为您节省大量时间。
If two or more files have the same size, you can calculate the hashes and compare them.
如果两个或多个文件的大小相同,您可以计算哈希值并进行比较。
If two hashes are the same, you could compare the actual data to see whether the files really differ after all. This is very, very unlikely, but theoretically possible. The larger your hash (MD5 is 16 bytes, while CRC32 is only 4), the less likely that two different files will have the same hash. It will take only 10 minutes of programming to perform this extra check, though, so I'd say: better safe than sorry. :)
如果两个哈希值相同,您可以比较实际数据,看看这两个文件是否真的不同。这是非常非常不可能的,但理论上是可能的。哈希越大(MD5 是 16 个字节,而 CRC32 只有 4 个),两个不同文件具有相同哈希的可能性就越小。不过,执行这个额外检查只需要 10 分钟的编程时间,所以我想说:安全总比后悔好。:)
To further optimize this, if exactly two files have the same size, you can just compare their data. You will need to read the files anyway to calculate their hashes, so why not compare them directly if they are the only two with that specific size.
为了进一步优化这一点,如果恰好两个文件具有相同的大小,您可以只比较它们的数据。您无论如何都需要读取文件来计算它们的哈希值,所以如果它们是唯一具有特定大小的两个,为什么不直接比较它们。
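The full size -> hash -> byte-comparison pipeline described above might look like this for two in-memory files (a hypothetical helper; for files on disk you would first load them, e.g. with `java.nio.file.Files.readAllBytes`):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class ImageCompare {
    /** Cheap-to-expensive comparison: size first, then MD5, then the raw bytes. */
    public static boolean sameImage(byte[] a, byte[] b) throws NoSuchAlgorithmException {
        if (a.length != b.length) return false;         // different size -> different files
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] hashA = md.digest(a);                    // digest() resets the MessageDigest
        byte[] hashB = md.digest(b);
        if (!Arrays.equals(hashA, hashB)) return false; // different digest -> different files
        return Arrays.equals(a, b);                     // final check against hash collisions
    }
}
```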