Java: HashCode vs SHA-1
Original question: http://stackoverflow.com/questions/853332/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
HashCode vs SHA-1
Asked by LB40
I'd like to compare some large objects representing trees, and cache something to avoid comparing each new object with the ones that already exist...
The question is: what would be the best "something"? (A compromise between performance and collisions...)
On the one hand, I have a regular hashCode function based on the values of various fields (following chapter 3 of Effective Java). But I'm not able to evaluate the potential collisions entailed by such an approach.
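For illustration, a field-based hashCode of the kind described might look like the sketch below. The TreeNode class and its fields are hypothetical (the question does not show the real classes); the 17/31 recipe is the common one from Effective Java.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical tree node; the class and field names are illustrative only.
final class TreeNode {
    private final String label;
    private final List<TreeNode> children = new ArrayList<>();

    TreeNode(String label) {
        this.label = label;
    }

    TreeNode add(TreeNode child) {
        children.add(child);
        return this;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof TreeNode)) return false;
        TreeNode other = (TreeNode) o;
        // List.equals compares children pairwise, so this recurses down the tree.
        return label.equals(other.label) && children.equals(other.children);
    }

    @Override
    public int hashCode() {
        // Combine each significant field with the usual 31 multiplier.
        int result = 17;
        result = 31 * result + label.hashCode();
        result = 31 * result + children.hashCode(); // recurses into every child
        return result;
    }
}
```

Note that this walks the whole tree on every call, so for large immutable trees it may be worth computing the value once and storing it.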
On the other hand, I have the MessageDigest approach from the standard Java distribution with the SHA-1 algorithm. I presume it's not going to be as efficient, but I may have fewer collisions. Am I right? Is it a correct solution in my context, or am I completely wrong?
The thing is that I don't know how large the objects will be. Please also note that the computed value is not going to be used in a HashTable.
thx...
Answered by Jeff Ferland
See the following:
- http://www.javapractices.com/topic/TopicAction.do?Id=28
- https://docs.oracle.com/javase/8/docs/api/java/lang/Object.html#hashCode--
- http://www.ibm.com/developerworks/java/library/j-jtp05273.html
Keep in mind the following:
- Two objects may be unequal, yet have the same hash code.
- Your collision potential depends on how many objects you encounter.
- How useful hash codes will be depends on how you implement checking.
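One way to implement the checking is to treat the hash as a cheap pre-filter and run the expensive comparison only when the hashes match. The sketch below is an assumption about how such a cache could be organised; the SeenCache class and its intern method are hypothetical names, not something from the answer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical cache: groups already-seen objects by hash code and only
// runs equals() on candidates whose hash code matches.
final class SeenCache<T> {
    private final Map<Integer, List<T>> byHash = new HashMap<>();

    /** Returns the previously stored equal object, or stores and returns the candidate. */
    T intern(T candidate) {
        List<T> bucket = byHash.computeIfAbsent(candidate.hashCode(), k -> new ArrayList<>());
        for (T existing : bucket) {
            // Same hash does not imply equality, so confirm with equals().
            if (existing.equals(candidate)) {
                return existing;
            }
        }
        bucket.add(candidate);
        return candidate;
    }
}
```

With this arrangement a collision costs one extra equals() call but never produces a wrong answer, which is why a 32-bit hash can still be adequate.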
Generally, you can determine the chance of a collision based upon the number of expected objects and the number of possible hashes (max hash value). See http://en.wikipedia.org/wiki/Birthday_paradox for the detailed explanation.
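As a rough sketch of that calculation, the usual birthday approximation p ≈ 1 - exp(-n(n-1) / (2 * 2^bits)) can be evaluated directly. It assumes the hash values are uniformly distributed, which a real hashCode() may not achieve; the BirthdayBound class name is made up for this example.

```java
final class BirthdayBound {
    /**
     * Approximate probability of at least one collision among n uniformly
     * distributed hashes of the given bit width:
     * 1 - exp(-n(n-1) / (2 * 2^bits)).
     */
    static double collisionProbability(long n, int bits) {
        double space = Math.pow(2.0, bits);
        double exponent = -((double) n * (n - 1)) / (2.0 * space);
        // expm1 keeps precision when the probability is extremely small.
        return -Math.expm1(exponent);
    }

    public static void main(String[] args) {
        long[] counts = {10_000L, 50_000L, 100_000L, 1_000_000L};
        for (long n : counts) {
            System.out.printf("n=%d  32-bit: %.4f  160-bit: %.3e%n",
                    n, collisionProbability(n, 32), collisionProbability(n, 160));
        }
    }
}
```

Under this approximation an ideal 32-bit hash has roughly a 1% chance of at least one collision at 10,000 items and about 25% at 50,000, while the 160-bit figures stay vanishingly small.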
Personally? Java objects (instantiated classes) < 10,000? Hash code. Representing files / blobs / lots of data? SHA-1. I use SHA-1 hashing in my database to keep people from doing ETL work on the same file more than once. I then use SHA-1 hashing again at a second level to keep people from ETLing the same section in more than one file (e.g., different files but the same order shows up twice).
Answered by matt b
Personally I would use hashCode() for the objects until it's been proven that collisions are an actual problem, to avoid preemptively optimizing for a problem which you might not actually have.
Answered by erickson
Because of the birthday problem, the chance of a collision depends on how many items you are working with.
The 160-bit space of SHA-1 is so large that I doubt you could ever have enough items to see a collision.
The 32-bit space of hashCode() should not have a significant number of collisions until you have over 50,000 items. However, this depends on using a good hash algorithm.
In order to apply a cryptographic digest like SHA-1, you'll need to convert your graph to a string of bytes, which is likely to be computationally expensive, and could be complicated.
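A minimal sketch of that conversion: walk the tree in pre-order and feed a length-prefixed encoding of each node into a MessageDigest. The DigestNode class and the encoding are assumptions made for this example; a real tree would need every field that matters for equality to be encoded.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical tree node used only for this sketch.
final class DigestNode {
    final String label;
    final List<DigestNode> children = new ArrayList<>();

    DigestNode(String label) {
        this.label = label;
    }

    DigestNode add(DigestNode child) {
        children.add(child);
        return this;
    }
}

final class TreeDigest {
    /** Returns the 20-byte SHA-1 digest of a pre-order encoding of the tree. */
    static byte[] sha1(DigestNode root) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            feed(md, root);
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    // Length prefixes and child counts keep the byte encoding unambiguous,
    // so trees with different shapes cannot serialize to the same bytes.
    private static void feed(MessageDigest md, DigestNode node) {
        byte[] label = node.label.getBytes(StandardCharsets.UTF_8);
        md.update(intBytes(label.length));
        md.update(label);
        md.update(intBytes(node.children.size()));
        for (DigestNode child : node.children) {
            feed(md, child);
        }
    }

    private static byte[] intBytes(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }
}
```

Streaming the nodes into the digest avoids building one large byte array, but the whole tree is still visited once per digest.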
Answered by Neil Coffey
Usually for duplicate file/data detection, MD5 is a good tradeoff between speed and chance of collision. MD5 is inappropriate if somebody could be deliberately crafting files to fool your program (it is slightly vulnerable to collision attacks). But if you're just worried about collisions by chance, then its 128-bit width is practically always sufficient at present.
SHA-1 and SHA-256 give you some protection against deliberate collision attacks (only theoretical attacks on SHA-1 were known when this answer was written, although a practical SHA-1 collision was demonstrated in 2017; for keying data, it's rarely worth going beyond a 160-bit hash code width). SHA-1 is roughly half the speed of MD5.
Certainly if you use MD5, performance probably shouldn't be too much of an issue. But obviously this does depend on the size of your data. You may be interested in some information I put together about performance of secure hash functions in Java.
If you really do need something faster and you're only dealing with a few million items of data, then another option to consider is the 64-bit hash algorithm proposed by the Numerical Recipes authors.
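The Numerical Recipes algorithm itself is not reproduced here. To show the general shape of a fast, non-cryptographic 64-bit hash, the sketch below uses FNV-1a instead, which is a different, well-known algorithm swapped in purely as a stand-in.

```java
import java.nio.charset.StandardCharsets;

// FNV-1a, 64-bit variant. This is a stand-in example, not the
// Numerical Recipes hash mentioned in the answer.
final class Fnv1a64 {
    private static final long OFFSET_BASIS = 0xcbf29ce484222325L;
    private static final long PRIME = 0x100000001b3L;

    static long hash(byte[] data) {
        long h = OFFSET_BASIS;
        for (byte b : data) {
            h ^= (b & 0xff); // mix in the next byte as an unsigned value
            h *= PRIME;      // multiplication wraps modulo 2^64
        }
        return h;
    }

    static long hash(String text) {
        return hash(text.getBytes(StandardCharsets.UTF_8));
    }
}
```

A 64-bit value like this offers no protection against deliberately crafted inputs; it only lowers the chance of accidental collisions compared with a 32-bit hashCode().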
Java's standard hashCode() implementation (of, say, String) is probably not suitable: aside from any issues about the quality of the hash, its 32-bit width means that you'll expect a collision after just 16,000 items or so.
Answered by John Munsch
I'll endorse matt b's saying "don't optimize before you need to optimize."
However, should you decide you need something more than the hash code down the road... I used message digests (MD5 in my case) to "uniquely" identify various items downloaded from RSS feeds so I didn't end up with the same item appearing many times in the list as I polled over and over. Those were typically small postings so the digest could be calculated quickly. In my experience it was very effective and worked well.
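A rough sketch of that kind of de-duplication follows, assuming each feed item can be reduced to a single string. The FeedDeduplicator class and its methods are hypothetical; the answer does not show its actual code.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: remembers the MD5 digest of every item seen so far.
final class FeedDeduplicator {
    private final Set<String> seenDigests = new HashSet<>();

    /** Returns true the first time an item's text is seen, false on repeats. */
    boolean isNew(String itemText) {
        return seenDigests.add(md5Hex(itemText));
    }

    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }
}
```

Storing only the 32-character digest keeps the "already seen" set small even when the postings themselves are long.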
Since they normally are one-way functions meant to react strongly to even very small changes in the input data, you are definitely less likely to get collisions with MD5 or SHA-1.

