Hash collision in git

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/10434326/
Asked by Sec
What would actually happen if I had a hash collision while using git?
E.g. I manage to commit two files with the same sha1 checksum, would git notice it or corrupt one of the files?
Could git be improved to live with that, or would I have to change to a new hash algorithm?
(Please do not deflect this question by discussing how unlikely that is - Thanks)
Answered by MichaelK
Picking atoms on 10 Moons
An SHA-1 hash is a 40 hex character string... that's 4 bits per character times 40... 160 bits. Now we know 10 bits is approximately 1000 (1024 to be exact), meaning that there are 1 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 different SHA-1 hashes... 10^48.
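To sanity-check that arithmetic, here is a small illustrative snippet (not anything git itself runs):

```python
# Size of the SHA-1 output space: 40 hex characters at 4 bits each.
hex_chars = 40
bits = hex_chars * 4        # 160 bits
space = 2 ** bits           # number of distinct SHA-1 values

print(bits)                 # 160
print(len(str(space)))      # 49 digits, i.e. on the order of 10^48
```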
What is this equivalent to? Well, the Moon is made up of about 10^47 atoms. So if we have 10 Moons... and you randomly pick one atom on one of these moons... and then go ahead and pick a random atom on them again... then the likelihood that you'll pick the same atom twice is the likelihood that two given git commits will have the same SHA-1 hash.
Expanding on this we can ask the question...
How many commits do you need in a repository before you should start worrying about collisions?
This relates to so called "Birthday attacks", which in turn refers to the "Birthday Paradox" or "Birthday Problem", which states that when you pick randomly from a given set, you need surprisingly few picks before you are more likely than not to have picked something twice. But "surprisingly few" is a very relative term here.
Wikipedia has a table on the probability of Birthday Paradox collisions. There is no entry for a 40 character hash. But an interpolation of the entries for 32 and 48 characters lands us in the range of 5×10^22 git commits for a 0.1% probability of a collision. That is fifty thousand billion billion different commits, or fifty Zettacommits, before you have reached even a 0.1% chance that you have a collision.
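The 5×10^22 figure can also be reproduced without interpolating Wikipedia's table, using the standard birthday-bound approximation n ≈ sqrt(2·N·ln(1/(1−p))); this is just an illustrative sketch:

```python
import math

N = 2 ** 160   # size of the SHA-1 output space
p = 0.001      # target collision probability (0.1%)

# Birthday-bound approximation for the number of random picks needed
# before the collision probability reaches p:
n = math.sqrt(2 * N * math.log(1 / (1 - p)))
print(f"{n:.1e}")   # about 5.4e22 commits
```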
The byte sum of the hashes alone for these commits would be more data than all the data generated on Earth for a year, which is to say you would need to churn out code faster than YouTube streams out video. Good luck with that. :D
The point of this is that unless someone is deliberately causing a collision, the probability of one happening at random is so staggeringly small that you can ignore this issue.
"But when a collision does occur, then what actually happens?"
Ok, suppose the improbable does happen, or suppose someone managed to tailor a deliberate SHA-1 hash collision. What happens then?
In that case there is an excellent answer where someone experimented on it. I will quote from that answer:
- If a blob already exists with the same hash, you will not get any warnings at all. Everything seems to be ok, but when you push, someone clones, or you revert, you will lose the latest version (in line with what is explained above).
- If a tree object already exists and you make a blob with the same hash: Everything will seem normal, until you either try to push or someone clones your repository. Then you will see that the repo is corrupt.
- If a commit object already exists and you make a blob with the same hash: same as #2 - corrupt
- If a blob already exists and you make a commit object with the same hash, it will fail when updating the "ref".
- If a blob already exists and you make a tree object with the same hash. It will fail when creating the commit.
- If a tree object already exists and you make a commit object with the same hash, it will fail when updating the "ref".
- If a tree object already exists and you make a tree object with the same hash, everything will seem ok. But when you commit, all of the repository will reference the wrong tree.
- If a commit object already exists and you make a commit object with the same hash, everything will seem ok. But when you commit, the commit will never be created, and the HEAD pointer will be moved to an old commit.
- If a commit object already exists and you make a tree object with the same hash, it will fail when creating the commit.
As you can see, some cases are not good. Especially cases #2 and #3 mess up your repository. However, it does seem that the fault stays within that repository, and the attack/bizarre improbability does not propagate to other repositories.
Also it seems that the issue of deliberate collisions is being recognised as a real threat, and so for instance GitHub is taking measures to prevent it.
Answered by klaustopher
If two files have the same hash sum in git, it would treat those files as identical. In the absolutely unlikely case this happens, you could always go back one commit, and change something in the file so they wouldn't collide anymore ...
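That identical treatment follows from how git names objects: a blob's id is the SHA-1 of a short header plus the content, so two files with the same bytes (or, hypothetically, a colliding pair) map to the same object. A sketch of what `git hash-object` computes:

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    # git hashes "blob <size>\0<content>", not the raw file bytes
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Two "files" with identical content get the same object id,
# so git stores the content only once.
print(git_blob_id(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
```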
See Linus Torvalds' post in the thread “Starting to think about sha-256?” in the git mailing list.
Answered by Steve
It's not really possible to answer this question with the right "but" without also explaining why it's not a problem. It's not possible to do that without really having a good grip on what a hash really is. It's more complicated than the simple cases you might have been exposed to in a CS program.
There is a basic misunderstanding of information theory here. If you reduce a large amount of information into a smaller amount by discarding some amount (i.e. a hash), there will be a chance of collision directly related to the length of the data. The shorter the data, the LESS likely it will be. Now, the vast majority of the collisions will be gibberish, making them that much more likely to actually happen (you would never check in gibberish... even a binary image is somewhat structured). In the end, the chances are remote.

To answer your question, yes, git will treat them as the same; changing the hash algorithm won't help. It would take a "second check" of some sort, but ultimately, you would need as much "additional check" data as the length of the data to be 100% sure... keep in mind you would be 99.99999... to a really long number of digits... sure with a simple check like you describe.

SHA-x are cryptographically strong hashes, which means it's generally hard to intentionally create two source data sets that are both VERY SIMILAR to each other and have the same hash. One bit of change in the data should create more than one (preferably as many as possible) bits of change in the hash output, which also means it's very difficult (but not quite impossible) to work back from the hash to the complete set of collisions, and thereby pull out the original message from that set of collisions - all but a few will be gibberish, and of the ones that aren't, there's still a huge number to sift through if the message length is any significant length. The downside of a crypto hash is that they are slow to compute... in general.
So, what's it all mean then for Git? Not much. The hashes get done so rarely (relative to everything else) that their computational penalty is low overall to operations. The chances of hitting a pair of collisions is so low, it's not a realistic chance to occur and not be detected immediately (i.e. your code would most likely suddenly stop building), allowing the user to fix the problem (back up a revision, and make the change again, and you'll almost certainly get a different hash because of the time change, which also feeds the hash in git). It's more likely to be a real problem for you if you're storing arbitrary binaries in git, which isn't really what its primary use model is. If you want to do that... you're probably better off using a traditional database.
It's not wrong to think about this - it's a good question that a lot of people just pass off as "so unlikely it's not worth thinking about" - but it's really a little more complicated than that. If it DOES happen, it should be very readily detectable; it won't be a silent corruption in a normal workflow.
Answered by Roberto Bonvallet
Could git be improved to live with that, or would I have to change to a new hash algorithm?
Collisions are possible for any hash algorithm, so changing the hash function doesn't preclude the problem, it just makes it less likely to happen. So you should then choose a really good hash function (SHA-1 already is, but you asked not to be told :)
Answered by VonC
You can see a good study in "How would Git handle a SHA-1 collision on a blob?".
Since a SHA1 collision is now possible (as I reference in this answer with shattered.io), know that Git 2.13 (Q2 2017) will improve/mitigate the current situation with a "detect attempt to create collisions" variant of the SHA-1 implementation by Marc Stevens (CWI) and Dan Shumow (Microsoft).
See commit f5f5e7f, commit 8325e43, commit c0c2006, commit 45a574e, commit 28dc98e (16 Mar 2017) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 48b3693, 24 Mar 2017)
Makefile: make DC_SHA1 the default

We used to use the SHA1 implementation from the OpenSSL library by default.
As we are trying to be careful against collision attacks after the recent "shattered" announcement, switch the default to encourage people to use the DC_SHA1 implementation instead.
Those who want to use the implementation from OpenSSL can explicitly ask for it with OPENSSL_SHA1=YesPlease when running "make".

We don't actually have a Git-object collision, so the best we can do is to run one of the shattered PDFs through test-sha1. This should trigger the collision check and die.
Could Git be improved to live with that, or would I have to change to a new hash algorithm?
Update Dec. 2017 with Git 2.16 (Q1 2018): this effort to support an alternative SHA is underway: see "Why doesn't Git use more modern SHA?".
You will be able to use another hash algorithm: SHA1 is no longer the only one for Git.
Git 2.18 (Q2 2018) documents that process.
See commit 5988eb6, commit 45fa195 (26 Mar 2018) by Ævar Arnfjörð Bjarmason (avar).
(Merged by Junio C Hamano -- gitster -- in commit d877975, 11 Apr 2018)
doc hash-function-transition: clarify what SHAttered means

Attempt to clarify what the SHAttered attack means in practice for Git.
The previous version of the text made no mention whatsoever of Git already having a mitigation for this specific attack, which the SHAttered researchers claim will detect cryptanalytic collision attacks.

I may have gotten some of the nuances wrong, but as far as I know this new text accurately summarizes the current situation with SHA-1 in git. I.e. git doesn't really use SHA-1 anymore, it uses Hardened-SHA-1 (they just so happen to produce the same outputs 99.99999999999...% of the time).

Thus the previous text was incorrect in asserting that:

[...] As a result [of SHAttered], SHA-1 cannot be considered cryptographically secure any more [...]

That's not the case. We have a mitigation against SHAttered; however, we consider it prudent to move to work towards a NewHash should future vulnerabilities in either SHA-1 or Hardened-SHA-1 emerge.
So the new documentation now reads:
Git v2.13.0 and later subsequently moved to a hardened SHA-1 implementation by default, which isn't vulnerable to the SHAttered attack.
Thus Git has in effect already migrated to a new hash that isn't SHA-1 and doesn't share its vulnerabilities, its new hash function just happens to produce exactly the same output for all known inputs, except two PDFs published by the SHAttered researchers, and the new implementation (written by those researchers) claims to detect future cryptanalytic collision attacks.
Regardless, it's considered prudent to move past any variant of SHA-1 to a new hash. There's no guarantee that future attacks on SHA-1 won't be published in the future, and those attacks may not have viable mitigations.
If SHA-1 and its variants were to be truly broken, Git's hash function could not be considered cryptographically secure any more. This would impact the communication of hash values because we could not trust that a given hash value represented the known good version of content that the speaker intended.
Note: that same document now (Q3 2018, Git 2.19) explicitly references the "new hash" as SHA-256: see "Why doesn't Git use more modern SHA?".
Answered by Petercommand Hsu
Google now claims that an SHA-1 collision is possible under certain preconditions: https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html
Since git uses SHA-1 to check for file integrity, this means that file integrity in git is compromised.
IMO, git should definitely use a better hashing algorithm since deliberate collision is now possible.
Answered by bytecode77
A hash collision is so highly unlikely that it is sheer mind blowing! Scientists all over the world are trying hard to achieve one, but haven't managed it yet. For certain algorithms such as MD5 they succeeded, though.
What are the odds?
SHA-256 has 2^256 possible hashes. That is about 10^77. Or to be more graphic, the chances of a collision are about
1 : 100 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
The chance of winning the lottery is about 1 in 14 million. The chance of a collision with SHA-256 is like winning the lottery on 11 consecutive days!
Mathematical explanation: 14 000 000^11 ≈ 2^256
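That approximation is easy to check; the exponent actually comes out near 2^261, so the comparison is rough but in the right ballpark:

```python
import math

lottery_odds = 14_000_000   # 1 in 14 million per win
wins = 11                   # consecutive lottery wins

# Exponent of 14_000_000**11 expressed as a power of two:
exponent = wins * math.log2(lottery_odds)
print(round(exponent))      # ~261, roughly comparable to 2^256
```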
Furthermore, the universe has about 10^80 atoms. That's only about 1,000 times more than there are SHA-256 combinations.
Successful MD5 collision
Even for MD5 the chances are tiny. Though, mathematicians managed to create a collision:
d131dd02c5e6eec4 693d9a0698aff95c 2fcab58712467eab 4004583eb8fb7f89 55ad340609f4b302 83e488832571415a 085125e8f7cdc99f d91dbdf280373c5b d8823e3156348f5b ae6dacd436c919c6 dd53e2b487da03fd 02396306d248cda0 e99f33420f577ee8 ce54b67080a80d1e c69821bcb6a88393 96f9652b6ff72a70
has the same MD5 as
d131dd02c5e6eec4 693d9a0698aff95c 2fcab50712467eab 4004583eb8fb7f89 55ad340609f4b302 83e4888325f1415a 085125e8f7cdc99f d91dbd7280373c5b d8823e3156348f5b ae6dacd436c919c6 dd53e23487da03fd 02396306d248cda0 e99f33420f577ee8 ce54b67080280d1e c69821bcb6a88393 96f965ab6ff72a70
This doesn't mean that MD5 is less safe now that its algorithm is cracked. You can create MD5 collisions on purpose, but the chance of an accidental MD5 collision is still 1 in 2^128, which is still a lot.
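Those two published blocks really do collide; decoding the hex and hashing both (a quick check with Python's hashlib) yields identical MD5 digests:

```python
import hashlib

m1 = bytes.fromhex(
    "d131dd02c5e6eec4693d9a0698aff95c2fcab58712467eab4004583eb8fb7f89"
    "55ad340609f4b30283e488832571415a085125e8f7cdc99fd91dbdf280373c5b"
    "d8823e3156348f5bae6dacd436c919c6dd53e2b487da03fd02396306d248cda0"
    "e99f33420f577ee8ce54b67080a80d1ec69821bcb6a8839396f9652b6ff72a70"
)
m2 = bytes.fromhex(
    "d131dd02c5e6eec4693d9a0698aff95c2fcab50712467eab4004583eb8fb7f89"
    "55ad340609f4b30283e4888325f1415a085125e8f7cdc99fd91dbd7280373c5b"
    "d8823e3156348f5bae6dacd436c919c6dd53e23487da03fd02396306d248cda0"
    "e99f33420f577ee8ce54b67080280d1ec69821bcb6a8839396f965ab6ff72a70"
)

print(m1 != m2)   # True: the inputs differ in six bytes
print(hashlib.md5(m1).hexdigest() == hashlib.md5(m2).hexdigest())  # True
```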
Conclusion
You don't have to have a single worry about collisions. Hashing algorithms are the second safest way to check file sameness. The only safer way is a binary comparison.
Answered by Guenther Brunthaler
I recently found a posting from 2013-04-29 in a BSD discussion group at
http://openbsd-archive.7691.n7.nabble.com/Why-does-OpenBSD-use-CVS-td226952.html
where the poster claims:
I ran into a hash collision once, using git rebase.
Unfortunately, he provides no proof for his claim. But maybe you would like trying to contact him and ask him about this supposed incident.
But on a more general level, due to the birthday attack, the chance of an SHA-1 hash collision is 1 in pow(2, 80).
This sounds like a lot and is certainly way more than the total number of versions of individual files present in all Git repositories of the world combined.
However, this only applies to the versions which actually remain in version history.
If a developer relies very much on rebasing, every time a rebase is run for a branch, all the commits in all the versions of that branch (or rebased part of the branch) get new hashes. The same is true for every file modified with "git filter-branch". Therefore, "rebase" and "filter-branch" might be big multipliers for the number of hashes generated over time, even though not all of them are actually kept: frequently, after rebasing (especially for the purpose of "cleaning up" a branch), the original branch is thrown away.
But if the collision occurs during the rebase or filter-branch, it can still have adverse effects.
Another thing would be to estimate the total number of hashed entities in git repositories and see how far they are from pow(2, 80).
Let's say we have about 8 billion people, and all of them would be running git and keep their stuff versioned in 100 git repositories per person. Let's further assume the average repository has 100 commits and 10 files, and only one of those files changes per commit.
For every revision we have at least a hash for the tree object and the commit object itself. Together with the changed file we have 3 hashes per revision, and thus 300 hashes per repository.
For 100 repositories of 8 billion people this gives pow(2, 47) which is still far from pow(2, 80).
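That back-of-the-envelope estimate, written out (the population, repository, and per-repository counts are of course just the assumptions made above):

```python
import math

people = 8_000_000_000      # assumed number of git users
repos_per_person = 100
hashes_per_repo = 300       # 100 revisions * 3 hashes (commit, tree, blob)

total_hashes = people * repos_per_person * hashes_per_repo
print(f"{math.log2(total_hashes):.1f}")   # about 47.8, i.e. around pow(2, 47), far from pow(2, 80)
```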
However, this does not include the supposed multiplication effect mentioned above, because I am uncertain how to include it in this estimation. Maybe it could increase the chances for a collision considerably. Especially if very large repositories with a long commit history (like the Linux Kernel) are rebased by many people for small changes, which nevertheless create different hashes for all affected commits.