How does Git create unique commit hashes, mainly the first few characters?

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/34764195/

Date: 2020-09-19 11:41:41  Source: igfitidea

How does Git create unique commit hashes, mainly the first few characters?

Tags: git, algorithm, hash, git-hash

Asked by Ben

I find it hard to wrap my head around how Git creates fully unique hashes that aren't allowed to be the same even in the first 4 characters. I'm able to call commits in Git Bash using only the first four characters. Is it specifically decided in the algorithm that the first characters are "ultra"-unique and will not ever conflict with other similar hashes, or does the algorithm generate every part of the hash in the same way?

Answered by Chris Maes

Git uses the following information to generate the sha-1:

  • The source tree of the commit (which unravels to all the subtrees and blobs)
  • The parent commit sha1
  • The author info (with timestamp)
  • The committer info (yes, author and committer are different; this also carries a timestamp)
  • The commit message

(For the complete explanation, look here.)
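
The ingredients listed above can be turned into a small sketch. Git hashes an object by prefixing its content with a header of the form "<type> <size>" followed by a NUL byte, then taking the SHA-1 of the result. The helper names git_object_id and build_commit_body below are hypothetical, and the commit layout is simplified (no GPG signature or encoding headers), so treat this as an illustration rather than Git's actual code path.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """SHA-1 of '<type> <size>' + NUL + body, the way Git names loose objects."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def build_commit_body(tree, parents, author, committer, message) -> bytes:
    """Assemble a simplified commit object body (hypothetical helper).

    author and committer are full identity lines such as
    'Jane Doe <jane@example.com> 1500000000 +0000'.
    """
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    # A blank line separates the headers from the commit message.
    return ("\n".join(lines) + "\n\n" + message).encode("utf-8")


if __name__ == "__main__":
    empty_tree = git_object_id("tree", b"")
    who = "Jane Doe <jane@example.com> 1500000000 +0000"
    body = build_commit_body(empty_tree, [], who, who, "Initial commit\n")
    print(git_object_id("commit", body))
```

For instance, git_object_id("blob", b"") yields e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, the well-known ID of the empty blob. Because every field feeds the same SHA-1 computation, changing only a timestamp produces an entirely different hash; no part of the hash is generated in a special way.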

Git does NOT guarantee that the first 4 characters will be unique. In chapter 7 of the Pro Git Book it is written:

Git can figure out a short, unique abbreviation for your SHA-1 values. If you pass --abbrev-commit to the git log command, the output will use shorter values but keep them unique; it defaults to using seven characters but makes them longer if necessary to keep the SHA-1 unambiguous:

So Git just makes the abbreviation as long as necessary to remain unique. They even note that:
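
The idea of lengthening an abbreviation until it is unambiguous can be sketched as follows. This is not Git's implementation (Git looks prefixes up in its object database); shortest_unique_abbrev is a hypothetical helper that scans a plain list of full hex IDs, with the seven-character default taken from the quote above.

```python
from typing import Sequence


def shortest_unique_abbrev(target: str, all_ids: Sequence[str], minimum: int = 7) -> str:
    """Return the shortest prefix of target (at least `minimum` chars)
    that no other ID in all_ids shares."""
    others = [h for h in all_ids if h != target]
    for length in range(minimum, len(target) + 1):
        prefix = target[:length]
        # The prefix is unambiguous once no other ID starts with it.
        if not any(h.startswith(prefix) for h in others):
            return prefix
    return target  # fall back to the full ID


if __name__ == "__main__":
    ids = ["abcdef1234", "abcdef1299", "ffff000000"]
    for object_id in ids:
        print(object_id, "->", shortest_unique_abbrev(object_id, ids, minimum=4))
```

In this toy data the first two IDs share eight leading characters, so their abbreviations grow to nine characters, while the third stays at the minimum of four.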

Generally, eight to ten characters are more than enough to be unique within a project.

As an example, the Linux kernel, which is a pretty large project with over 450k commits and 3.6 million objects, has no two objects whose SHA-1s overlap more than the first 11 characters.

So in fact they just depend on the great improbability of two objects having the exact same SHA (or the same first X characters of it).
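
How improbable is a clash? A rough estimate comes from the birthday approximation: with n objects spread uniformly over N = 16^k possible k-character prefixes, the chance that at least two share a prefix is about 1 - exp(-n(n-1)/(2N)). The function below is a sketch under that uniformity assumption; collision_probability is a hypothetical name.

```python
import math


def collision_probability(num_objects: int, hex_chars: int) -> float:
    """Birthday-approximation probability that at least two of num_objects
    uniformly random hashes share their first hex_chars hex digits."""
    space = 16 ** hex_chars                      # number of distinct prefixes
    pairs = num_objects * (num_objects - 1) / 2  # number of object pairs
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-pairs / space)


if __name__ == "__main__":
    for chars in (4, 7, 12, 40):
        print(chars, collision_probability(3_600_000, chars))
```

With only 4 characters there are 65,536 possible prefixes, so a repository reaches even odds of some clash at roughly 300 objects. That is why four characters work in a small repository yet cannot be relied on in general.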

Answered by VonC

Apr. 2017: Beware that after the whole shattered.io episode (where a SHA1 collision was achieved by Google), the 20-byte format won't be there forever.

A first step for that is to replace unsigned char sha1[20], which is hard-coded all over the Git codebase, with a generic object whose definition might change in the future (SHA2?, Blake2, ...)

See commit e86ab2c (21 Feb 2017) by brian m. carlson (bk2204).

Convert the remaining uses of unsigned char [20] to struct object_id.

That is an example of an ongoing effort that started with commit 5f7817c (13 Mar 2015) by brian m. carlson (bk2204), for v2.5.0-rc0, in cache.h:

/* The length in bytes and in hex digits of an object name (SHA-1 value). */
#define GIT_SHA1_RAWSZ 20
#define GIT_SHA1_HEXSZ (2 * GIT_SHA1_RAWSZ)

struct object_id {
    unsigned char hash[GIT_SHA1_RAWSZ];
};

And don't forget that, even with SHA1, the first 4 characters are not enough to guarantee uniqueness, as I explain in "How much of a git sha is generally considered necessary to uniquely identify a change in a given codebase?".



Update Dec. 2017 with Git 2.16 (Q1 2018): this effort to support an alternative SHA is underway: see "Why doesn't Git use more modern SHA?".

You will be able to use another hash: SHA1 is no longer the only one for Git.

Update 2018-2019: the choice has been made in Git 2.19+: SHA-256.
See "hash-function-transition".

This is not yet active (meaning git 2.21 is still using SHA1), but the code is being written to support SHA-256 in the future.
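
To get a feel for what the transition changes for object names, the sketch below hashes the same framed object with SHA-1 and with SHA-256. It assumes the "<type> <size>" plus NUL header framing is kept unchanged under SHA-256, which is my reading of the hash-function-transition design rather than something stated in this answer; object_name is a hypothetical helper.

```python
import hashlib


def object_name(obj_type: str, body: bytes, algo: str = "sha1") -> str:
    """Hash '<type> <size>' + NUL + body with the chosen algorithm.

    The SHA-256 framing is an assumption made for illustration.
    """
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.new(algo, header + body).hexdigest()


if __name__ == "__main__":
    data = b"hello world\n"
    for algo in ("sha1", "sha256"):
        name = object_name("blob", data, algo)
        print(algo, len(name), name)
```

The SHA-1 name has 40 hex digits (20 bytes) and the SHA-256 name has 64 (32 bytes), which is why a fixed unsigned char [20] buffer cannot simply be reused.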



With Git 2.26 (Q1 2020), the work goes on, using "struct object_id" to replace uses of "char *sha1".

See commit 2fecc48, commit 6ac9760, commit b99b6bc, commit 63f4a7f, commit e31c710, commit 500e4f2, commit f66d4e0, commit a93c141, commit 3f83fd5, commit 0763671 (24 Feb 2020) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit e8e7184, 05 Mar 2020)

packfile: drop nth_packed_object_sha1()

Signed-off-by: Jeff King

Once upon a time, nth_packed_object_sha1() was the primary way to get the oid of a packfile's index position.
But these days we have the more type-safe nth_packed_object_id() wrapper, and all callers have been converted.

Let's drop the "sha1" version (turning the safer wrapper into a single function) so that nobody is tempted to introduce new callers.
