Notice: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Stack Overflow original: http://stackoverflow.com/questions/18134627/

Time: 2020-09-10 16:44:34  Source: igfitidea

How much of a git sha is *generally* considered necessary to uniquely identify a change in a given codebase?

Tags: git, github, sha

Asked by Jun-Dai Bates-Kobashigawa

If you're going to build, say, a directory structure where a directory is named for a commit in a Git repository, and you want it to be short enough to make your eyes not bleed, but long enough that the chance of it colliding would be negligible, how much of the SHA substring is generally required?

Let's say I want to uniquely identify this change: https://github.com/wycats/handlebars.js/commit/e62999f9ece7d9218b9768a908f8df9c11d7e920

I can use as little as the first four characters: https://github.com/wycats/handlebars.js/commit/e629

But I feel like that would be risky. Assuming a codebase that, over a couple of years, might have, say, 30k changes, what are the chances of a collision if I use 8 characters? 12? Is there a number that's generally considered acceptable for this sort of thing?

Answered by Nevik Rehnel

This question is actually answered in Chapter 7 of the Pro Git book:

Generally, eight to ten characters are more than enough to be unique within a project. One of the largest Git projects, the Linux kernel, is beginning to need 12 characters out of the possible 40 to stay unique.

7 digits is the Git default for a short SHA, so that's fine for most projects. The kernel team have increased theirs several times, as mentioned, because they have several hundred thousand commits. So for your ~30k commits, 8 or 10 digits should be perfectly fine.

Answered by VonC

Note: you can ask git rev-parse --short for the shortest yet still unique SHA1.
See "git get short hash from regular hash".

git rev-parse --short=4 921103db8259eb9de72f42db8b939895f5651489
92110

As you can see in my example the SHA1 has a length of 5 even if I specified a length of 4.
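
The lengthening behaviour shown above can be illustrated with a small Python sketch. This is not Git's implementation: the function name `shortest_unique_prefix` and the second hash in the usage lines are hypothetical, and the idea is simply to start at the requested length and add characters until no other known hash shares the prefix.

```python
def shortest_unique_prefix(target: str, all_hashes: list[str], minimum: int = 4) -> str:
    """Return the shortest prefix of `target` (at least `minimum` characters)
    that no other hash in `all_hashes` starts with."""
    others = [h for h in all_hashes if h != target]
    for length in range(minimum, len(target) + 1):
        prefix = target[:length]
        # Stop at the first length where the prefix is unambiguous.
        if not any(h.startswith(prefix) for h in others):
            return prefix
    return target


# The first hash is the one from the example above; the second is made up
# so that it shares the first four characters.
known = [
    "921103db8259eb9de72f42db8b939895f5651489",
    "9211" + "f" * 36,  # hypothetical colliding object
]
print(shortest_unique_prefix(known[0], known, minimum=4))  # 92110
```

With only these two hashes, four characters are ambiguous, so the result grows to five, which is the same effect as in the example.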

For big repos, 7 hasn't been enough since 2010; see commit dce9648 by Linus Torvalds himself (git 1.7.4.4, Oct 2010):

The default of 7 comes from fairly early in git development, when seven hex digits was a lot (it covers about 250+ million hash values).
Back then I thought that 65k revisions was a lot (it was what we were about to hit in BK), and each revision tends to be about 5-10 new objects or so, so a million objects was a big number.

(BK = BitKeeper)

These days, the kernel isn't even the largest git project, and even the kernel has about 220k revisions (much bigger than the BK tree ever was) and we are approaching two million objects.
At that point, seven hex digits is still unique for a lot of them, but when we're talking about just two orders of magnitude difference between number of objects and the hash size, there will be collisions in truncated hash values.
It's no longer even close to unrealistic - it happens all the time.

We should both increase the default abbrev that was unrealistically small, and add a way for people to set their own default per-project in the git config file.

core.abbrev

Set the length object names are abbreviated to.
If unspecified, many commands abbreviate to 7 hexdigits, which may not be enough for abbreviated object names to stay unique for sufficiently long time.

environment.c:

int minimum_abbrev = 4, default_abbrev = 7;

Note: As commented below by marco.m, core.abbrevLength was renamed to core.abbrev in that same Git 1.7.4.4, in commit a71f09f:

Rename core.abbrevlength back to core.abbrev

It corresponds to --abbrev=$n command line option after all.

More recently, Linus added in commit e6c587c (for Git 2.11, Q4 2016):
(as mentioned in Matthieu Moy's answer)

In fairly early days we somehow decided to abbreviate object names down to 7-hexdigits, but as projects grow, it is becoming more and more likely to see such a short object names made in earlier days and recorded in the log messages no longer unique.

Currently the Linux kernel project needs 11 to 12 hexdigits, while Git itself needs 10 hexdigits to uniquely identify the objects they have, while many smaller projects may still be fine with the original 7-hexdigit default. One-size does not fit all projects.

Introduce a mechanism, where we estimate the number of objects in the repository upon the first request to abbreviate an object name with the default setting and come up with a sane default for the repository. Based on the expectation that we would see collision in a repository with 2^(2N) objects when using object names shortened to first N bits, use sufficient number of hexdigits to cover the number of objects in the repository.
Each hexdigit (4-bits) we add to the shortened name allows us to have four times (2-bits) as many objects in the repository.
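
The rule in that commit message can be read as "use roughly twice as many hash bits as it takes to count the objects". The following Python sketch works under that reading. It is not Git's actual code: the function name `estimated_abbrev_length` is hypothetical, and the floor of 7 is assumed from the old default mentioned above.

```python
def estimated_abbrev_length(object_count: int, floor: int = 7) -> int:
    """Suggest a number of hex digits for a repository holding `object_count` objects."""
    # Number of bits needed to count the objects (roughly log2 of the count).
    bits_for_count = max(object_count, 1).bit_length()
    # Collisions become likely once the object count approaches
    # 2^(hash_bits / 2), so aim for hash_bits >= 2 * bits_for_count.
    # Each hex digit supplies 4 bits, hence ceil(bits_for_count / 2) digits.
    hexdigits = (bits_for_count + 1) // 2
    # Assumption: never go below the historical default of 7.
    return max(hexdigits, floor)


for count in (1_000, 30_000, 2_000_000, 5_000_000):
    print(count, estimated_abbrev_length(count))
```

Under these assumptions, two million objects give 11 digits and five million give 12, which is in line with the "11 to 12 hexdigits" quoted for the Linux kernel; adding one hex digit quadruples the object count the sketch tolerates.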

See commit e6c587c (01 Oct 2016) by Linus Torvalds (torvalds).
See commit 7b5b772, commit 65acfea (01 Oct 2016) by Junio C Hamano (gitster).
(Merged by Junio C Hamano -- gitster -- in commit bb188d0, 03 Oct 2016)

That new property (guessing a reasonable default for the SHA1 abbrev value) has a direct effect on how Git computes its own version number for release.

Answered by plugwash

This is known as the birthday problem.

For probabilities less than 1/2 the probability of a collision can be approximated as

p ~= n^2 / (2m)

Where n is the number of items and m is the number of possibilities for each item.

其中 n 是项目数,m 是每个项目的可能性数。

The number of possibilities for a hex string is 16^c, where c is the number of characters.

So for 8 characters and 30K commits

30K ~= 2^15

p ~= n^2 / (2m) ~= (2^15)^2 / (2 * 16^8) = 2^30 / 2^33 = 2^-3 = 1/8

Increasing it to 12 characters

p ~= n^2 / (2m) ~= (2^15)^2 / (2 * 16^12) = 2^30 / 2^49 = 2^-19
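
The arithmetic above is easy to reproduce. Here is a minimal Python sketch of the same approximation; the function name `approx_collision_probability` is just an illustrative choice, and the formula is only meaningful while the result stays below about 1/2.

```python
def approx_collision_probability(items: int, hex_chars: int) -> float:
    """Birthday-problem approximation p ~= n^2 / (2m), with m = 16^hex_chars."""
    possibilities = 16 ** hex_chars
    return items ** 2 / (2 * possibilities)


# 2^15 items, as in the worked example above.
print(approx_collision_probability(2 ** 15, 8))   # 0.125, i.e. 1/8
print(approx_collision_probability(2 ** 15, 12))  # 2^-19, about 1.9e-06
```

Using the exact figure of 30 000 commits instead of 2^15 gives slightly smaller values, a little over 10 % for 8 characters.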

Answered by Messa

This question has been answered, but for anyone looking for the math behind it: it's called the Birthday problem (Wikipedia).

It concerns the probability that 2 (or more) people in a group of N people have their birthday on the same day of the year. This is analogous to the probability that 2 (or more) git commits, in a repository with N commits in total, have the same hash prefix of length X.

Look at the Probability table. For example, for a hex hash string of length 8, the probability of a collision reaches 1 % when the repository has only about 9300 items (git commits). For 110 000 commits the probability is 75 %. But with a hex hash string of length 12, the probability of a collision among 100 000 commits is below 0.1 %.
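
These figures can be checked with a few lines of Python. The sketch below assumes the common estimate p ~= 1 - exp(-n^2 / (2m)), which stays usable at larger probabilities than the plain n^2 / (2m) form; the function name `collision_probability` is hypothetical.

```python
import math


def collision_probability(items: int, hex_chars: int) -> float:
    """Estimate the chance that at least two of `items` hashes share
    a prefix of `hex_chars` hex characters: 1 - exp(-n^2 / (2m))."""
    possibilities = 16 ** hex_chars
    # -expm1(-x) equals 1 - exp(-x) and stays accurate for very small x.
    return -math.expm1(-(items ** 2) / (2 * possibilities))


print(collision_probability(9_300, 8))     # roughly 1 %
print(collision_probability(110_000, 8))   # roughly 75 %
print(collision_probability(100_000, 12))  # far below 0.1 %
```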

Answered by Matthieu Moy

Git version 2.11 (or perhaps 2.12?) will contain a feature that adapts the number of characters used in short identifiers (e.g. git log --oneline) to the size of the project. Once you use such a version of Git, the answer to your question can be "pick whatever length Git gives you with git log --oneline, it's safe enough".

For more details, see the "Changing the default for “core.abbrev”?" discussion in Git Rev News edition 20 and commit bb188d00f7.
