Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/4075528/

What algorithm does git use to detect changes on your working tree?

git

Asked by Khelben

This is about the internals of git.

I've been reading the great 'Pro Git' book and learning a little about how git works internally (SHA1, blobs, references, trees, commits, and so on). Pretty clever architecture, by the way.

So, to put this into context: git references the content of a file by its SHA1 value, so it can tell whether specific content has changed just by comparing hash values. But my question is specifically about how git checks whether the content in the working tree has changed.

The naive approach would be to assume that each time you run a command such as git status, git walks through all the files in the working directory, calculates each file's SHA1, and compares it with the one from the last commit. But that seems very inefficient for big projects such as the Linux kernel.
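
To make the naive approach concrete, here is a minimal Python sketch of it. The function names are made up for illustration. It relies on the fact that git hashes a blob as SHA1 over the header "blob <size>\0" followed by the file content, and it ignores any content filters (such as line-ending conversion) that a real repository may apply.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Return the SHA-1 id git would assign to a blob holding `content`."""
    # Git prefixes the raw content with "blob <size>\0" before hashing.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def naive_is_modified(path: str, committed_sha1: str) -> bool:
    """Hypothetical naive check: read and re-hash the whole file every time."""
    with open(path, "rb") as f:
        return git_blob_sha1(f.read()) != committed_sha1
```

Running naive_is_modified on every tracked file means reading every byte of the working tree on each status call, which is exactly the cost the question is worried about.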

Another idea could be to check the last modification date of each file, but I think git is not storing that information (when you clone a repository, all the files get a new time).

I'm sure it's doing it in an efficient way (git is really fast). Does anyone know how that is achieved?

PS: Just to add an interesting link about the git index, which specifically states that the index keeps information about file timestamps, even though the tree objects do not.

Accepted answer by Josh Lee

Git's index maintains timestamps of when git last wrote each file into the working tree (and updates these whenever files are cached from the working tree or from a commit). You can see the metadata with git ls-files --debug. In addition to the timestamp, it records the size, inode, and other information from lstat to reduce the chance of a false positive.

When you run git status, it simply calls lstat on every file in the working tree and compares the metadata in order to quickly determine which files are unchanged. This is described in the documentation under racy-git and update-index.
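
As a rough illustration of the idea (not git's actual code or index format), the stat comparison can be sketched in Python. StatSignature, build_cache, and possibly_modified are hypothetical names, and the set of fields compared is an assumption chosen for the example.

```python
import os
from typing import Dict, Iterable, List, NamedTuple


class StatSignature(NamedTuple):
    """Subset of lstat fields cached per file (illustrative choice of fields)."""
    mtime_ns: int
    ctime_ns: int
    size: int
    ino: int
    mode: int


def stat_signature(path: str) -> StatSignature:
    st = os.lstat(path)  # lstat does not follow symlinks
    return StatSignature(st.st_mtime_ns, st.st_ctime_ns, st.st_size,
                         st.st_ino, st.st_mode)


def build_cache(paths: Iterable[str]) -> Dict[str, StatSignature]:
    """Record the stat data for each path, like an index being refreshed."""
    return {p: stat_signature(p) for p in paths}


def possibly_modified(cache: Dict[str, StatSignature]) -> List[str]:
    """Return paths whose stat data no longer matches the cached values."""
    changed = []
    for path, recorded in cache.items():
        try:
            current = stat_signature(path)
        except FileNotFoundError:
            changed.append(path)  # deleted files count as changed
            continue
        if current != recorded:
            changed.append(path)
    return changed
```

Only the files returned by possibly_modified would need their contents read and hashed; everything else is skipped after a single lstat call.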

Answer by bcorso

On a Unix file system, file metadata is tracked and can be accessed with the lstat call. The stat structure contains multiple timestamps, size information, and more:

struct stat {
    dev_t     st_dev;     /* ID of device containing file */
    ino_t     st_ino;     /* inode number */
    mode_t    st_mode;    /* protection */
    nlink_t   st_nlink;   /* number of hard links */
    uid_t     st_uid;     /* user ID of owner */
    gid_t     st_gid;     /* group ID of owner */
    dev_t     st_rdev;    /* device ID (if special file) */
    off_t     st_size;    /* total size, in bytes */
    blksize_t st_blksize; /* blocksize for file system I/O */
    blkcnt_t  st_blocks;  /* number of 512B blocks allocated */
    time_t    st_atime;   /* time of last access */
    time_t    st_mtime;   /* time of last modification */
    time_t    st_ctime;   /* time of last status change */
};

It seems that initially Git simply relied on this stat structure to decide if a file had been changed (see reference):

When checking if they differ, Git first runs lstat(2) on the files and compares the result with this information

However, a race condition was reported (see racy-git) that occurs if a file is modified in the following manner:

: modify 'foo'
$ git update-index 'foo'
: modify 'foo' again, in-place, without changing its size
                      (and quickly enough not to change its timestamps)

This left the file in a state that was modified but not detectable by lstat.

To fix this issue, in situations where the lstat data is ambiguous, Git now compares the contents of the file to determine whether it has changed.
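
A simplified sketch of that fallback, in Python, might look like the following. It is not git's implementation: IndexEntry, record, and is_modified are hypothetical names, and the rule used here (treat an entry as ambiguous when its mtime is the same as or newer than the time the index was written, then compare contents) is a condensed reading of the racy-git documentation.

```python
import hashlib
import os
from dataclasses import dataclass


def blob_sha1(data: bytes) -> str:
    """SHA-1 of the content with git's "blob <size>\\0" header."""
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()


@dataclass
class IndexEntry:
    """Hypothetical cached data for one file."""
    mtime_ns: int
    size: int
    sha1: str


def record(path: str) -> IndexEntry:
    """Capture stat data and content hash, roughly like 'git update-index'."""
    st = os.lstat(path)
    with open(path, "rb") as f:
        return IndexEntry(st.st_mtime_ns, st.st_size, blob_sha1(f.read()))


def is_modified(path: str, entry: IndexEntry, index_written_ns: int) -> bool:
    st = os.lstat(path)
    if (st.st_mtime_ns, st.st_size) != (entry.mtime_ns, entry.size):
        return True  # stat data differs: definitely worth reporting
    if entry.mtime_ns < index_written_ns:
        return False  # file is older than the index: trust the stat match
    # Ambiguous ("racily clean"): the file was written in the same
    # timestamp tick as the index, so fall back to comparing contents.
    with open(path, "rb") as f:
        return blob_sha1(f.read()) != entry.sha1
```

The content comparison only runs for the small set of files touched around the time the index was written, so the common case stays a cheap lstat.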

NOTE:

If anyone is confused, like I was, about the st_mtime description, which states that it is updated by writes "of more than zero bytes": this refers to the absolute number of bytes written, not the net change in file size.

For example, take a text file containing the single character A: if A is changed to B, there is 0 net change in total byte size, but st_mtime will still be updated (I had to try it myself to verify; use ls -l to see the timestamp).
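
If you want to repeat that experiment without squinting at ls -l output, a small Python sketch along these lines should do it. The function name is made up, and the file's timestamp is deliberately backdated first so the result does not depend on the file system's timestamp granularity.

```python
import os
import tempfile


def overwrite_same_size_updates_mtime() -> bool:
    """Overwrite 'A' with 'B' in place and report whether st_mtime moved."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "file.txt")
        with open(path, "w") as f:
            f.write("A")
        # Backdate atime and mtime to a fixed point in 2001, in nanoseconds.
        old_ns = 10**18
        os.utime(path, ns=(old_ns, old_ns))
        before = os.lstat(path)
        # Overwrite the single byte in place; the size stays at 1 byte.
        with open(path, "r+") as f:
            f.write("B")
        after = os.lstat(path)
        return (before.st_size == after.st_size
                and after.st_mtime_ns > before.st_mtime_ns)
```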
