Original question: http://stackoverflow.com/questions/55602748/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
What does Linus Torvalds mean when he says that Git "never ever" tracks a file?
Asked by Simón Ramírez Amaya
Quoting Linus Torvalds, when asked how many files Git can handle, during his Tech Talk at Google in 2007 (43:09):
…Git tracks your content. It never ever tracks a single file. You cannot track a file in Git. What you can do is you can track a project that has a single file, but if your project has a single file, sure do that and you can do it, but if you track 10,000 files, Git never ever sees those as individual files. Git thinks everything as the full content. All history in Git is based on the history of the whole project…
(Transcripts here.)
Yet, when you dive into the Git book, the first thing you are told is that a file in Git can be either tracked or untracked. Furthermore, it seems to me like the whole Git experience is geared towards file versioning. When using git diff or git status, output is presented on a per-file basis. When using git add, you also get to choose on a per-file basis. You can even review history on a per-file basis, and it is lightning fast.
How should this statement be interpreted? In terms of file tracking, how is Git different from other source control systems, such as CVS?
Answer by bk2204
In CVS, history was tracked on a per-file basis. A branch might consist of various files with their own various revisions, each with its own version number. CVS was based on RCS (Revision Control System), which tracked individual files in a similar way.
On the other hand, Git takes snapshots of the state of the whole project. Files are not tracked and versioned independently; a revision in the repository refers to a state of the whole project, not one file.
When Git refers to tracking a file, it means simply that it is to be included in the history of the project. Linus's talk was not referring to tracking files in the Git context, but was contrasting the CVS and RCS model with the snapshot-based model used in Git.
Answer by torek
I agree with brian m. carlson's answer: Linus is indeed distinguishing, at least in part, between file-oriented and commit-oriented version control systems. But I think there is more to it than that.
In my book, which is stalled and might never get finished, I tried to come up with a taxonomy for version control systems. In my taxonomy, the term for what we're interested in here is the atomicity of the version control system. See what is currently page 22. When a VCS has file-level atomicity, there is in fact a history for each file. The VCS must remember the name of the file and what occurred to it at each point.
Git doesn't do that. Git has only a history of commits: the commit is its unit of atomicity, and the history is the set of commits in the repository. What a commit remembers is the data (a whole tree-full of file names and the contents that go with each of those files) plus some metadata: for instance, who made the commit, when, and why, and the internal Git hash ID of the commit's parent commit. (It is this parent, and the directed acyclic graph formed by reading all commits and their parents, that is the history in a repository.)
Note that a VCS can be commit-oriented, yet still store data file-by-file. That's an implementation detail, though sometimes an important one, and Git does not do that either. Instead, each commit records a tree, with the tree object encoding file names, modes (i.e., is this file executable or not?), and a pointer to the actual file content. The content itself is stored independently, in a blob object. Like a commit object, a blob gets a hash ID that is unique to its content, but unlike a commit, which can only appear once, the blob can appear in many commits. So the underlying file content in Git is stored directly as a blob, and then indirectly in a tree object whose hash ID is recorded (directly or indirectly) in the commit object.
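As a rough illustration of the blob and tree objects described above, here is a minimal Python sketch that computes SHA-1 object IDs the way Git does for loose objects (a header of the form "<kind> <size>" plus a NUL byte, followed by the payload). The helper names (git_object_id, blob_id, tree_id) and the sample file names are made up for this example, and the tree helper only handles a flat directory of plain files, so treat it as a sketch rather than a reimplementation of Git.

```python
import hashlib


def git_object_id(kind, payload):
    """SHA-1 over '<kind> <size>\\0' + payload, as Git does for SHA-1 repositories."""
    header = f"{kind} {len(payload)}".encode() + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


def blob_id(content):
    """ID of a blob: depends only on the bytes, never on a file name."""
    return git_object_id("blob", content)


def tree_id(entries):
    """ID of a flat tree. `entries` maps file name -> (mode, blob id).

    Assumption: ASCII names and no subdirectories, so plain sorting by name
    matches the order Git would use.
    """
    payload = b""
    for name in sorted(entries):
        mode, oid = entries[name]
        payload += f"{mode} {name}".encode() + b"\0" + bytes.fromhex(oid)
    return git_object_id("tree", payload)


readme = blob_id(b"hello\n")
main_v1 = blob_id(b"print('v1')\n")
main_v2 = blob_id(b"print('v2')\n")

# Two snapshots: README is untouched, so both trees point at the same blob ID.
tree_1 = tree_id({"README": ("100644", readme), "main.py": ("100644", main_v1)})
tree_2 = tree_id({"README": ("100644", readme), "main.py": ("100644", main_v2)})

print(readme)
print(tree_1 != tree_2)
```

The unchanged README contributes the same blob ID to both trees, which is the sense in which one blob "can appear in many commits", while the edit to main.py is enough to give the second snapshot a different tree ID.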
When you ask Git to show you a file's history using:
git log [--follow] [starting-point] [--] path/to/file
what Git is really doing is walking the commit history, which is the only history Git has, but not showing you any of these commits unless:
- the commit is a non-merge commit, and
- the parent of that commit also has the file, but the content in the parent differs, or the parent of the commit doesn't have the file at all
(but some of these conditions can be modified via additional git log options, and there's a very difficult-to-describe side effect called History Simplification that makes Git omit some commits from the history walk entirely). The file history you see here does not exactly exist in the repository, in some sense: instead, it's just a synthetic subset of the real history. You'll get a different "file history" if you use different git log options!
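To make the "synthetic subset" idea concrete, here is a small Python sketch of a path-limited history walk. It is a toy model, not what git log actually runs: the Commit class, the file_history function, and the sample commits are all invented for illustration, the history is strictly linear (no merges, no History Simplification), and commits that delete the file are ignored because the two conditions listed above do not cover them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Commit:
    name: str                           # stand-in for a commit hash
    snapshot: dict                      # path -> content ID for the whole project
    parent: Optional["Commit"] = None   # linear history only in this sketch


def file_history(tip, path):
    """Walk every commit from `tip` back to the root, but report only those
    whose version of `path` is new or differs from the parent's version."""
    shown = []
    commit = tip
    while commit is not None:
        here = commit.snapshot.get(path)
        before = commit.parent.snapshot.get(path) if commit.parent else None
        if here is not None and here != before:
            shown.append(commit.name)
        commit = commit.parent
    return shown


c1 = Commit("c1", {"a.txt": "A1"})
c2 = Commit("c2", {"a.txt": "A1", "b.txt": "B1"}, c1)
c3 = Commit("c3", {"a.txt": "A2", "b.txt": "B1"}, c2)
c4 = Commit("c4", {"a.txt": "A2", "b.txt": "B2"}, c3)

print(file_history(c4, "a.txt"))
print(file_history(c4, "b.txt"))
```

Both calls walk the same four commits; only the filter differs. Nothing per-file is stored anywhere in the model, which is the point of the paragraph above.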
Answer by Yakk - Adam Nevraumont
The confusing bit is here:
Git never ever sees those as individual files. Git thinks everything as the full content.
Git often uses 160-bit hashes in place of objects in its own repo. A tree of files is basically a list of names and hashes associated with the content of each (plus some metadata).
But the 160-bit hash uniquely identifies the content (within the universe of the git database). So a tree with hashes as content includes the content in its state.
If you change the state of the content of a file, its hash changes. But if its hash changes, the hash associated with the file name's content also changes. Which in turn changes the hash of the "directory tree".
When a git database stores a directory tree, that directory tree implies and includes all of the content of all of the subdirectories and all of the files in it.
It is organized in a tree structure with (immutable, reusable) pointers to blobs or other trees, but logically it is a single snapshot of the entire content of the entire tree. The representation in the git database isn't the flat data contents, but logically it is all of its data and nothing else.
If you serialized the tree to a filesystem, deleted all .git folders, and told git to add the tree back into its database, you'd end up adding nothing to the database -- the element would already be there.
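The "adding it again adds nothing" property can be sketched with a tiny content-addressed store in Python. This is only a model under stated assumptions: the Store class and its methods are hypothetical, directory trees are serialized as JSON rather than in Git's real tree format, and file modes are left out.

```python
import hashlib
import json


class Store:
    """Toy content-addressed object database (not Git's on-disk format)."""

    def __init__(self):
        self.objects = {}

    def put(self, kind, payload):
        # The ID is derived purely from the kind and the bytes being stored.
        oid = hashlib.sha1(kind.encode() + b"\0" + payload).hexdigest()
        self.objects.setdefault(oid, (kind, payload))
        return oid

    def add_tree(self, node):
        """`node` is bytes (a file) or a dict of name -> node (a directory)."""
        if isinstance(node, bytes):
            return self.put("blob", node)
        entries = {name: self.add_tree(child) for name, child in node.items()}
        return self.put("tree", json.dumps(entries, sort_keys=True).encode())


project = {"README": b"hi\n", "src": {"a.py": b"x = 1\n", "b.py": b"y = 2\n"}}
edited = {"README": b"hi\n", "src": {"a.py": b"x = 100\n", "b.py": b"y = 2\n"}}

store = Store()
root_1 = store.add_tree(project)
count_after_first_add = len(store.objects)    # 3 blobs + 2 trees

root_again = store.add_tree(project)          # same content, same ID, nothing new
count_after_second_add = len(store.objects)

root_2 = store.add_tree(edited)               # one file changed
count_after_edit = len(store.objects)

print(root_1 == root_again, root_1 != root_2)
print(count_after_first_add, count_after_second_add, count_after_edit)
```

Editing one file produces exactly three new objects in this model (the new blob, the new src tree, and the new root tree); README and b.py are shared between both snapshots by ID.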
It may help to think of git's hashes as a reference counted pointer to immutable data.
If you built an application around that, a document is a bunch of pages, which have layers, which have groups, which have objects.
When you want to change an object, you have to create a completely new group for it. If you want to change a group, you have to create a new layer, which needs a new page, which needs a new document.
Every time you change a single object, it spawns a new document. The old document continues to exist. The new and old document share most of their content -- they have the same pages (except 1). That one page has the same layers (except 1). That layer has the same groups (except 1). That group has the same objects (except 1).
And by same, I mean logically a copy, but implementation-wise it is just another reference counted pointer to the same immutable object.
A git repo is a lot like that.
This means that a given git changeset contains its commit message (as a hash code), it contains its work tree, and it contains its parent changes.
Those parent changes contain their parent changes, all the way back.
The part of the git repo that contains history is that chain of changes. That chain of changes is at a level above the "directory" tree -- from a "directory" tree, you cannot uniquely get to a change set and the chain of changes.
To find out what happens to a file, you start with that file in a changeset. That changeset has a history. Often in that history, the same named file exists, sometimes with the same content. If the content is the same, there was no change to the file. If it is different, there is a change, and work needs to be done to work out exactly what.
Sometimes the file is gone; but the "directory" tree might have another file with the same content (same hash code), so we can track it that way (note: this is why you want the commit that moves a file to be separate from the commit that edits it). Or the file name is the same and, after checking, the file is similar enough.
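The "same hash code under a different name" case can be sketched in a few lines of Python. The function name exact_renames and the sample paths and IDs are invented for this example, and the sketch only pairs files whose content IDs match exactly; the similarity check mentioned above for edited files is deliberately left out.

```python
def exact_renames(old, new):
    """Pair paths that vanished from `old` with paths that appeared in `new`
    carrying the same content ID. Both arguments map path -> content ID."""
    removed = {p: oid for p, oid in old.items() if p not in new}
    added_by_id = {}
    for path, oid in sorted(new.items()):
        if path not in old:
            added_by_id.setdefault(oid, []).append(path)

    renames = {}
    for path, oid in sorted(removed.items()):
        candidates = added_by_id.get(oid)
        if candidates:
            renames[path] = candidates.pop(0)   # each new path is used once
    return renames


before = {"lib/util.py": "aaa111", "main.py": "bbb222"}

# Move only: the content ID survives, so the rename is recoverable.
moved = {"lib/helpers.py": "aaa111", "main.py": "bbb222"}

# Move and edit in one commit: the ID changes, so exact matching finds nothing.
moved_and_edited = {"lib/helpers.py": "ccc333", "main.py": "bbb222"}

print(exact_renames(before, moved))
print(exact_renames(before, moved_and_edited))
```

The second call coming back empty is the reason for the advice to keep a move and an edit in separate commits: with only a snapshot to compare, an unchanged content ID is the strongest evidence of a rename.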
So git can patchwork together a "file history".
But this file history comes from efficient parsing of the "entire changeset", not from a link from one version of the file to another.
Answer by Yakk - Adam Nevraumont
"git does not track files" basically means that git's commits consist of a file tree snapshot connecting a path in the tree to a "blob" and a commit graph tracking the history of commits. Everything else is reconstructed on-the-fly by commands like "git log" and "git blame". This reconstruction can be told via various options how hard it should look for file-based changes. The default heuristics can determine when a blob changes place in the file tree without change, or when a file is associated with a different blob than before. The compression mechanisms Git uses don't care a whole lot about blob/file boundaries. If the content is somewhere already, this will keep the repository growth small without associating the various blobs.
Now that is the repository. Git also has a working tree, and in this working tree there are tracked and untracked files. Only the tracked files are recorded in the index (staging area? cache?) and only what is tracked there makes it into the repository.
The index is file-oriented and there are some file-oriented commands for manipulating it. But what ends up in the repository is just commits in the form of file tree snapshots and the associated blob data and the commit's ancestors.
Since Git does not track file histories and renames and its efficiency does not depend on them, sometimes you have to try a few times with different options until Git produces the history/diffs/blames you are interested in for non-trivial histories.
That's different from systems like Subversion, which record rather than reconstruct histories. If it's not on record, you don't get to hear about it.
I actually built a differential installer at one time that just compared release trees by checking them into Git and then producing a script duplicating their effect. Since sometimes whole trees were moved, this produced much smaller differential installers than overwriting/deleting everything would have produced.
Answer by Double Vision Stout Fat Heavy
Git doesn't track a file directly, but tracks snapshots of the repository, and these snapshots happen to consist of files.
Here's a way to look at it.
In other version control systems (SVN, Rational ClearCase), you can right click on a file and get its change history.
In Git, there is no direct command that does this. See this question. You'll be surprised at how many different answers there are. There is no one simple answer because Git doesn't simply track a file, not in the way that SVN or ClearCase does it.
Answer by VonC
Tracking "content", incidentally, is what led Git to not track empty directories.
That is why, if you git rm the last file of a folder, the folder itself gets deleted.
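One way to see why an empty folder has nothing to hang on to: if the tracked state is modeled as a flat mapping from file paths to content, directories exist only as prefixes of those paths. The Python sketch below uses a hypothetical build_tree helper and made-up paths; it is a simplified model of the idea, not Git's index code.

```python
def build_tree(tracked):
    """Nest a flat path -> content mapping into directories.
    A directory appears only because some tracked file lives under it."""
    root = {}
    for path, content in tracked.items():
        *dirs, name = path.split("/")
        node = root
        for d in dirs:
            node = node.setdefault(d, {})
        node[name] = content
    return root


tracked = {"docs/guide.txt": "guide text", "src/main.py": "print('hi')"}
tree_before = build_tree(tracked)

# Removing the last tracked file under docs/ leaves no entry that mentions docs.
del tracked["docs/guide.txt"]
tree_after = build_tree(tracked)

print(sorted(tree_before))
print(sorted(tree_after))
```

In this model there is simply no way to express "docs exists but is empty", which mirrors the behavior described above.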
That wasn't always the case, and only Git 1.4 (May 2006) enforced that "tracking content" policy with commit 443f833:
git status: skip empty directories, and add -u to show all untracked files
By default, we use --others --directory to show uninteresting directories (to get user's attention) without their contents (to unclutter output).
Showing empty directories do not make sense, so pass --no-empty-directory when we do so.
Giving -u (or --untracked) disables this uncluttering to let the user get all untracked files.
That was echoed years later in Jan. 2011 with commit 8fe533, Git v1.7.4:
This is in keeping with the general UI philosophy: git tracks content, not empty directories.
In the meantime, with Git 1.4.3 (Sept. 2006), Git starts limiting untracked content to non-empty folders, with commit 2074cb0:
it should not list the contents of completely untracked directories, but only the name of that directory (plus a trailing '/').
Tracking content is what allowed git blame, very early on (Git 1.4.4, Oct. 2006, commit cee7f24), to be more performant:
More importantly, its internal structure is designed to support content movement (aka cut-and-paste) more easily by allowing more than one paths to be taken from the same commit.
That (tracking content) is also what put git add in the Git API, with Git 1.5.0 (Dec. 2006, commit 366bfcb):
make 'git add' a first class user friendly interface to the index
This brings the power of the index up front using a proper mental model without talking about the index at all.
See for example how all the technical discussion has been evacuated from the git-add man page.
Any content to be committed must be added together.
Whether that content comes from new files or modified files doesn't matter.
You just need to "add" it, either with git-add, or by providing git-commit with -a (for already known files only of course).
That is what made git add --interactive possible, with the same Git 1.5.0 (commit 5cde71d):
After making the selection, answer with an empty line to stage the contents of working tree files for selected paths in the index.
That is also why, to recursively remove all contents from a directory, you need to pass the -r option, not just the directory name as the <path> (still Git 1.5.0, commit 9f95069).
Seeing file content instead of the file itself is what allows merge scenarios like the one described in commit 1de70db (Git v2.18.0-rc0, Apr. 2018):
Consider the following merge with a rename/add conflict:
- side A: modify foo, add unrelated bar
- side B: rename foo->bar (but don't modify the mode or contents)
In this case, the three-way merge of original foo, A's foo, and B's bar will result in a desired pathname of bar with the same mode/contents that A had for foo.
Thus, A had the right mode and contents for the file, and it had the right pathname present (namely, bar).
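The reasoning in that commit message can be traced with a toy three-way rule applied separately to the content and to the pathname. This Python sketch is only an illustration of the quoted scenario: merge_value and the sample snapshots are invented here, the rename is found by exact content match, and it stops short of the actual conflict handling (A's unrelated bar still competes for the same path), which the real merge machinery has to deal with.

```python
def merge_value(base, ours, theirs):
    """Classic three-way rule for one value; None signals a real conflict."""
    if ours == theirs:
        return ours
    if ours == base:
        return theirs      # only their side changed it
    if theirs == base:
        return ours        # only our side changed it
    return None


# path -> content, using placeholder strings instead of blob IDs
base = {"foo": "content-0"}
side_a = {"foo": "content-1", "bar": "unrelated"}   # modify foo, add unrelated bar
side_b = {"bar": "content-0"}                       # rename foo -> bar, unchanged

# Exact-match rename detection on side B: which new path holds base's foo content?
renamed_to = next(
    path for path, content in side_b.items()
    if path not in base and content == base["foo"]
)

merged_content = merge_value(base["foo"], side_a["foo"], side_b[renamed_to])
merged_path = merge_value("foo", "foo", renamed_to)  # A kept the name, B changed it

print(merged_path, merged_content)
```

Content and pathname are merged as two independent values here: A's edit wins for the content, B's rename wins for the path, giving "the desired pathname of bar with the contents A had for foo".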
Commit 37b65ce, Git v2.21.0-rc0, Dec. 2018, recently improved colliding conflict resolutions.
And commit bbafc9c further illustrates the importance of considering file content, by improving the handling for rename/rename(2to1) conflicts:
- Instead of storing files at collide_path~HEAD and collide_path~MERGE, the files are two-way merged and recorded at collide_path.
- Instead of recording the version of the renamed file that existed on the renamed side in the index (thus ignoring any changes that were made to the file on the side of history without the rename), we do a three-way content merge on the renamed path, then store that at either stage 2 or stage 3.
- Note that since the content merge for each rename may have conflicts, and then we have to merge the two renamed files, we can end up with nested conflict markers.