Disclaimer: this page is a Chinese-English translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Stack Overflow original: http://stackoverflow.com/questions/8198105/

Date: 2020-09-10 12:21:31  Source: igfitidea

How does git store files?

git

Asked by mteffaha

I just started learning git and to do so I started reading the Git Community Book, and in this book they say that SVN and CVS store the difference between files and that git stores a snapshot of all the files.

But I didn't really get what they mean by snapshot. Does git really make a copy of all the files in each commit? That's what I understood from their explanation.

PS: If anyone has a better source for learning git, I would appreciate it.

Answered by VonC

Git does include for each commit a full copy of all the files, except that, for the content already present in the Git repo, the snapshot will simply point to said content rather than duplicate it.
That also means that several files with the same content are stored only once.
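To make the "point to existing content instead of duplicating it" idea concrete, here is a minimal sketch in Python. It is a toy in-memory model, not Git's real object format: the `ToyRepo` class and its methods are hypothetical names, and the content id is a plain SHA-1 of the bytes (real Git adds a header before hashing).

```python
import hashlib


class ToyRepo:
    """Hypothetical in-memory model of snapshot storage; not Git's real format."""

    def __init__(self):
        self.objects = {}   # content id -> bytes; each distinct content is stored once
        self.commits = []   # each commit is a manifest: path -> content id

    def commit(self, files):
        """Record a full snapshot of `files` (a dict of path -> bytes)."""
        manifest = {}
        for path, content in files.items():
            oid = hashlib.sha1(content).hexdigest()
            # Content that is already stored is only referenced, not copied.
            self.objects.setdefault(oid, content)
            manifest[path] = oid
        self.commits.append(manifest)
        return len(self.commits) - 1

    def checkout(self, commit_index):
        """Rebuild every file of a snapshot from its manifest."""
        manifest = self.commits[commit_index]
        return {path: self.objects[oid] for path, oid in manifest.items()}


repo = ToyRepo()
first = repo.commit(
    {"a.txt": b"hello\n", "copy_of_a.txt": b"hello\n", "b.txt": b"world\n"}
)
second = repo.commit(
    {"a.txt": b"hello\n", "copy_of_a.txt": b"hello\n", "b.txt": b"world!\n"}
)
print(len(repo.objects), "distinct contents stored for", len(repo.commits), "snapshots")
```

Each commit still lists every file, so any snapshot can be rebuilt on its own, yet the two identical files and the unchanged file cost no extra storage.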

So a snapshot is basically a commit, referring to the content of a directory structure.

Some good references are:

You tell Git you want to save a snapshot of your project with the git commit command and it basically records a manifest of what all of the files in your project look like at that point

Lab 12 illustrates how to get previous snapshots



The Pro Git book has a more comprehensive description of a snapshot:

The major difference between Git and any other VCS (Subversion and friends included) is the way Git thinks about its data.
Conceptually, most other systems store information as a list of file-based changes. These systems (CVS, Subversion, Perforce, Bazaar, and so on) think of the information they keep as a set of files and the changes made to each file over time

Figure: delta-based VCS

Git doesn't think of or store its data this way. Instead, Git thinks of its data more like a set of snapshots of a mini filesystem.
Every time you commit, or save the state of your project in Git, it basically takes a picture of what all your files look like at that moment and stores a reference to that snapshot.
To be efficient, if files have not changed, Git doesn't store the file again—just a link to the previous identical file it has already stored.
Git thinks about its data more like this:

Figure: snapshot-based VCS

This is an important distinction between Git and nearly all other VCSs. It makes Git reconsider almost every aspect of version control that most other systems copied from the previous generation. This makes Git more like a mini filesystem with some incredibly powerful tools built on top of it, rather than simply a VCS.



Jan Hudec adds this important comment:

While that's true and important on the conceptual level, it is NOT true at the storage level.
Git does use deltas for storage.
Not only that, but it is more efficient at it than any other system. Because it does not keep per-file history, when it wants to do delta compression it takes each blob, selects some blobs that are likely to be similar (using heuristics that include the closest approximation of the previous version and some others), tries to generate the deltas, and picks the smallest one. This way it can (often, depending on the heuristics) take advantage of other similar files or of older versions that are more similar than the previous one. The "pack window" parameter allows trading performance for delta compression quality. The default (10) generally gives decent results, but when space is limited or to speed up network transfers, git gc --aggressive uses a value of 250, which makes it run very slowly but provides extra compression for history data.
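The "try several candidates, keep the smallest delta" step can be sketched roughly as follows. This is only an illustration under loud assumptions: `delta_size`, `pick_base`, and `sample_text` are hypothetical helpers, deflate with a preset dictionary stands in for a real delta encoder, and Git's actual candidate heuristics and pack delta format are different.

```python
import hashlib
import zlib


def delta_size(base, target):
    """Approximate the size of `target` when encoded relative to `base`.

    Assumption: deflate with `base` as a preset dictionary is used as a
    stand-in for a delta encoder; this is not Git's pack delta format.
    """
    compressor = zlib.compressobj(level=9, zdict=base)
    return len(compressor.compress(target) + compressor.flush())


def pick_base(target, candidates, window=10):
    """Try at most `window` candidates; return (index, size) of the smallest delta."""
    sizes = [
        (delta_size(base, target), index)
        for index, base in enumerate(candidates[:window])
    ]
    size, index = min(sizes)
    return index, size


def sample_text(seed, lines=50):
    """Deterministic filler text with little internal repetition."""
    return b"".join(
        hashlib.sha1(f"{seed}-{i}".encode()).hexdigest().encode() + b"\n"
        for i in range(lines)
    )


old_version = sample_text("v1")
unrelated = sample_text("other")
new_version = old_version + b"one more line\n"

best_index, best_size = pick_base(new_version, [unrelated, old_version])
print("best base:", best_index, "approximate delta size:", best_size)
```

A larger `window` means more candidates are tried per blob, which is the performance-for-compression trade-off the comment describes.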

Answered by svick

Git logically stores each file under its SHA1. What this means is if you have two files with exactly the same content in a repository (or if you rename a file), only one copy is stored.
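As a small illustration, the id of a file's content (a "blob") can be reproduced in a few lines of Python. This sketch assumes the default SHA-1 object format; `git_blob_id` is a name made up for this example, and its result should match what `git hash-object` prints for the same bytes.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 id Git assigns to a blob with this content."""
    # Git hashes a short header ("blob <size>" plus a NUL byte) followed by the content.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# The file name is not part of the hash, so identical content under two
# different names (or a renamed file) maps to the same stored object.
readme = b"hello world\n"
renamed_copy = b"hello world\n"
print(git_blob_id(readme) == git_blob_id(renamed_copy))
```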

But this also means that when you modify a small part of a file and commit, another copy of the file is stored. The way git solves this is using pack files. Once in a while, all the “loose” files (actually, not just files, but objects containing commit and directory information too) from a repo are gathered and compressed into a pack file. The pack file is compressed using zlib. And similar files are also delta-compressed.
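The "another copy is stored" part can be sketched like this. The example assumes the SHA-1 object format and the usual loose-object layout (a two-character directory followed by the remaining 38 characters of the id); `loose_object` is a hypothetical helper that works purely in memory and never touches a real repository.

```python
import hashlib
import zlib


def loose_object(content: bytes):
    """Return (relative path, compressed bytes) for a blob stored as a loose object."""
    store = b"blob %d\x00" % len(content) + content
    oid = hashlib.sha1(store).hexdigest()
    path = f"objects/{oid[:2]}/{oid[2:]}"
    # Each loose object is compressed with zlib on its own, with no delta
    # against any other object.
    return path, zlib.compress(store)


original = b"line\n" * 1000
edited = original + b"one more line\n"

path_a, data_a = loose_object(original)
path_b, data_b = loose_object(edited)

# A one-line edit produces a different id, so a second full object is written.
print(path_a)
print(path_b)
```

Both objects hold the complete file content; only later, when they are gathered into a pack file, can one be stored as a delta against the other.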

The same format is also used when pulling or pushing (at least with some protocols), so those files don't have to be recompressed again.

The result of this is that a git repository, containing the whole uncompressed working copy, uncompressed recent files, and compressed older files, is usually relatively small, two times smaller than the size of the working copy. And this means it's smaller than an SVN repo with the same files, even though SVN doesn't store the history locally.
