How does Git save space and stay fast at the same time?

Notice: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/2869213/

Time: 2020-09-19 04:17:15  Source: igfitidea

How does Git save space and is fast at the same time?

Tags: git, version-control, compression, performance, github

Asked by Lazer

I just saw the first Git tutorial at http://blip.tv/play/Aeu2CAI.

How does Git store all the versions of all the files, and how can it still be more economical in space than Subversion, which saves only the latest version of the code?

I know this can be done using compression, but that would come at the cost of speed. Yet the tutorial also says that Git is much faster (though its biggest gain comes from the fact that most of its operations are offline).

So, my guess is that

  • Git compresses data extensively
  • It is still faster because uncompression + work is still faster than network_fetch + work

Am I correct? Even close?

Answered by Jakub Narębski

I assume you are asking how it is possible for a git clone (full repository + checkout) to be smaller than checked-out sources in Subversion. Or did you mean something else?

This question is answered in the comments

Repository size

First, you should take into account that alongside the checkout (working version), Subversion stores a pristine copy (the last version) in those .svn subdirectories. The pristine copy is stored uncompressed in Subversion.

Second, git uses the following techniques to make repository smaller:

  • each version of a file is stored only once; this means that if you have only two different versions of some file in 10 revisions (10 commits), git stores only those two versions, not 10.
  • objects (and deltas, see below) are stored compressed; text files used in programming compress really well (around 60% of original size, or 40% reduction in size from compression)
  • after repacking, objects are stored in deltified form, as a difference from some other version; additionally git tries to order delta chains in such a way that the delta consists mainly of deletions (in the usual case of growing files it is in recency order); IIRC deltas are compressed as well.
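
The first two points above (each distinct version stored once, and stored compressed) can be illustrated with a minimal sketch. It uses only Python's standard library; `ObjectStore`, `put`, and `get` are hypothetical names chosen for illustration, not Git's API. The only Git-specific detail is how a blob id is derived (SHA-1 over a `blob <size>\0` header followed by the content). Packing and deltas are not modelled here.

```python
import hashlib
import zlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object id Git assigns to a blob with this content."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


class ObjectStore:
    """Toy content-addressed store (hypothetical, for illustration only)."""

    def __init__(self):
        self.objects = {}  # object id -> zlib-compressed content

    def put(self, content: bytes) -> str:
        oid = git_blob_id(content)
        # Identical content hashes to the same id, so it is stored only once.
        self.objects.setdefault(oid, zlib.compress(content))
        return oid

    def get(self, oid: str) -> bytes:
        return zlib.decompress(self.objects[oid])


store = ObjectStore()
v1 = b"int main(void) { return 0; }\n" * 50
v2 = v1 + b"/* a later change */\n"

# Ten "revisions" in which the file only ever has two distinct contents.
revisions = [store.put(v1 if i < 6 else v2) for i in range(10)]
print(len(revisions), "revisions,", len(store.objects), "stored objects")
```

Running it reports 10 revisions but only 2 stored objects, because a revision that does not change the file simply refers to an object id that already exists.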

Performance (speed of operations)

First, any operation that involves the network is much slower than a local operation. Therefore, for example, comparing the current state of the working area with some other version, or getting a log (a history), which in Subversion involves a network connection and network transfer but in Git is a local operation, is of course much slower in Subversion than in Git. By the way, this is a difference between centralized version control systems (using a client-server workflow) and distributed version control systems (using a peer-to-peer workflow) in general, not only between Subversion and Git.

Second, if I understand it correctly, nowadays the limitation is not CPU but IO (disk access). Therefore it is possible that the gain from having to read less data from disk because of compression (and being able to mmap it in memory) overcomes the loss from having to decompress data.

Third, Git was designed with performance in mind (see e.g. the GitHistory page on the Git Wiki):

  • The index stores stat information for files, and Git uses it to decide, without examining the files' contents, whether they were modified (see e.g. the core.trustctime config variable).
  • The maximum delta depth is limited to pack.depth, which defaults to 50. Git has delta cache to speed up access. There is (generated) packfile index for fast access to objects in packfile.
  • Git takes care to not touch files it doesn't have to. For example when switching branches, or rewinding to another version, Git updates only files that changed. The consequence of this philosophy is that Git does support only very minimal keyword expansion (at least out of the box).
  • Git uses its own version of the LibXDiff library, nowadays also for diff and merge, instead of calling an external diff / external merge tool.
  • Git tries to minimize latency, which means good perceived performance. For example, it outputs the first page of "git log" as fast as possible, and you see it almost immediately, even if generating the full history would take more time; it doesn't wait for the full history to be generated before displaying it.
  • When fetching new changes, Git checks what objects you have in common with the server, and sends only (compressed) differences in the form of thin packfile. Admittedly Subversion can (or perhaps by default it does) also send only differences when updating.
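
The idea behind the first point in the list above (deciding from stat information whether a file might have changed, without reading it) can be sketched as follows. This is a simplification under stated assumptions: `StatCache`, `record`, and `maybe_modified` are hypothetical names, and only the modification time and size are compared, whereas Git's real index records more fields and has extra handling for timestamps that are too recent to trust.

```python
import os
import tempfile


def stat_signature(path):
    """Return cheap metadata that normally changes when a file is rewritten."""
    st = os.stat(path)
    return (st.st_mtime_ns, st.st_size)


class StatCache:
    """Toy stand-in for the stat part of an index (hypothetical)."""

    def __init__(self):
        self.entries = {}  # path -> (mtime_ns, size) at the time of recording

    def record(self, path):
        self.entries[path] = stat_signature(path)

    def maybe_modified(self, path):
        # Only metadata is compared; the file content is never opened.
        return self.entries.get(path) != stat_signature(path)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "file.txt")
        with open(path, "w") as f:
            f.write("first version\n")
        cache = StatCache()
        cache.record(path)
        before = cache.maybe_modified(path)
        with open(path, "a") as f:
            f.write("an extra line\n")
        after = cache.maybe_modified(path)
        return before, after


print(demo())
```

A file whose signature still matches can be skipped entirely, which is why a status check over a large tree mostly costs one `stat` call per file rather than a full read.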

I am not a Git hacker, and I probably missed some techniques and tricks that Git uses for better performance. Note however that Git relies heavily on POSIX features (such as memory-mapped files) for that, so the gain might not be as large on MS Windows.

Answered by VonC

Not a complete answer, but those comments (from AlBlue) might help on the space management aspect of the question:

There's a couple of things worth clarifying here.

Firstly, it is possible to have a bigger Git repository than an SVN repository; I hope I didn't imply that that was never the case. However, in practice, it generally tends to be the case that a Git repository takes less space on disk than an equivalent SVN repository would.
One thing you cite is Apache's single SVN repository, which is obviously massive. However, one only has to look at git.apache.org, and you'll note that each Apache project has its own Git repository. What's really needed is a comparison of like-for-like; in other words, a checkout of the (abdera) SVN project vs the clone of the (abdera) Git repository.

I was able to check out git://git.apache.org/abdera.git. On disk, it consumed 28.8Mb.
I then checked out the SVN version http://svn.apache.org/repos/asf/abdera/java/trunk/, and it consumed 34.3Mb.
Both numbers were taken from a separately mounted partition in RAM space, and the number quoted was the number of bytes taken from the disk.
If using du -sh as a means of testing, the Git checkout was 11Mb and the SVN checkout was 17Mb.

The Git version of Apache Abdera would let me work with any version of the history up to and including the current release; the SVN would only have the backup of the currently checked out version. Yet it takes less space on disk.

How, you may ask?

Well, for one thing, SVN creates a lot more files. The SVN checkout has 2959 files; the corresponding Git repository has 845 files.

Secondly, whilst SVN has an .svn folder at each level of the hierarchy, a Git repo only has a single .git repository at the top level. This means (amongst other things) that renames from one dir to another have relatively smaller impact in Git than in SVN, which, admittedly, already has relatively small impact anyway.

Thirdly, Git stores its data as compressed objects, whereas SVN stores them as uncompressed copies. Go into any .svn/text-base directory, and you'll find uncompressed copies of the (base) files.
Git has a mechanism to compress all files (and indeed, all history) into pack files. In Abdera's case, .git/objects/pack/ has a single .pack file (containing all history) in a 4.8Mb file.
So the size of the repository is (roughly) the same size as the current checked out code in this case, though I wouldn't expect that always to be the case.

Anyway, you're right that history can grow to be more than the total size of the current checkout; but because of the way that SVN works, it really has to approach twice the size in order to make much of a difference. Even then, disk space reduction is not really the main reason to use a DVCS anyway; it's an advantage for some things, sure, but it's not the real reason why people use it.

Note that Git (and Hg, and other DVCSs) do suffer from a problem where (large) binaries are checked in, then deleted, as they'll still show up in the repository and take up space, even if they're not current. The text compression takes care of these kind of things for text files, but binary ones are more of an issue. (There are administrative commands that can update the contents of Git repositories, but they have slightly higher overhead/administrative cost than CVS; git filter-branch is like svnadmin dump/filter/load.)

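
The file counts and sizes quoted above could be reproduced with something like the sketch below. `tree_stats` is a hypothetical helper, and it sums apparent file sizes, so its totals will not exactly match `du -sh` or a per-partition measurement, which count allocated disk blocks.

```python
import os


def tree_stats(root):
    """Return (file_count, total_bytes) for regular files under root."""
    count = 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symlinks so targets are not counted twice
            count += 1
            total += os.path.getsize(path)
    return count, total


# Example usage with placeholder directory names (assumptions, not real paths):
#   print(tree_stats("abdera-git"))   # Git clone, including .git
#   print(tree_stats("abdera-svn"))   # SVN checkout, including .svn folders
```

Because `os.walk` descends into hidden directories as well, the `.git` and `.svn` metadata are included in both the file count and the byte total, which is what the comparison above is about.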



As for the speed aspect, I mentioned it in my answer to "How fast is git over subversion with remote operations?" (as Linus said in his Google presentation, paraphrasing here: "anything involving the network will just kill the performance").

And the GitBenchmark document mentioned by Jakub Narębski is a good addition, even though it doesn't deal directly with Subversion.
It does list the kind of operation you need to monitor on a DVCS performance-wise.

Other Git benchmarks are mentioned in this SO question.
