原文地址: http://stackoverflow.com/questions/5176225/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Are Git's pack files deltas rather than snapshots?
Asked by Nathan Long
One of the key differences between Git and most other version control systems is that the others tend to store commits as a series of deltas - changesets between one commit and the next. This seems logical, since it's the smallest possible amount of information to store about a commit. But the longer the commit history gets, the more calculation it takes to compare ranges of revisions.
By contrast, Git stores a complete snapshot of the whole project in each revision. The reason this doesn't make the repo size grow dramatically with each commit is each file in the project is stored as a file in the Git subdirectory, named for the hash of its contents. So if the contents haven't changed, the hash hasn't changed, and the commit just points to the same file. And there are other optimizations as well.
All this made sense to me until I stumbled on this information about pack files, into which Git puts data periodically to save space:
In order to save that space, Git utilizes the packfile. This is a format where Git will only save the part that has changed in the second file, with a pointer to the file it is similar to.
Isn't this basically going back to storing deltas? If not, how is it different? How does this avoid subjecting Git to the same problems other version control systems have?
For example, Subversion uses deltas, and rolling back 50 versions means undoing 50 diffs, whereas with Git you can just grab the appropriate snapshot. Unless git also stores 50 diffs in the packfiles... is there some mechanism that says "after some small number of deltas, we'll store a whole new snapshot" so that we don't pile up too large a changeset? How else might Git avoid the disadvantages of deltas?
Answered by Chris Johnsen
Summary:
Git's pack files are carefully constructed to effectively use disk caches and provide “nice” access patterns for common commands and for reading recently referenced objects.
Git's pack file format is quite flexible (see Documentation/technical/pack-format.txt, or The Packfile in The Git Community Book). The pack files store objects in two main ways: “undeltified” (take the raw object data and deflate-compress it), or “deltified” (form a delta against some other object, then deflate-compress the resulting delta data). The objects stored in a pack can be in any order (they do not necessarily have to be sorted by object type, object name, or any other attribute), and deltified objects can be made against any other suitable object of the same type.
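To make the two storage forms concrete, here is a minimal Python sketch. It is only an illustration: the delta encoding below (shared prefix and suffix lengths plus the changed middle) is a made-up toy format, not Git's real delta format, and the names make_delta and apply_delta are hypothetical.

```python
import zlib

def make_delta(base: bytes, target: bytes) -> bytes:
    """Toy delta (NOT Git's format): record how much of the base's start
    and end is shared, and store only the bytes in between."""
    limit = min(len(base), len(target))
    prefix = 0
    while prefix < limit and base[prefix] == target[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and base[-1 - suffix] == target[-1 - suffix]:
        suffix += 1
    middle = target[prefix:len(target) - suffix]
    return f"{prefix} {suffix}\n".encode() + middle

def apply_delta(base: bytes, delta: bytes) -> bytes:
    """Rebuild the full target from its base and the toy delta."""
    header, middle = delta.split(b"\n", 1)
    prefix, suffix = (int(n) for n in header.split())
    return base[:prefix] + middle + base[len(base) - suffix:]

base = b"".join(f"line {i}: some text\n".encode() for i in range(200))
target = base.replace(b"line 100: some text\n", b"line 100: CHANGED\n")

undeltified = zlib.compress(target)                    # raw data, deflated
deltified = zlib.compress(make_delta(base, target))    # delta, then deflated

# The reader always gets the complete object back, whichever form was stored.
restored = apply_delta(base, zlib.decompress(deltified))
```

Either form yields the same complete object on read; the deltified form is simply much smaller when the object closely resembles its base.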
Git's pack-objects command uses several heuristics to provide excellent locality of reference for common commands. These heuristics control both the selection of base objects for deltified objects and the order of the objects. Each mechanism is mostly independent, but they share some goals.
Git does form long chains of delta compressed objects, but the heuristics try to make sure that only “old” objects are at the ends of the long chains. The delta base cache (whose size is controlled by the core.deltaBaseCacheLimit configuration variable) is automatically used and can greatly reduce the number of “rebuilds” required for commands that need to read a large number of objects (e.g. git log -p).
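The effect of a delta base cache can be sketched with a toy model. Everything here is an assumption made for illustration: the ToyPack class, the ("delta", base, keep, extra) tuple layout, and the byte-limited LRU policy are hypothetical and do not reflect Git's actual implementation. The newest version is stored in full and each older version is a delta that removes data from the next newer one.

```python
from collections import OrderedDict

class ToyPack:
    """Toy object store: name -> ("full", data) or ("delta", base, keep, extra)."""

    def __init__(self, objects, cache_limit_bytes):
        self.objects = objects
        self.cache_limit = cache_limit_bytes   # plays the role of a cache size limit
        self.cache = OrderedDict()             # rebuilt objects, least recently used first
        self.cache_bytes = 0
        self.delta_applications = 0            # how many "rebuild" steps were needed

    def read(self, name):
        if name in self.cache:
            self.cache.move_to_end(name)
            return self.cache[name]
        entry = self.objects[name]
        if entry[0] == "full":
            data = entry[1]
        else:
            _, base_name, keep, extra = entry
            base = self.read(base_name)        # walk down the delta chain
            data = base[:keep] + extra
            self.delta_applications += 1
        self._remember(name, data)
        return data

    def _remember(self, name, data):
        self.cache[name] = data
        self.cache_bytes += len(data)
        while self.cache_bytes > self.cache_limit and self.cache:
            _, evicted = self.cache.popitem(last=False)
            self.cache_bytes -= len(evicted)

def build_chain(versions):
    """Newest version is stored in full; each older one is a removal delta."""
    header = b"header\n"
    objects = {f"v{versions - 1}": ("full", header + b"line\n" * (versions - 1))}
    for k in range(versions - 1):
        objects[f"v{k}"] = ("delta", f"v{k + 1}", len(header) + 5 * k, b"")
    return objects

def count_rebuilds(versions, cache_limit_bytes):
    """Read every version from newest to oldest, as a history walk would."""
    pack = ToyPack(build_chain(versions), cache_limit_bytes)
    for k in reversed(range(versions)):
        pack.read(f"v{k}")
    return pack.delta_applications

print(count_rebuilds(50, 1 << 20))  # 49: each delta is applied once
print(count_rebuilds(50, 0))        # 1225: every read replays its whole chain
```

In this model a cache large enough to hold the bases turns a quadratic number of delta applications into a linear one, which is the kind of saving the paragraph above describes.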
Delta Compression Heuristic
A typical Git repository stores a very large number of objects, so it cannot reasonably compare them all to find the pairs (and chains) that will yield the smallest delta representations.
The delta base selection heuristic is based on the idea that the good delta bases will be found among objects with similar filenames and sizes. Each type of object is processed separately (i.e. an object of one type will never be used as the delta base for an object of another type).
For the purposes of delta base selection, the objects are sorted (primarily) by filename and then size. A window into this sorted list is used to limit the number of objects that are considered as potential delta bases. If a “good enough” [1] delta representation is not found for an object among the objects in its window, then the object will not be delta compressed.
The size of the window is controlled by the --window= option of git pack-objects, or the pack.window configuration variable. The maximum depth of a delta chain is controlled by the --depth= option of git pack-objects, or the pack.depth configuration variable. The --aggressive option of git gc greatly enlarges both the window size and the maximum depth to attempt to create a smaller pack file.
The filename sort clumps together the objects for entries with identical names (or at least similar endings (e.g. .c)). The size sort is from largest to smallest so that deltas that remove data are preferred to deltas that add data (since removal deltas have shorter representations) and so that the earlier, larger objects (usually newer) tend to be represented with plain compression.
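The sort-then-window idea can be sketched roughly as follows. This is a simplification under stated assumptions: real Git sorts on a hash derived from the name rather than the reversed basename used here, and the Entry type, sort_key, and candidate_bases are hypothetical names for this illustration only.

```python
from typing import NamedTuple

class Entry(NamedTuple):
    name: str   # object id (short made-up ids in this sketch)
    path: str   # path the object was found at
    size: int   # uncompressed size in bytes

def sort_key(entry):
    basename = entry.path.rsplit("/", 1)[-1]
    # Compare names from the end so identical names and similar endings
    # (e.g. ".c") land next to each other; larger objects come first.
    return (basename[::-1], -entry.size)

def candidate_bases(entries, window):
    """Return the sorted order and, per object, the objects in its window."""
    ordered = sorted(entries, key=sort_key)
    candidates = {}
    for i, entry in enumerate(ordered):
        # Only the `window` objects just before this one are considered.
        candidates[entry.name] = [c.name for c in ordered[max(0, i - window):i]]
    return ordered, candidates

entries = [
    Entry("a", "src/main.c", 300),
    Entry("b", "src/main.c", 500),
    Entry("c", "README", 100),
    Entry("d", "lib/util.c", 200),
]
ordered, candidates = candidate_bases(entries, window=2)
```

Here the two versions of main.c end up adjacent with the larger one first, so the smaller version can be tried as a removal delta against it while the larger one is more likely to stay undeltified.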
[1] What qualifies as “good enough” depends on the size of the object in question and its potential delta base as well as how deep its resulting delta chain would be.
Object Ordering Heuristic
Objects are stored in the pack files in a “most recently referenced” order. The objects needed to reconstruct the most recent history are placed earlier in the pack and they will be close together. This usually works well for OS disk caches.
All the commit objects are sorted by commit date (most recent first) and stored together. This placement and ordering optimizes the disk accesses needed to walk the history graph and extract basic commit information (e.g. git log).
The tree and blob objects are stored starting with the tree from the first stored (most recent) commit. Each tree is processed in a depth first fashion, storing any objects that have not already been stored. This puts all the trees and blobs required to reconstruct the most recent commit together in one place. Any trees and blobs that have not yet been saved but that are required for later commits are stored next, in the sorted commit order.
The final object ordering is slightly affected by the delta base selection in that if an object is selected for delta representation and its base object has not been stored yet, then its base object is stored immediately before the deltified object itself. This prevents likely disk cache misses due to the non-linear access required to read a base object that would have “naturally” been stored later in the pack file.
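The basic ordering described above (commits first by date, then trees and blobs depth first from the most recent commit) can be sketched as below. The pack_order function and the tuple and dict shapes are assumptions for this sketch, and it deliberately leaves out the adjustment that places a delta base immediately before the object deltified against it.

```python
def pack_order(commits, trees):
    """commits: list of (commit_id, date, root_tree_id).
    trees: tree_id -> list of child ids (subtrees or blobs).
    Returns object ids in the order they would be written in this toy model."""
    by_date = sorted(commits, key=lambda c: c[1], reverse=True)
    order = [commit_id for commit_id, _, _ in by_date]   # all commits together
    seen = set(order)

    def visit(obj_id):
        if obj_id in seen:          # already stored for a more recent commit
            return
        seen.add(obj_id)
        order.append(obj_id)
        for child in trees.get(obj_id, []):   # blobs have no children
            visit(child)

    for _, _, root in by_date:      # most recent commit's tree first
        visit(root)
    return order

commits = [("c1", 1, "t1"), ("c2", 2, "t2")]
trees = {
    "t1": ["readme_v1", "t_src"],
    "t2": ["readme_v2", "t_src"],
    "t_src": ["main_c"],
}
print(pack_order(commits, trees))
# ['c2', 'c1', 't2', 'readme_v2', 't_src', 'main_c', 't1', 'readme_v1']
```

Everything needed for the newest commit (t2, readme_v2, t_src, main_c) sits together near the front, and only the objects unique to the older commit follow.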
Answered by Greg Hewgill
The use of delta storage in the pack file is just an implementation detail. At that level, Git doesn't know why or how something changed from one revision to the next, rather it just knows that blob B is pretty similar to blob A except for these changes C. So it will only store blob A and changes C (if it chooses to do so - it could also choose to store blob A and blob B).
When retrieving objects from the pack file, the delta storage is not exposed to the caller. The caller still sees complete blobs. So, Git works the same way it always has without the delta storage optimisation.
Answered by VonC
As I mentioned in "What are git's thin packs?"
Git does deltification only in packfiles
I detailed the delta encoding used for pack files in "Is the git binary diff algorithm (delta storage) standardized?".
See also "When and how does git use deltas for storage?".
Note that the default value of the core.deltaBaseCacheLimit config, which controls the size of the delta base cache, will soon be bumped from 16MB to 96MB, for Git 2.0.x/2.1 (Q3 2014).
See commit 4874f54 by David Kastrup (May 2014):
Bump core.deltaBaseCacheLimit to 96m
The default of 16m causes serious thrashing for large delta chains combined with large files.
Here are some benchmarks (pu variant of git blame):
time git blame -C src/xdisp.c >/dev/null
for a repository of Emacs repacked with git gc --aggressive (v1.9, resulting in a window size of 250) located on an SSD drive.
The file in question has about 30000 lines, is about 1 MB in size, and has a history of about 2500 commits.
16m (previous default):
real 3m33.936s
user 2m15.396s
sys 1m17.352s
96m:
real 2m5.668s
user 1m50.784s
sys 0m14.288s