git binary diff 算法(增量存储)是否标准化?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/9478023/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-19 06:32:53  来源:igfitidea点击:

Is the git binary diff algorithm (delta storage) standardized?

gitcompressionbinary-diffvcdiff

提问by Thilo

Git uses a delta compression to store objects that are similar to each-other.

Git 使用增量压缩来存储彼此相似的对象。

Is this algorithm standardized and used in other tools as well? Is there documentation describing the format? Is it compatible with xdelta/VCDIFF/RFC 3284?

该算法是否标准化并在其他工具中使用?是否有描述格式的文档?它是否与 xdelta/VCDIFF/RFC 3284 兼容?

回答by VonC

I think the diff algo used for pack fileswas linked to one of the delta encodingout there: initially (2005) xdelta, and then libXDiff.
But then, as detailed below, it shifted to a custom implementation.

我认为用于包文件的 diff 算法与那里的增量编码之一相关联:最初是 (2005) xdelta,然后是libXDiff
但是,如下详述,它转向了自定义实现。

Anyway, as mentioned here:

无论如何,正如这里提到的

Git does deltification onlyin packfiles.
But when you push via SSH git would generate a pack file with commits the other side doesn't have, and those packs are thin packs, so they also have deltas... but the remote side then adds bases to those thin packs making them standalone.

Git在包文件中进行 deltification。
但是当你通过 SSH 推送时,git 会生成一个包文件,其中包含另一方没有的提交,这些包是瘦包,所以它们也有增量......独立。

(note: creating many packfiles, or retrieving information in huge packfile is costly, and explain why git doesn't handle well huge files or huge repo.
See more at "git with large files")

(注意:创建许多包文件,或在巨大的包文件中检索信息的成本很高,并解释为什么 git 不能很好地处理巨大的文件或巨大的存储库。
更多信息请参见“带有大文件的 git”)

This threadalso reminds us:

这个帖子也提醒我们:

Actually packfiles and deltification (LibXDiff, not xdelta) was, from what I remember and understand, originally because of network bandwidth(which is much more costly than disk space), and I/O performanceof using single mmapped file instead of very large number of loose objects.

实际上,packfiles 和 deltification(LibXDiff,而不是 xdelta)是,据我记忆和理解,最初是因为网络带宽(比磁盘空间成本高得多),以及使用单个 mmapped 文件而不是非常大的数字的I/O 性能松散的物体。

LibXDiff is mentioned in this 2008 thread.

LibXDiff 在这个2008 线程中被提及。

However, since then, the algo has evolved, probably in a custom one, as this 2011 thread illustrates, and as the header of diff-delta.cpoints out:

然而,从那时起,算法已经发展,可能是自定义的,正如这个2011 线程所说明的那样,并且正如标题diff-delta.c指出的那样:

So, strictly speaking, the current code in Git doesn't bear any resemblance with the libxdiff code at all.
However the basic algorithm behind both implementations is the same
.
Studying the libxdiff version is probably easier in order to gain an understanding of how this works.

因此,严格来说,Git 中的当前代码与 libxdiff 代码根本没有任何相似之处。
然而,这两种实现背后的基本算法是相同的

研究 libxdiff 版本可能更容易,以便了解其工作原理。

/*
 * diff-delta.c: generate a delta between two buffers
 *
 * This code was greatly inspired by parts of LibXDiff from Davide Libenzi
 * http://www.xmailserver.org/xdiff-lib.html
 *
 * Rewritten for GIT by Nicolas Pitre <[email protected]>, (C) 2005-2007
 */

More on the packfiles the Git Book:

更多关于包文件的 Git 书

packfile format

包文件格式



Git 2.18 adds to the delta descriptionin this new documentation section, which now (Q2 2018) states:

Git 2.18在这个新文档部分添加了增量描述,现在(2018 年第二季度)指出:

Object types

Valid object types are:

  • OBJ_COMMIT(1)
  • OBJ_TREE(2)
  • OBJ_BLOB(3)
  • OBJ_TAG(4)
  • OBJ_OFS_DELTA(6)
  • OBJ_REF_DELTA(7)

Type 5 is reserved for future expansion. Type 0 is invalid.

Deltified representation

Conceptually there are only four object types: commit, tree, tag and blob.
However to save space, an object could be stored as a "delta" of another "base" object.
These representations are assigned new types ofs-delta and ref-delta, which is only valid in a pack file.

Both ofs-deltaand ref-deltastore the "delta" to be applied to another object (called 'base object') to reconstruct the object.
The difference between them is,

  • ref-delta directly encodes 20-byte base object name.
    • If the base object is in the same pack, ofs-delta encodes the offset of the base object in the pack instead.

The base object could also be deltified if it's in the same pack.
Ref-delta can also refer to an object outside the pack (i.e. the so-called "thin pack"). When stored on disk however, the pack should be self contained to avoid cyclic dependency.

The delta datais a sequence of instructions to reconstruct an object from the base object.
If the base object is deltified, it must be converted to canonical form first. Each instruction appends more and more data to the target object until it's complete.
There are two supported instructions so far:

  • one for copy a byte range from the source object and
  • one for inserting new data embedded in the instruction itself.

Each instruction has variable length. Instruction type is determined by the seventh bit of the first octet. The following diagrams follow the convention in RFC 1951 (Deflate compressed data format).

对象类型

有效的对象类型是:

  • OBJ_COMMIT(1)
  • OBJ_TREE(2)
  • OBJ_BLOB(3)
  • OBJ_TAG(4)
  • OBJ_OFS_DELTA(6)
  • OBJ_REF_DELTA(7)

类型 5 保留用于未来扩展。类型 0 无效。

德尔格表示

从概念上讲,只有四种对象类型:commit、tree、tag 和 blob。
然而,为了节省空间,一个对象可以存储为另一个“基本”对象的“增量”。
这些表示被分配了新的 s-delta 和 ref-delta 类型,它们仅在包文件中有效。

ofs-deltaref-delta存储“增量”要被施加到另一个对象(称为“基础对象”)以重建该对象。
它们之间的区别在于,

  • ref-delta 直接编码 20 字节的基本对象名称。
    • 如果基础对象在同一个包中,ofs-delta 将编码包中基础对象的偏移量。

如果基础对象在同一个包中,它也可以被删除。
Ref-delta 也可以指包外的对象(即所谓的“瘦包”)。然而,当存储在磁盘上时,包应该是自包含的以避免循环依赖。

差分数据是指令序列来重建从基础对象的对象。
如果基础对象被删除,则必须先将其转换为规范形式。每条指令都会向目标对象追加越来越多的数据,直到完成为止。
到目前为止,有两个受支持的指令:

  • 一个用于从源对象复制一个字节范围和
  • 一种用于插入嵌入指令本身的新数据。

每条指令都有可变长度。指令类型由第一个八位字节的第七位决定。下图遵循 RFC 1951(Deflate 压缩数据格式)中的约定。

回答by Thiago de Arruda

Git delta encoding is copy/insert based.

Git delta 编码是基于复制/插入的。

This means that the derived file is encoded as a sequence of opcodes which can represent copy instructions(eg: copy from the base file y bytes starting from offset x into the target buffer) or insert instructions(eg: insert the next x bytes into the target buffer).

这意味着派生文件被编码为一系列操作码,这些操作码可以表示复制指令(例如:从基文件复制 y 个字节,从偏移量 x 开始到目标缓冲区)或插入指令(例如:将下一个 x 字节插入到目标缓冲区中)目标缓冲区)。

As a very simple example(taken from the paper 'File System Support for Delta Compression'), consider that we want to create a delta buffer to transform the text "proxy  cache" into "cache  proxy". The resulting instructions should be:

作为一个非常简单的例子(取自论文“文件系统支持增量压缩”),考虑我们要创建一个增量缓冲区来将文本“代理缓存”转换为“缓存代理”。结果指令应该是:

  1. Copy 5 bytes from offset 7 (copy 'cache' from base buffer)
  2. Insert two spaces
  3. Copy 5 bytes from offset 0 (copy 'proxy' from base buffer)
  1. 从偏移量 7 复制 5 个字节(从基本缓冲区复制“缓存”)
  2. 插入两个空格
  3. 从偏移量 0 复制 5 个字节(从基本缓冲区复制“代理”)

Which translated to git's encoding becomes:

转换为 git 的编码就变成了:

(bytes 1-3 represent the first instruction)

(字节1-3代表第一条指令)

  • 0x91 (10010001), which is split into
    • 0x80 (10000000)(most significant bit set makes this a 'copy from base to output' instruction)
    • 0x01 (00000001)(means 'advance one byte and use it as the base offset)
    • 0x10 (00010000)(advance one byte and use it as length)
  • 0x07 (offset)
  • 0x05 (length)
  • 0x91 (10010001),分为
    • 0x80 (10000000)(设置最高有效位使其成为“从基址到输出的复制”指令)
    • 0x01 (00000001)(意思是'前进一个字节并用它作为基址偏移量)
    • 0x10 (00010000)(提前一字节作为长度)
  • 0x07(偏移)
  • 0x05(长​​度)

(bytes 4-6 represent the second instruction)

(字节 4-6 代表第二条指令)

  • 0x02 (since the MSB is not set, this means 'insert the next two bytes into the output')
  • 0x20 (space)
  • 0x20 (space)
  • 0x02(由于未设置 MSB,这意味着“将接下来的两个字节插入输出中”)
  • 0x20(空格)
  • 0x20(空格)

(bytes 7-8 represent the last instruction)

(字节 7-8 代表最后一条指令)

  • 0x90 (10010000), which is split into
    • 0x80 (10000000)(means 'copy')
    • 0x10 (00010000)(advance one byte and use it as length)
  • 0x05 (length)
  • 0x90 (10010000),分为
    • 0x80 (10000000)(表示“复制”)
    • 0x10 (00010000)(提前一字节作为长度)
  • 0x05(长​​度)

Notice that in the last copy instruction does not specify an offset which means offset 0. Other bits in the copy opcode can also be set when bigger offsets/lengths are needed.

请注意,在最后一条复制指令中没有指定偏移量,这意味着偏移量为 0。当需要更大的偏移量/长度时,也可以设置复制操作码中的其他位。

The result delta buffer has in this example has 8 bytes, which is not much of a compression since the target buffer has 12 bytes, but when this encoding applied to large text files it can make a huge difference.

在这个例子中,delta 缓冲区的结果有 8 个字节,这并不是一个很大的压缩,因为目标缓冲区有 12 个字节,但是当这种编码应用于大文本文件时,它可以产生巨大的差异。

I have recently pushed a node.js libraryto github which implements both diff/patch functions using git delta encoding. The codeshould be more readable and commented than the one in git source, which is heavily optimized.

我最近向 github推送了一个node.js 库,它使用 git delta 编码实现了 diff/patch 函数。的 代码应该更可读的并且比在GIT中源,其被高度优化评价。

I also have written a few teststhat explain the output opcodes used in each example with a format similar to the above.

我还编写了一些 测试来解释每个示例中使用的输出操作码,其格式与上述类似。