`git diff --patience` 和 `git diff --histogram` 有什么区别?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/32365271/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-19 11:20:30  来源:igfitidea点击:

What's the difference between `git diff --patience` and `git diff --histogram`?

gitalgorithmdiffgit-diff

提问by Stuart P. Bentley

This earlier questionasked for the differences between 4 different Git diff strategies, but the only difference that was explained was the difference between myersand patience, which is pretty well explained elsewhere.

这个较早的问题询问了 4 种不同 Git 差异策略之间的差异,但解释的唯一差异是myers和之间的差异patience,这在别处得到了很好的解释。

How does the histogramstrategy work? What differentiates it from patience? The git-diff man pageonly says that it "extends the patience algorithm to "support low-occurrence common elements"." Other pages mention that it's faster, and that it comes from JGit, but they don't explain where or how its algorithm or results will differ from patience.

histogram策略如何运作?它与什么区别patience?在GIT-DIFF手册页只能说,它“伸出耐心算法‘支持低发生共同的元素’。” 其他页面提到它更快,并且它来自 JGit,但他们没有解释它的算法或结果将在何处或如何与patience.

Where can I find a description of the histogramalgorithm relative to the patiencealgorithm, with the same level of detail as Bram Cohen's original description of the patiencealgorithm?

我在哪里可以找到与histogram算法相关的patience算法描述,其详细程度与Bram Cohen 对patience算法的原始描述相同

(If it's just a matter of implementation performance with no case that will produce different results, why wasn't it just implemented as a new backend for patience?)

(如果这只是一个实现性能的问题,没有任何情况会产生不同的结果,为什么不只是作为一个新的后端来实现patience?)

回答by VonC

This histogram strategy was introduced in git 1.7.7 (Sept 2011), with the following description (as mentioned by the OP)

这种直方图策略是在git 1.7.7 (Sept 2011) 中引入的,具有以下描述(如 OP 所述)

"git diff" learned a "--histogram" option to use a different diff generation machinery stolen from jgit, which might give better performance.

" git diff"学习了" " --histogram" 使用从jgit窃取的不同差异生成机制的选项,这可能会提供更好的性能。

JGit includes src/org/eclipse/jgit/diff/HistogramDiff.javaand tst/org/eclipse/jgit/diff/HistogramDiffTest.java

JGit 包括src/org/eclipse/jgit/diff/HistogramDiff.javatst/org/eclipse/jgit/diff/HistogramDiffTest.java

The description there is fairly complete:

那里的描述相当完整:

HistogramDiff

An extended form of Bram Cohen's patience diff algorithm.

This implementation was derived by using the 4 rules that are outlined in Bram Cohen's blog, and then was further extended to support low-occurrence common elements.

The basic idea of the algorithm is to create a histogram of occurrences for each element of sequence A. Each element of sequence B is then considered in turn. If the element also exists in sequence A, and has a lower occurrence count, the positions are considered as a candidate for the longest common subsequence (LCS).
After scanning of B is complete the LCS that has the lowest number of occurrences is chosen as a split point. The region is split around the LCS, and the algorithm is recursively applied to the sections before and after the LCS.

By always selecting a LCS position with the lowest occurrence count, this algorithm behaves exactly like Bram Cohen's patience diff whenever there is a unique common element available between the two sequences.
When no unique elements exist, the lowest occurrence element is chosen instead
.
This offers more readable diffs than simply falling back on the standard Myers' O(ND)algorithm would produce.

To prevent the algorithm from having an O(N^2)running time, an upper limit on the number of unique elements in a histogram bucket is configured by #setMaxChainLength(int).
If sequence A has more than this many elements that hash into the same hash bucket, the algorithm passes the region to #setFallbackAlgorithm(DiffAlgorithm).
If no fallback algorithm is configured, the region is emitted as a replace edit.

During scanning of sequence B, any element of A that occurs more than #setMaxChainLength(int)times is never considered for an LCS match position, even if it is common between the two sequences. This limits the number of locations in sequence A that must be considered to find the LCS,and helps maintain a lower running time bound.

So long as #setMaxChainLength(int)is a small constant (such as 64), the algorithm runs in O(N * D)time, where Nis the sum of the input lengths and Dis the number of edits in the resulting EditList.
If the supplied SequenceComparatorhas a good hash function, this implementation typically out-performs MyersDiff, even though its theoretical running time is the same.

This implementation has an internal limitation that prevents it from handling sequences with more than 268,435,456 (2^28) elements

直方图差异

Bram Cohen 的耐心差异算法的扩展形式。

此实现是通过使用Bram Cohen 的博客中概述的 4 条规则派生而来 ,然后进一步扩展以支持低发生率的公共元素。

该算法的基本思想是为序列 A 的每个元素创建一个出现次数的直方图。然后依次考虑序列 B 的每个元素。如果该元素也存在于序列 A 中,并且出现次数较少,则这些位置被视为最长公共子序列 (LCS) 的候选者。
完成 B 的扫描后,选择出现次数最少的 LCS 作为分割点。该区域围绕 LCS 进行分割,并将算法递归应用于 LCS 前后的部分。

通过始终选择出现次数最少的 LCS 位置,只要两个序列之间存在唯一的公共元素,该算法的行为就与 Bram Cohen 的耐心差异完全相同。
当不存在唯一元素时,将选择出现次数最少的元素

这提供了比简单地退回到标准 MyersO(ND)算法会产生的更具可读性的差异。

为了防止算法有O(N^2)运行时间,直方图桶中唯一元素的数量上限由 配置#setMaxChainLength(int)
如果序列 A 有多个元素散列到同一个散列桶中,则算法将该区域传递给#setFallbackAlgorithm(DiffAlgorithm)
如果未配置回退算法,则该区域将作为替换编辑发出。

在扫描序列 B 期间,A 中出现#setMaxChainLength(int)多次的任何元素都不会被视为 LCS 匹配位置,即使它在两个序列之间是通用的。这限制了在序列 A 中寻找 LCS 必须考虑的位置数量,并有助于保持较低的运行时间界限。

只要#setMaxChainLength(int)是一个小常数(例如 64),算法就会O(N * D)及时运行,其中N是输入长度的总和,D是结果 中的编辑次数EditList
如果所提供SequenceComparator的散列函数具有良好的散列函数MyersDiff,则即使其理论运行时间相同,此实现通常也会优于 。

此实现有一个内部限制,无法处理超过 268,435,456 (2^28) 个元素的序列



Note that this kind of algo was already used for pack_check, back in 2006 (git 1.3), for git-verify-pack -v. It was reused for index-pack in git 1.7.7

请注意,这种算法早在 2006 年 (git 1.3) 就已用于 pack_check,用于git-verify-pack -v. 它被重用于 git 1.7.7 中的 index-pack



Commit 8c912eeactually introduced --histogramto diff:

Commit 8c912ee实际上引入--histogram了diff:

Port JGit's HistogramDiff algorithm over to C. Rough numbers (TODO) show that it is faster than its --patiencecousin, as well as the default Meyers algorithm.

The implementation has been reworked to use structs and pointers, instead of bitmasks, thus doing away with JGit's 2^28line limit.

We also use xdiff's default hash table implementation (xdl_hash_bits()with XDL_HASHLONG()) for convenience.

将 JGit 的 HistogramDiff 算法移植到 C。粗略数字 (TODO) 表明它比它的--patience表亲以及默认的 Meyers 算法更快。

该实现已被重新设计以使用结构体和指针,而不是位掩码,从而取消了 JGit 的2^28行限制

为方便起见,我们还使用了xdiff默认的哈希表实现(xdl_hash_bits()with XDL_HASHLONG())。

commit 8555123 (git 1.7.10, April 2012)added:

提交 8555123(git 1.7.10,2012 年 4 月)添加:

8c912ee (teach --histogramto diff, 2011-07-12) claimed histogram diff was faster than both Myers and patience.

We have since incorporated a performance testing framework, so add a test that compares the various diff tasks performed in a real 'log -p' workload.
This does indeed show that histogram diff slightly beats Myers, while patience is much slower than the others.

8c912ee (teach --histogramto diff, 2011-07-12) 声称直方图差异比 Myers 和耐心都快。

我们已经合并了一个性能测试框架,因此添加一个测试来比较在真实“ log -p”工作负载中执行的各种差异任务。
这确实表明直方图差异略胜于迈尔斯,而耐心比其他人慢得多。

Finally, commit 07ab4de (git 1.8.2, March 2013)add

最后,提交 07ab4de(git 1.8.2,2013 年 3 月)添加

config: Introduce diff.algorithm variable

config: 引入 diff.algorithm 变量

Some users or projects prefer different algorithms over others, e.g. patience over myers or similar.
However, specifying appropriate argument every time diff is to be used is impractical. Moreover, creating an alias doesn't play nicely with other tools based on diff (git-showfor instance).

Hence, a configuration variable which is able to set specific algorithm is needed.
For now, these four values are accepted:

  • 'myers' (which has the same effect as not setting the config variable at all),
  • 'minimal',
  • 'patience' and
  • 'histogram'.

一些用户或项目比其他用户或项目更喜欢不同的算法,例如耐心而不是迈尔斯或类似的。
但是,每次使用 diff 时都指定适当的参数是不切实际的。此外,创建别名不能很好地与基于 diff 的其他工具一起使用(git-show例如)。

因此,需要一个能够设置特定算法的配置变量。
目前,接受这四个值:

  • ' myers'(与根本不设置配置变量具有相同的效果),
  • ' minimal',
  • ' patience' 和
  • ' histogram'。

Commit 07924d4added concurrently the --diff-algorithmcommand line option.
As the OP Stuart P. Bentleymentions in the comments:

Commit 07924d4同时添加了--diff-algorithm命令行选项。
正如 OP Stuart P. Bentley在评论中提到

you can configure Git to use histogram by defaultwith:

您可以将 Git 配置为默认使用直方图

git config --global diff.algorithm histogram


Update: Git 2.12 (Q1 2017) will retire the "fast hash" that had disastrous performance issues in some corner cases.

更新:Git 2.12(2017 年第一季度)将淘汰在某些极端情况下具有灾难性性能问题的“快速哈希”。

See commit 1f7c926(01 Dec 2016) by Jeff King (peff). (Merged by Junio C Hamano -- gitster--in commit 731490b, 19 Dec 2016)

请参阅Jeff King ( ) 的commit 1f7c926(01 Dec 2016 )(由Junio C Hamano合并-- --commit 731490b,2016 年 12 月 19 日)peffgitster

xdiff: drop XDL_FAST_HASH

The xdiffcode hashes every line of both sides of a diff, and then compares those hashes to find duplicates. The overall performance depends both on how fast we can compute the hashes, but also on how many hash collisions we see.

The idea of XDL_FAST_HASHis to speed up the hash computation.
But the generated hashes have worse collision behavior. This means that in some cases it speeds diffs up (running "git log -p" on git.gitimproves by ~8%with it), but in others it can slow things down. One pathological case saw over a 100x slowdown.

There may be a better hash function that covers both properties, but in the meantime we are better off with the original hash. It's slightly slower in the common case, but it has fewer surprising pathological cases.

xdiff: 降低 XDL_FAST_HASH

xdiff代码对 diff 两边的每一行进行散列,然后比较这些散列以找到重复项。整体性能既取决于我们计算散列的速度,也取决于我们看到多少散列冲突。

的想法XDL_FAST_HASH是加速哈希计算。
但是生成的散列具有更差的碰撞行为。这意味着在某些情况下它会加快差异(运行“ git log -p”会随着它的git.git改进而提高~8%),但在其他情况下它会减慢速度。一个病理案例出现了超过 100 倍的减速

可能有一个更好的散列函数可以覆盖这两个属性,但同时我们最好使用原始散列。在普通情况下它会稍微慢一些,但它具有较少的令人惊讶的病理情况。



Note: "git diff --histogram" had a bad memory usage pattern, which has been rearranged to reduce the peak usage, with Git 2.19 (Q3 2018).

注意:“ git diff --histogram”有一个糟糕的内存使用模式,已重新安排以减少使用 Git 2.19(2018 年第三季度)的峰值。

See commit 79cb2eb, commit 64c4e8b, commit c671d4b, commit 2820985(19 Jul 2018) by Stefan Beller (stefanbeller).
(Merged by Junio C Hamano -- gitster--in commit 57fbd8e, 15 Aug 2018)

请参阅Stefan Beller ( ) 的提交 79cb2eb提交 64c4e8b提交 c671d4b提交 2820985(2018 年 7 月 19 日(由Junio C Hamano合并-- --commit 57fbd8e,2018 年 8 月 15 日)stefanbeller
gitster

xdiff/xhistogram: move index allocation into find_lcs

This fixes a memory issue when recursing a lot, which can be reproduced as

seq 1   100000 >one
seq 1 4 100000 >two
git diff --no-index --histogram one two

Before this patch, histogram_diffwould call itself recursively before calling free_index, which would mean a lot of memory is allocated during the recursion and only freed afterwards.

By moving the memory allocation (and its free call) into find_lcs, the memory is free'd before we recurse, such that memory is reused in the next step of the recursion instead of using new memory.

This addresses only the memory pressure, not the run time complexity, that is also awful for the corner case outlined above.

xdiff/xhistogram: 将索引分配移入 find_lcs

这修复了大量递归时的内存问题,可以将其复制为

seq 1   100000 >one
seq 1 4 100000 >two
git diff --no-index --histogram one two

在此补丁之前,histogram_diff会在调用之前递归调用自身free_index,这意味着在递归期间分配了大量内存,然后才释放。

通过将内存分配(及其空闲调用)移动到 中find_lcs,内存在我们递归之前被释放,以便在递归的下一步中重用内存而不是使用新内存。

这仅解决了内存压力,而不是运行时间复杂度,这对于上面概述的极端情况来说也是很糟糕的。