`git diff --patience` 和 `git diff --histogram` 有什么区别?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/32365271/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
What's the difference between `git diff --patience` and `git diff --histogram`?
提问by Stuart P. Bentley
This earlier questionasked for the differences between 4 different Git diff strategies, but the only difference that was explained was the difference between myers
and patience
, which is pretty well explained elsewhere.
这个较早的问题询问了 4 种不同 Git 差异策略之间的差异,但解释的唯一差异是myers
和之间的差异patience
,这在别处得到了很好的解释。
How does the histogram
strategy work? What differentiates it from patience
? The git-diff man pageonly says that it "extends the patience algorithm to "support low-occurrence common elements"." Other pages mention that it's faster, and that it comes from JGit, but they don't explain where or how its algorithm or results will differ from patience
.
histogram
策略如何运作?它与什么区别patience
?在GIT-DIFF手册页只能说,它“伸出耐心算法‘支持低发生共同的元素’。” 其他页面提到它更快,并且它来自 JGit,但他们没有解释它的算法或结果将在何处或如何与patience
.
Where can I find a description of the histogram
algorithm relative to the patience
algorithm, with the same level of detail as Bram Cohen's original description of the patience
algorithm?
我在哪里可以找到与histogram
算法相关的patience
算法描述,其详细程度与Bram Cohen 对patience
算法的原始描述相同?
(If it's just a matter of implementation performance with no case that will produce different results, why wasn't it just implemented as a new backend for patience
?)
(如果这只是一个实现性能的问题,没有任何情况会产生不同的结果,为什么不只是作为一个新的后端来实现patience
?)
回答by VonC
This histogram strategy was introduced in git 1.7.7 (Sept 2011), with the following description (as mentioned by the OP)
这种直方图策略是在git 1.7.7 (Sept 2011) 中引入的,具有以下描述(如 OP 所述)
"
git diff
" learned a "--histogram
" option to use a different diff generation machinery stolen from jgit, which might give better performance.
"
git diff
"学习了" "--histogram
" 使用从jgit窃取的不同差异生成机制的选项,这可能会提供更好的性能。
JGit includes src/org/eclipse/jgit/diff/HistogramDiff.java
and tst/org/eclipse/jgit/diff/HistogramDiffTest.java
JGit 包括src/org/eclipse/jgit/diff/HistogramDiff.java
和tst/org/eclipse/jgit/diff/HistogramDiffTest.java
The description there is fairly complete:
那里的描述相当完整:
HistogramDiff
An extended form of Bram Cohen's patience diff algorithm.
This implementation was derived by using the 4 rules that are outlined in Bram Cohen's blog, and then was further extended to support low-occurrence common elements.
The basic idea of the algorithm is to create a histogram of occurrences for each element of sequence A. Each element of sequence B is then considered in turn. If the element also exists in sequence A, and has a lower occurrence count, the positions are considered as a candidate for the longest common subsequence (LCS).
After scanning of B is complete the LCS that has the lowest number of occurrences is chosen as a split point. The region is split around the LCS, and the algorithm is recursively applied to the sections before and after the LCS.By always selecting a LCS position with the lowest occurrence count, this algorithm behaves exactly like Bram Cohen's patience diff whenever there is a unique common element available between the two sequences.
When no unique elements exist, the lowest occurrence element is chosen instead.
This offers more readable diffs than simply falling back on the standard Myers'O(ND)
algorithm would produce.To prevent the algorithm from having an
O(N^2)
running time, an upper limit on the number of unique elements in a histogram bucket is configured by#setMaxChainLength(int)
.
If sequence A has more than this many elements that hash into the same hash bucket, the algorithm passes the region to#setFallbackAlgorithm(DiffAlgorithm)
.
If no fallback algorithm is configured, the region is emitted as a replace edit.During scanning of sequence B, any element of A that occurs more than
#setMaxChainLength(int)
times is never considered for an LCS match position, even if it is common between the two sequences. This limits the number of locations in sequence A that must be considered to find the LCS,and helps maintain a lower running time bound.So long as
#setMaxChainLength(int)
is a small constant (such as 64), the algorithm runs inO(N * D)
time, whereN
is the sum of the input lengths andD
is the number of edits in the resultingEditList
.
If the suppliedSequenceComparator
has a good hash function, this implementation typically out-performsMyersDiff
, even though its theoretical running time is the same.This implementation has an internal limitation that prevents it from handling sequences with more than 268,435,456 (2^28) elements
直方图差异
Bram Cohen 的耐心差异算法的扩展形式。
此实现是通过使用Bram Cohen 的博客中概述的 4 条规则派生而来 的,然后进一步扩展以支持低发生率的公共元素。
该算法的基本思想是为序列 A 的每个元素创建一个出现次数的直方图。然后依次考虑序列 B 的每个元素。如果该元素也存在于序列 A 中,并且出现次数较少,则这些位置被视为最长公共子序列 (LCS) 的候选者。
完成 B 的扫描后,选择出现次数最少的 LCS 作为分割点。该区域围绕 LCS 进行分割,并将算法递归应用于 LCS 前后的部分。通过始终选择出现次数最少的 LCS 位置,只要两个序列之间存在唯一的公共元素,该算法的行为就与 Bram Cohen 的耐心差异完全相同。
当不存在唯一元素时,将选择出现次数最少的元素。
这提供了比简单地退回到标准 MyersO(ND)
算法会产生的更具可读性的差异。为了防止算法有
O(N^2)
运行时间,直方图桶中唯一元素的数量上限由 配置#setMaxChainLength(int)
。
如果序列 A 有多个元素散列到同一个散列桶中,则算法将该区域传递给#setFallbackAlgorithm(DiffAlgorithm)
。
如果未配置回退算法,则该区域将作为替换编辑发出。在扫描序列 B 期间,A 中出现
#setMaxChainLength(int)
多次的任何元素都不会被视为 LCS 匹配位置,即使它在两个序列之间是通用的。这限制了在序列 A 中寻找 LCS 必须考虑的位置数量,并有助于保持较低的运行时间界限。只要
#setMaxChainLength(int)
是一个小常数(例如 64),算法就会O(N * D)
及时运行,其中N
是输入长度的总和,D
是结果 中的编辑次数EditList
。
如果所提供SequenceComparator
的散列函数具有良好的散列函数MyersDiff
,则即使其理论运行时间相同,此实现通常也会优于 。此实现有一个内部限制,无法处理超过 268,435,456 (2^28) 个元素的序列
Note that this kind of algo was already used for pack_check, back in 2006 (git 1.3), for git-verify-pack -v
. It was reused for index-pack in git 1.7.7
请注意,这种算法早在 2006 年 (git 1.3) 就已用于 pack_check,用于git-verify-pack -v
. 它被重用于 git 1.7.7 中的 index-pack
Commit 8c912eeactually introduced --histogram
to diff:
Commit 8c912ee实际上引入--histogram
了diff:
Port JGit's HistogramDiff algorithm over to C. Rough numbers (TODO) show that it is faster than its
--patience
cousin, as well as the default Meyers algorithm.The implementation has been reworked to use structs and pointers, instead of bitmasks, thus doing away with JGit's
2^28
line limit.We also use
xdiff
's default hash table implementation (xdl_hash_bits()
withXDL_HASHLONG()
) for convenience.
将 JGit 的 HistogramDiff 算法移植到 C。粗略数字 (TODO) 表明它比它的
--patience
表亲以及默认的 Meyers 算法更快。该实现已被重新设计以使用结构体和指针,而不是位掩码,从而取消了 JGit 的
2^28
行限制。为方便起见,我们还使用了
xdiff
默认的哈希表实现(xdl_hash_bits()
withXDL_HASHLONG()
)。
commit 8555123 (git 1.7.10, April 2012)added:
提交 8555123(git 1.7.10,2012 年 4 月)添加:
8c912ee (teach
--histogram
todiff
, 2011-07-12) claimed histogram diff was faster than both Myers and patience.We have since incorporated a performance testing framework, so add a test that compares the various diff tasks performed in a real '
log -p
' workload.
This does indeed show that histogram diff slightly beats Myers, while patience is much slower than the others.
8c912ee (teach
--histogram
todiff
, 2011-07-12) 声称直方图差异比 Myers 和耐心都快。我们已经合并了一个性能测试框架,因此添加一个测试来比较在真实“
log -p
”工作负载中执行的各种差异任务。
这确实表明直方图差异略胜于迈尔斯,而耐心比其他人慢得多。
Finally, commit 07ab4de (git 1.8.2, March 2013)add
最后,提交 07ab4de(git 1.8.2,2013 年 3 月)添加
config: Introduce diff.algorithm variable
config: 引入 diff.algorithm 变量
Some users or projects prefer different algorithms over others, e.g. patience over myers or similar.
However, specifying appropriate argument every time diff is to be used is impractical. Moreover, creating an alias doesn't play nicely with other tools based on diff (git-show
for instance).Hence, a configuration variable which is able to set specific algorithm is needed.
For now, these four values are accepted:
- '
myers
' (which has the same effect as not setting the config variable at all),- '
minimal
',- '
patience
' and- '
histogram
'.
一些用户或项目比其他用户或项目更喜欢不同的算法,例如耐心而不是迈尔斯或类似的。
但是,每次使用 diff 时都指定适当的参数是不切实际的。此外,创建别名不能很好地与基于 diff 的其他工具一起使用(git-show
例如)。因此,需要一个能够设置特定算法的配置变量。
目前,接受这四个值:
- '
myers
'(与根本不设置配置变量具有相同的效果),- '
minimal
',- '
patience
' 和- '
histogram
'。
Commit 07924d4added concurrently the --diff-algorithm
command line option.
As the OP Stuart P. Bentleymentions in the comments:
Commit 07924d4同时添加了--diff-algorithm
命令行选项。
正如 OP Stuart P. Bentley在评论中提到的:
you can configure Git to use histogram by defaultwith:
您可以将 Git 配置为默认使用直方图:
git config --global diff.algorithm histogram
Update: Git 2.12 (Q1 2017) will retire the "fast hash" that had disastrous performance issues in some corner cases.
更新:Git 2.12(2017 年第一季度)将淘汰在某些极端情况下具有灾难性性能问题的“快速哈希”。
See commit 1f7c926(01 Dec 2016) by Jeff King (peff
).
(Merged by Junio C Hamano -- gitster
--in commit 731490b, 19 Dec 2016)
请参阅Jeff King ( ) 的commit 1f7c926(01 Dec 2016 )。
(由Junio C Hamano合并-- --在commit 731490b,2016 年 12 月 19 日)peff
gitster
xdiff
: dropXDL_FAST_HASH
The
xdiff
code hashes every line of both sides of a diff, and then compares those hashes to find duplicates. The overall performance depends both on how fast we can compute the hashes, but also on how many hash collisions we see.The idea of
XDL_FAST_HASH
is to speed up the hash computation.
But the generated hashes have worse collision behavior. This means that in some cases it speeds diffs up (running "git log -p
" ongit.git
improves by~8%
with it), but in others it can slow things down. One pathological case saw over a 100x slowdown.There may be a better hash function that covers both properties, but in the meantime we are better off with the original hash. It's slightly slower in the common case, but it has fewer surprising pathological cases.
xdiff
: 降低XDL_FAST_HASH
该
xdiff
代码对 diff 两边的每一行进行散列,然后比较这些散列以找到重复项。整体性能既取决于我们计算散列的速度,也取决于我们看到多少散列冲突。的想法
XDL_FAST_HASH
是加速哈希计算。
但是生成的散列具有更差的碰撞行为。这意味着在某些情况下它会加快差异(运行“git log -p
”会随着它的git.git
改进而提高~8%
),但在其他情况下它会减慢速度。一个病理案例出现了超过 100 倍的减速。可能有一个更好的散列函数可以覆盖这两个属性,但同时我们最好使用原始散列。在普通情况下它会稍微慢一些,但它具有较少的令人惊讶的病理情况。
Note: "git diff --histogram
" had a bad memory usage pattern, which has
been rearranged to reduce the peak usage, with Git 2.19 (Q3 2018).
注意:“ git diff --histogram
”有一个糟糕的内存使用模式,已重新安排以减少使用 Git 2.19(2018 年第三季度)的峰值。
See commit 79cb2eb, commit 64c4e8b, commit c671d4b, commit 2820985(19 Jul 2018) by Stefan Beller (stefanbeller
).
(Merged by Junio C Hamano -- gitster
--in commit 57fbd8e, 15 Aug 2018)
请参阅Stefan Beller ( ) 的提交 79cb2eb、提交 64c4e8b、提交 c671d4b、提交 2820985(2018 年 7 月 19 日)。(由Junio C Hamano合并-- --在commit 57fbd8e,2018 年 8 月 15 日)stefanbeller
gitster
xdiff/xhistogram
: move index allocation intofind_lcs
This fixes a memory issue when recursing a lot, which can be reproduced as
seq 1 100000 >one seq 1 4 100000 >two git diff --no-index --histogram one two
Before this patch,
histogram_diff
would call itself recursively before callingfree_index
, which would mean a lot of memory is allocated during the recursion and only freed afterwards.By moving the memory allocation (and its free call) into
find_lcs
, the memory is free'd before we recurse, such that memory is reused in the next step of the recursion instead of using new memory.This addresses only the memory pressure, not the run time complexity, that is also awful for the corner case outlined above.
xdiff/xhistogram
: 将索引分配移入find_lcs
这修复了大量递归时的内存问题,可以将其复制为
seq 1 100000 >one seq 1 4 100000 >two git diff --no-index --histogram one two
在此补丁之前,
histogram_diff
会在调用之前递归调用自身free_index
,这意味着在递归期间分配了大量内存,然后才释放。通过将内存分配(及其空闲调用)移动到 中
find_lcs
,内存在我们递归之前被释放,以便在递归的下一步中重用内存而不是使用新内存。这仅解决了内存压力,而不是运行时间复杂度,这对于上面概述的极端情况来说也是很糟糕的。