C# text difference algorithm
Disclaimer: this page is an English-Chinese translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must likewise follow the CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/145607/
Text difference algorithm
Asked by Graviton
I need an algorithm that can compare two text files and highlight their difference and ( even better!) can compute their difference in a meaningful way (like two similar files should have a similarity score higher than two dissimilar files, with the word "similar" defined in the normal terms). It sounds easy to implement, but it's not.
The implementation can be in C# or Python.
Thanks.
Accepted answer by tzot
In Python, there is difflib, as also others have suggested.
difflib offers the SequenceMatcher class, which can be used to give you a similarity ratio. Example function:
import difflib

def text_compare(text1, text2, isjunk=None):
    return difflib.SequenceMatcher(isjunk, text1, text2).ratio()
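As a quick illustration of what the ratio means (an added usage sketch, not part of the original answer): `ratio()` returns 2*M / T, where M is the number of matching characters and T is the combined length of both sequences.

```python
import difflib

# "abcd" vs "abce": the block "abc" matches, so M = 3 out of T = 8
# characters, giving a ratio of 2*3/8 = 0.75.
score = difflib.SequenceMatcher(None, "abcd", "abce").ratio()
print(score)  # 0.75
```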
Answered by Douglas Leeder
Answered by Torsten Marek
If you need a finer granularity than lines, you can use the Levenshtein distance. Levenshtein distance is a straightforward measure of how similar two texts are.
You can also use it to extract the edit operations and produce a very fine-grained diff, similar to that on the edit history pages of SO.
Be warned, though, that Levenshtein distance can be quite CPU- and memory-intensive to calculate, so using difflib, as Douglas Leeder suggested, is most likely going to be faster.
Cf. also this answer.
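For reference (an added sketch, not part of the answer), a textbook dynamic-programming Levenshtein implementation looks like this; note the O(n*m) time cost the answer warns about:

```python
def levenshtein(a, b):
    """Textbook dynamic-programming edit distance (insert, delete,
    substitute).  Only two rows of the matrix are kept, so memory is
    O(min(n, m)) even though time is still O(n * m)."""
    if len(a) < len(b):
        a, b = b, a  # make b the shorter string
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]
```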
Answered by paradoja
As stated, use difflib. Once you have the diffed output, you can compute the Levenshtein distance of the differing strings to give a "value" of how different they are.
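One way to combine the two steps this answer suggests (a hedged sketch; the helper name `differing_parts` is made up here) is to use SequenceMatcher's opcodes to pull out just the substring pairs that differ, which can then be fed to a Levenshtein function:

```python
import difflib

def differing_parts(text1, text2):
    """Hypothetical helper: return the pairs of substrings that difflib
    marks as non-equal, so a finer metric (e.g. Levenshtein distance)
    can be applied to just those pieces."""
    sm = difflib.SequenceMatcher(None, text1, text2)
    return [(text1[i1:i2], text2[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes()
            if tag != "equal"]
```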
Answered by Daniel James
Bazaar contains an alternative difference algorithm, called patience diff (there's more info in the comments on that page), which is claimed to be better than the traditional diff algorithm. The file 'patiencediff.py' in the Bazaar distribution is a simple command-line front end.
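The core idea of patience diff can be sketched in a few lines (a simplified illustration, not Bazaar's actual patiencediff.py): take the lines that occur exactly once in each file, and keep the longest chain of them that appears in the same relative order in both files; those lines anchor the diff, and the algorithm then recurses between the anchors.

```python
import bisect
from collections import Counter

def patience_anchors(a, b):
    """Find the anchor lines patience diff would use: lines occurring
    exactly once in each file, reduced to the longest subsequence that
    appears in the same order in both (patience sorting)."""
    ca, cb = Counter(a), Counter(b)
    pos_b = {line: i for i, line in enumerate(b)}
    candidates = [line for line in a if ca[line] == 1 and cb[line] == 1]
    tails, links = [], {}  # patience-sorting piles and back-pointers
    for line in candidates:
        p = pos_b[line]
        i = bisect.bisect_left([pos_b[t] for t in tails], p)
        if i == len(tails):
            tails.append(line)
        else:
            tails[i] = line
        links[line] = tails[i - 1] if i > 0 else None
    chain, cur = [], tails[-1] if tails else None
    while cur is not None:  # walk the chain back to the start
        chain.append(cur)
        cur = links[cur]
    return chain[::-1]
```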
Answered by aku
I can recommend taking a look at Neil Fraser's code and articles:
Currently available in Java, JavaScript, C++ and Python. Regardless of language, each library features the same API and the same functionality. All versions also have comprehensive test harnesses.
Neil Fraser: Diff Strategies - for theory and implementation notes
Answered by johnp
There are a number of distance metrics; as paradoja mentioned there is the Levenshtein distance, but there are also NYSIIS and Soundex. In terms of Python implementations, I have used py-editdist and ADVAS before. Both are nice in the sense that you get a single number back as a score. Check out ADVAS first; it implements a bunch of algorithms.
Answered by user8134
My current understanding is that the best solution to the Shortest Edit Script (SES) problem is Myers "middle-snake" method with the Hirschberg linear space refinement.
The Myers algorithm is described in:
E. Myers, "An O(ND) Difference Algorithm and Its Variations,"
Algorithmica 1, 2 (1986), 251-266.
The GNU diff utility uses the Myers algorithm.
The "similarity score" you speak of is called the "edit distance" in the literature which is the number of inserts or deletes necessary to transform one sequence into the other.
Note that a number of people have cited the Levenshtein distance algorithm, but, although it is easy to implement, it is not the optimal solution: it is inefficient (it requires a possibly huge n*m matrix) and it does not provide the "edit script", the sequence of edits that could be used to transform one sequence into the other and vice versa.
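For the distance alone (not the edit script), the greedy forward pass of Myers' algorithm is quite compact; this sketch follows the paper's basic greedy algorithm and only returns the SES length, since recovering the script itself is what the middle-snake / Hirschberg refinement is for:

```python
def myers_distance(a, b):
    """Greedy forward pass of Myers' O(ND) algorithm: returns the length
    of the shortest edit script (inserts + deletes only).  Runs in O(ND)
    time, where D is the returned distance."""
    n, m = len(a), len(b)
    v = {1: 0}  # best x reached so far on each diagonal k = x - y
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]        # step down  (insert from b)
            else:
                x = v[k - 1] + 1    # step right (delete from a)
            y = x - k
            while x < n and y < m and a[x] == b[y]:
                x, y = x + 1, y + 1  # follow the "snake" of matches
            v[k] = x
            if x >= n and y >= m:
                return d
    return n + m
```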
For a good Myers / Hirschberg implementation look at:
http://www.ioplex.com/~miallen/libmba/dl/src/diff.c
The particular library that it is contained within is no longer maintained but to my knowledge the diff.c module itself is still correct.
Mike
Answered by zeadqunes
You could use the solution to the Longest Common Subsequence (LCS) problem. See also the discussion about possible ways to optimize this solution.
Answered by Lasse V. Karlsen
One method I've employed for a different purpose, calculating how much data was new in a modified file, could perhaps work for you as well.
I have a diff/patch implementation in C# that allows me to take two files, presumably old and new versions of the same file, and calculate the "difference", but not in the usual sense of the word. Basically, I calculate a set of operations that I can perform on the old version to update it to have the same contents as the new version.
To use this for the functionality initially described, to see how much data was new, I simply ran through the operations: every operation that copied from the old file verbatim had a 0-factor, and every operation that inserted new text (distributed as part of the patch, since it didn't occur in the old file) had a 1-factor. Every character was given this factor, which gave me basically a long list of 0s and 1s.
All I then had to do was tally up the 0s and 1s. In your case, with my implementation, a low number of 1s compared to 0s would mean the files are very similar.
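Without the author's C# code at hand, the same 0/1 tally can be roughly approximated in Python with difflib (an approximation only; the function name `new_data_ratio` is made up here, and difflib will not detect the out-of-order or duplicated copies the original implementation handles):

```python
import difflib

def new_data_ratio(old, new):
    """Rough stand-in for the 0/1-factor tally: characters copied from
    the old file count 0, freshly inserted characters count 1, and the
    score is the fraction of the new file that is genuinely new."""
    sm = difflib.SequenceMatcher(None, old, new)
    new_chars = sum(j2 - j1
                    for tag, i1, i2, j1, j2 in sm.get_opcodes()
                    if tag in ("insert", "replace"))
    return new_chars / len(new) if new else 0.0
```

A low ratio corresponds to the "many 0s, few 1s" case the answer describes, i.e. files that are very similar.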
This implementation would also handle cases where the modified file had inserted copies from the old file out of order, or even duplicated them (i.e. you copy a part from the start of the file and paste it near the bottom), since they would both be copies of the same original part from the old file.
I experimented with weighting copies, so that the first copy counted as 0 and subsequent copies of the same characters had progressively higher factors, in order to give a copy/paste operation some "new-factor", but I never finished it because the project was scrapped.
If you're interested, my diff/patch code is available from my Subversion repository.