Original question: http://stackoverflow.com/questions/479654/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
Stack Overflow
Java library for free-text diff
Asked by Joshua Fox
I need to match up two almost-the-same long freetext strings; i.e., to find index-to-index correspondences wherever possible.
Because this is freetext, the comparison should not be line-based as in code diffing.
Any suggestions for Java libraries?
A simple example (in real life, of course, there would not be extra whitespace to line things up, and there may be more complex challenges, such as entire clauses being moved around):
The quick brown  fox jumped over the lazy      dog.
||||||||||      |||||||||||||||||||||         |||||
The quick yellow fox jumped over the well-bred dog.
Accepted answer by Joshua Fox
This one might be good: Diff Match Patch.
Answer by Fabian Steeg
Depending on your exact requirements, the StringUtils class of the Apache Commons Lang component might be helpful, e.g.:
- StringUtils#difference: Compares two Strings and returns the portion where they differ
- StringUtils#getLevenshteinDistance: Finds the Levenshtein distance between two Strings
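To give a feel for what those two methods compute, here is a minimal plain-Java sketch that does not depend on Commons Lang. The class name `TextCompare` and both method bodies are my own illustration, not the library's implementation, and they skip the null handling the library does. It assumes "the portion where they differ" means the remainder of the second string from the first differing character onwards.

```java
final class TextCompare {

    // Illustrative stand-in for "the portion where they differ":
    // the tail of b, starting at the first index where a and b disagree.
    static String difference(String a, String b) {
        int i = 0;
        while (i < a.length() && i < b.length() && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return b.substring(i);
    }

    // Classic dynamic-programming Levenshtein distance, keeping only two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(difference("i am a machine", "i am a robot")); // robot
        System.out.println(levenshtein("kitten", "sitting"));             // 3
    }
}
```

Neither result gives index-to-index correspondences by itself, so these are probably only useful as a quick check of how different the two texts are.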
Answer by joel.neely
Here's a (lightly-tested) version of code that does what you asked. You can easily traverse the result in parallel with the inputs to locate insertions and deletions.
public class StringDiff {

    // Null-safe helpers: a null input is treated as an empty string.
    private static int length(String s) { return s == null ? 0 : s.length(); }
    private static char[] chars(String s) { return s == null ? new char[0] : s.toCharArray(); }

    private final String left;
    private final String right;
    private final char[] lccs; // longest common subsequence, as characters
    private final String lcs;  // the same subsequence, as a String

    public StringDiff(String left, String right) {
        this.left = left;
        this.right = right;
        lccs = init();
        lcs = new String(lccs);
    }

    public String getLcs() { return lcs; }
    public char[] getLccs() { return lccs.clone(); }

    private char[] init() {
        int lLength = length(left);
        int rLength = length(right);
        char[] lChars = chars(left);
        char[] rChars = chars(right);

        // t[i][j] is the length of the longest common subsequence of
        // left[i..] and right[j..]; the table is filled from the end backwards.
        int[][] t = new int[lLength + 1][rLength + 1];
        for (int i = lLength - 1; i >= 0; --i) {
            for (int j = rLength - 1; j >= 0; --j) {
                if (lChars[i] == rChars[j]) {
                    t[i][j] = t[i + 1][j + 1] + 1;
                } else {
                    t[i][j] = Math.max(t[i + 1][j], t[i][j + 1]);
                }
            }
        }

        // Walk forward through the table to recover one such subsequence.
        char[] result = new char[t[0][0]];
        int l = 0, r = 0, p = 0;
        while (l < lLength && r < rLength) {
            if (lChars[l] == rChars[r]) {
                result[p++] = lChars[l++];
                r++;
            } else {
                // Skip on whichever side keeps the longer remaining subsequence.
                if (t[l + 1][r] > t[l][r + 1]) {
                    ++l;
                } else {
                    ++r;
                }
            }
        }
        return result;
    }
}
According to it, the actual longest common subsequence of your original inputs:
The quick brown fox jumped over the lazy dog.
The quick yellow fox jumped over the well-bred dog.
is:
The quick ow fox jumped over the l dog.
(because "brown" and "yellow" have "ow" in common, etc.)
It's relatively straightforward to modify the above to split on whitespace (instead of into char arrays) and substitute String#equals for == to get a version that finds the longest common subsequence of words instead of characters. For your example above that change would produce the obvious result:
found 7 words
'The'
'quick'
'fox'
'jumped'
'over'
'the'
'dog.'
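For reference, the word-based modification described above might look roughly like the following. This is a sketch, not the code that produced the output shown: the class name `WordLcs` and the choice to split on runs of whitespace with `\\s+` are assumptions of mine.

```java
import java.util.ArrayList;
import java.util.List;

final class WordLcs {

    // Splits on runs of whitespace; null or blank input yields no words.
    private static String[] words(String s) {
        if (s == null || s.trim().isEmpty()) {
            return new String[0];
        }
        return s.trim().split("\\s+");
    }

    // Longest common subsequence of words, using the same table as the
    // character version but comparing with String#equals instead of ==.
    static List<String> lcs(String left, String right) {
        String[] l = words(left);
        String[] r = words(right);
        int[][] t = new int[l.length + 1][r.length + 1];
        for (int i = l.length - 1; i >= 0; --i) {
            for (int j = r.length - 1; j >= 0; --j) {
                t[i][j] = l[i].equals(r[j])
                        ? t[i + 1][j + 1] + 1
                        : Math.max(t[i + 1][j], t[i][j + 1]);
            }
        }
        List<String> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < l.length && j < r.length) {
            if (l[i].equals(r[j])) {
                result.add(l[i]);
                i++;
                j++;
            } else if (t[i + 1][j] > t[i][j + 1]) {
                i++;
            } else {
                j++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> common = lcs(
                "The quick brown fox jumped over the lazy dog.",
                "The quick yellow fox jumped over the well-bred dog.");
        System.out.println("found " + common.size() + " words");
        for (String w : common) {
            System.out.println("'" + w + "'");
        }
    }
}
```

Note that punctuation stays attached to its word here ("dog." rather than "dog"), so a real tokenizer would likely need to be smarter than a whitespace split.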
(Your question implied character comparisons, as you matched the spaces between words.)
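As for the parallel traversal mentioned at the start of this answer, one way it could go is sketched below. The class `LcsWalk`, the method `edits`, and the "-c@i" / "+c@j" notation are all made up for illustration; the `lcs` argument is assumed to be a common subsequence of both inputs, e.g. the value returned by getLcs() in the character-based class.

```java
import java.util.ArrayList;
import java.util.List;

final class LcsWalk {

    // Lists the edits that turn left into right, given a common subsequence.
    // "-c@i" means character c at index i of left was deleted;
    // "+c@j" means character c at index j of right was inserted.
    // Throws StringIndexOutOfBoundsException if lcs is not a subsequence of both.
    static List<String> edits(String left, String right, String lcs) {
        List<String> out = new ArrayList<>();
        int l = 0, r = 0;
        for (int p = 0; p < lcs.length(); p++) {
            char c = lcs.charAt(p);
            // Everything in left before the next common character was deleted.
            while (left.charAt(l) != c) {
                out.add("-" + left.charAt(l) + "@" + l);
                l++;
            }
            // Everything in right before the next common character was inserted.
            while (right.charAt(r) != c) {
                out.add("+" + right.charAt(r) + "@" + r);
                r++;
            }
            // left[l] and right[r] correspond: this is an index-to-index match.
            l++;
            r++;
        }
        // Whatever remains after the last common character.
        for (; l < left.length(); l++) {
            out.add("-" + left.charAt(l) + "@" + l);
        }
        for (; r < right.length(); r++) {
            out.add("+" + right.charAt(r) + "@" + r);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [-b@0, -r@1, +y@0, +e@1, +l@2, +l@3, -n@4]
        System.out.println(edits("brown", "yellow", "ow"));
    }
}
```

The point where both inner loops stop is where you would record a matching index pair (l, r), if that is the correspondence you are after.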
Answer by Christoph
If your example is really what you want to do - i.e., subsequences only match if they start at the same index (which is different from how diffs normally operate) - this is all you need to do:
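import java.util.*;

class StringDiff {

    // Returns [start, end) index pairs for each run of positions at which
    // s1 and s2 hold the same character.
    public static List<int[]> from(String s1, String s2) {
        int start = -1; // start of the current matching run, or -1 if none
        int pos = 0;
        LinkedList<int[]> list = new LinkedList<int[]>();
        for (; pos < s1.length() && pos < s2.length(); ++pos) {
            if (s1.charAt(pos) == s2.charAt(pos)) {
                if (start < 0) start = pos;
            }
            else {
                if (start >= 0) list.add(new int[] { start, pos });
                start = -1;
            }
        }
        // Close a run that reaches the end of the shorter string.
        if (start >= 0) list.add(new int[] { start, pos });
        return list;
    }

    // Prints each matching run; expects the two strings as arguments.
    public static void main(String[] args) {
        for (int[] idx : from(args[0], args[1]))
            System.out.println(args[0].substring(idx[0], idx[1]));
    }
}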
An actual diff implementation will be far more sophisticated.