Java library for free-text diff

Question

Asked by Joshua Fox

I need to match up two almost-the-same long freetext strings; i.e., to find index-to-index correspondences wherever possible.

Because this is freetext, the comparison should not be line-based as in code diffing.

Any suggestions for Java libraries?

A simple example (in real life, of course, there would be no extra whitespace to line things up, and there may be more complex challenges, such as entire clauses being moved around):

The quick brown  fox jumped over the  lazy     dog.
||||||||||      |||||||||||||||||||||         |||||
The quick yellow fox jumped over the well-bred dog.

Answer 1

Accepted answer by Joshua Fox

This one might be good: Diff Match Patch.

Answer 2

Answer by Fabian Steeg

Depending on your exact requirements, the StringUtils class of the Apache Commons Lang component might be helpful, e.g.:

StringUtils#difference: Compares two Strings and returns the portion where they differ
StringUtils#getLevenshteinDistance: Finds the Levenshtein distance between two Strings
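To make concrete what these two methods compute, here is a minimal plain-Java sketch that does not depend on Commons Lang. The class name StringCompareSketch and both method bodies are illustrative assumptions, not the library's source; the first method mimics the documented behavior of returning the remainder of the second string from the point where it diverges from the first, and null handling is omitted.

```java
// Hypothetical stand-in for the two StringUtils methods; not the Commons Lang implementation.
final class StringCompareSketch {

    // Returns the tail of b starting at the first index where a and b differ,
    // or "" when b has no differing tail.
    static String difference(String a, String b) {
        int i = 0;
        int limit = Math.min(a.length(), b.length());
        while (i < limit && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return b.substring(i);
    }

    // Levenshtein distance using two rolling rows of the dynamic-programming table.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

Note that neither result gives index-to-index correspondences: the first only reports where the strings start to diverge, and the second only reports how far apart they are.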

Answer 3

Answer by joel.neely

Here's a (lightly-tested) version of code that does what you asked. You can easily traverse the result in parallel with the inputs to locate insertions and deletions.

public class StringDiff {

    private static int   length(String s) { return s == null ? 0 : s.length(); }
    private static char[] chars(String s) { return s == null ? new char[0] : s.toCharArray(); }

    private final String left;
    private final String right;

    private final char[] lccs;
    private final String lcs;

    public StringDiff(String left, String right) {
        this.left = left;
        this.right = right;
        lccs = init();
        lcs = new String(lccs);
    }

    public String getLcs()  { return lcs; }
    public char[] getLccs() { return lccs.clone(); }

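    // Computes the longest common subsequence (LCS) of left and right by dynamic programming.
    // Note: the table holds (lLength + 1) * (rLength + 1) ints, so memory grows with the
    // product of the two input lengths.
    private char[] init() {
        int lLength = length(left);
        int rLength = length(right);
        char[] lChars = chars(left);
        char[] rChars = chars(right);
        // t[i][j] = length of the LCS of the suffixes starting at lChars[i] and rChars[j]
        int[][] t = new int[lLength + 1][rLength + 1];
        for (int i = lLength - 1; i >= 0; --i) {
            for (int j = rLength - 1; j >= 0; --j) {
                if (lChars[i] == rChars[j]) {
                    t[i][j] = t[i + 1][j + 1] + 1;
                } else {
                    t[i][j] = Math.max(t[i + 1][j], t[i][j + 1]);
                }
            }
        }
        // Walk forward from t[0][0], collecting the matched characters.
        char[] result = new char[t[0][0]];
        int l = 0, r = 0, p = 0;
        while (l < lLength && r < rLength) {
            if (lChars[l] == rChars[r]) {
                result[p++] = lChars[l++];
                r++;
            } else {
                // Skip on whichever side leaves the longer remaining LCS (right on ties).
                if (t[l + 1][r] > t[l][r + 1]) {
                    ++l;
                } else {
                    ++r;
                }
            }
        }
        return result;
    }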

}

According to it, the actual longest common subsequence of your original inputs:

The quick brown  fox jumped over the  lazy     dog.
The quick yellow fox jumped over the well-bred dog.

is:

The quick ow fox jumped over the l dog.

(because "brown" and "yellow" have "ow" in common, etc.)
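The parallel traversal mentioned earlier can be sketched roughly as follows. This is an illustration only: the class LcsAlignment and its align method are hypothetical names, and the code assumes the third argument really is a common subsequence of both inputs (for example, the value returned by getLcs()). It pairs each LCS character with the leftmost index that still matches on each side, which is one valid alignment but not necessarily the same one the table walk in init() follows.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: maps a known common subsequence back to index pairs.
final class LcsAlignment {

    // Returns {leftIndex, rightIndex} pairs, one per character of lcs.
    // Assumes lcs is a subsequence of both left and right; otherwise
    // charAt throws StringIndexOutOfBoundsException.
    static List<int[]> align(String left, String right, String lcs) {
        List<int[]> pairs = new ArrayList<>();
        int l = 0, r = 0;
        for (int p = 0; p < lcs.length(); p++) {
            char c = lcs.charAt(p);
            while (left.charAt(l) != c) l++;   // chars skipped here are unmatched in left (deletions)
            while (right.charAt(r) != c) r++;  // chars skipped here are unmatched in right (insertions)
            pairs.add(new int[] { l++, r++ });
        }
        return pairs;
    }
}
```

For "brown" and "yellow" with the common subsequence "ow", this yields the pairs (2, 4) and (3, 5); every index not listed in a pair belongs to an insertion or a deletion.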

It's relatively straightforward to modify the above to split on whitespace (instead of into char arrays) and substitute String#equals for == to get a version that finds the longest common subsequence of words instead of characters. For your example above that change would produce the obvious result:

found 7 words
    'The'
    'quick'
    'fox'
    'jumped'
    'over'
    'the'
    'dog.'

(Your question implied character comparisons, as you matched the spaces between words.)
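The word-based variant described above might look something like the sketch below. It is not the code that produced the output shown; the class name WordLcs, the helper words, and the choice to split on runs of whitespace with the regex \s+ are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical word-level counterpart of the character-based StringDiff.
final class WordLcs {

    // Splits on runs of whitespace; null or blank input gives an empty array.
    static String[] words(String s) {
        String t = s == null ? "" : s.trim();
        return t.isEmpty() ? new String[0] : t.split("\\s+");
    }

    // Longest common subsequence of words, compared with String#equals.
    static List<String> lcs(String left, String right) {
        String[] a = words(left);
        String[] b = words(right);
        // t[i][j] = LCS length of the word suffixes starting at a[i] and b[j]
        int[][] t = new int[a.length + 1][b.length + 1];
        for (int i = a.length - 1; i >= 0; --i) {
            for (int j = b.length - 1; j >= 0; --j) {
                t[i][j] = a[i].equals(b[j])
                        ? t[i + 1][j + 1] + 1
                        : Math.max(t[i + 1][j], t[i][j + 1]);
            }
        }
        List<String> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i].equals(b[j])) {
                result.add(a[i]);
                i++;
                j++;
            } else if (t[i + 1][j] > t[i][j + 1]) {
                i++;
            } else {
                j++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> common = lcs(
                "The quick brown  fox jumped over the  lazy     dog.",
                "The quick yellow fox jumped over the well-bred dog.");
        System.out.println("found " + common.size() + " words");
        for (String w : common) {
            System.out.println("    '" + w + "'");
        }
    }
}
```

Because punctuation stays attached to its word here, "dog." only matches "dog." and would not match a bare "dog"; a real tokenizer would likely need to handle that.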

Answer 4

Answer by Christoph

If your example is really what you want to do, i.e. subsequences only match if they start at the same index in both strings (which is different from how diffs normally operate), this is all you need to do:

import java.util.*;

class StringDiff {
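    // Returns [start, end) index pairs, one for each run of positions where
    // s1 and s2 hold the same character at the same index.
    public static List<int[]> from(String s1, String s2) {
        int start = -1; // start of the currently open matching run, or -1 if none
        int pos = 0;
        LinkedList<int[]> list = new LinkedList<int[]>();

        for(; pos < s1.length() && pos < s2.length(); ++pos) {
            if(s1.charAt(pos) == s2.charAt(pos)) {
                if(start < 0) start = pos;
            }
            else {
                // Mismatch: close the open run, if any.
                if(start >= 0) list.add(new int[] { start, pos });
                start = -1;
            }
        }

        // Close a run that extends to the end of the shorter string.
        if(start >= 0) list.add(new int[] { start, pos });

        return list;
    }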

    public static void main(String[] args) {
        for(int[] idx : from(args[0], args[1]))
            System.out.println(args[0].substring(idx[0], idx[1]));
    }
}

An actual diff implementation will be far more sophisticated.
