java 查找给定两个字符串的所有公共子字符串

Question

提问by neerajdorle

I have come across a problem statement to find the all the common sub-strings between the given two sub-strings such a way that in every case you have to print the longest sub-string. The problem statement is as follows:

我遇到了一个问题陈述，以找到给定的两个子字符串之间的所有公共子字符串，这样在每种情况下您都必须打印最长的子字符串。问题陈述如下：

Write a program to find the common substrings between the two given strings. However, do not include substrings that are contained within longer common substrings.
For example, given the input strings eatsleepnightxyzand eatsleepabcxyz, the results should be:
eatsleep(due to eatsleepnightxyzeatsleepabcxyz)
xyz(due to eatsleepnightxyzeatsleepabcxyz)
a(due to eatsleepnightxyzeatsleepabcxyz)
t(due to eatsleepnightxyzeatsleepabcxyz)
However, the result set should notinclude efrom eatsleepnightxyzeatsleepabcxyz, because both es are already contained in the eatsleepmentioned above. Nor should you include ea, eat, ats, etc., as those are also all covered by eatsleep.
In this, you don't have to make use of String utility methods like: contains, indexOf, StringTokenizer, split and replace.

编写一个程序来查找两个给定字符串之间的公共子字符串。但是，不要包括包含在较长公共子字符串中的子字符串。
例如，给定输入字符串eatsleepnightxyz和eatsleepabcxyz，结果应该是：
eatsleep（由于eatsleepnightxyzeatsleepabcxyz）
xyz（由于）eatsleepnightxyzeatsleepabcxyz
a（由于）eatsleepnightxyzeatsleepabcxyz
t（由于）eatsleepnightxyzeatsleepabcxyz
但是，结果集应不包括e从，因为这两个s的已经包含在上面提到的。您也不应该包括, ,等，因为这些也都包含在.eatsleepnightxyzeatsleepabcxyzeeatsleepeaeatatseatsleep
在这种情况下，您不必使用 String 实用程序方法，例如：contains、indexOf、StringTokenizer、split 和 replace。

My Algorithm is as follows: I am starting with brute force and will switch to more optimized solution when I improve my basic understanding.

我的算法如下：我从蛮力开始，当我提高我的基本理解时会切换到更优化的解决方案。

 For String S1:
     Find all the substrings of S1 of all the lengths
     While doing so: Check if it is also a substring of 
     S2.

Attempt to figure out the time complexity of my approach.

尝试弄清楚我的方法的时间复杂度。

Let the two given strings be n1-String and n2-String

让两个给定的字符串是 n1-String 和 n2-String

The number of substrings of S1 is clearly n1(n1+1)/2.
But we have got to find the average length a substring of S1.
Let's say it is m. We'll find m separately.
Time Complexity to check whether an m-String is a substring of an n-String is O(n*m).
Now, we are checking for each m-String is a substring of S2, which is an n2-String.
This, as we have seen above, is an O(n²m) algorithm.
The time required by the overall algorithm then is
Tn=(Number of substrings in S1) * (average substring lengthtime for character comparison procedure)
By performing certain calculations, I came to conclusion that the time complexity is O(n³m²)
Now, our job is to find m in terms of n1.

S1 的子串数显然是 n1(n1+1)/2。
但是我们必须找到 S1 的子串的平均长度。
假设它是m。我们将分别找到 m。
检查 m-String 是否是 n-String 的子串的时间复杂度为 O(n*m)。
现在，我们正在检查每个 m-String 是否是 S2 的子串，也就是一个 n2-String。
正如我们在上面看到的，这是一个 O(n ²m) 算法。
整个算法所需的时间为
Tn=（S1 中的子串数）*（字符比较过程的平均子串长度时间）
通过执行某些计算，我得出的结论是时间复杂度为 O(n ³m ²)
现在，我们的工作是根据 n1 找到 m。

Attempt to find m in terms of n1.

尝试根据 n1 找到 m。

T_n= (n)(1) + (n-1)(2) + (n-2)(3) + ..... + (2)(n-1) + (1)(n)
where T_nis the sum of lengths of all the substrings.

T _n= (n)(1) + (n-1)(2) + (n-2)(3) + ..... + (2)(n-1) + (1)(n)
其中T _n是所有子串的长度之和。

Average will be the division of this sum by the total number of Substrings produced.

平均值将是该总和除以产生的子串总数。

This, simply is a summation and division problem whose solution is as follows O(n)

这就是一个求和和除法问题，其解如下 O(n)

Therefore...

所以...

Running time of my algorithm is O(n^5).

我的算法的运行时间是 O(n^5)。

With this in mind I wrote the following code:

考虑到这一点，我编写了以下代码：

 package pack.common.substrings;

 import java.util.ArrayList;
 import java.util.LinkedHashSet;
 import java.util.List;
 import java.util.Set;

 public class FindCommon2 {
    public static final Set<String> commonSubstrings = new      LinkedHashSet<String>();

 public static void main(String[] args) {
    printCommonSubstrings("neerajisgreat", "neerajisnotgreat");
    System.out.println(commonSubstrings);
}

 public static void printCommonSubstrings(String s1, String s2) {
    for (int i = 0; i < s1.length();) {
        List<String> list = new ArrayList<String>();
        for (int j = i; j < s1.length(); j++) {
            String subStr = s1.substring(i, j + 1);
            if (isSubstring(subStr, s2)) {
                list.add(subStr);
            }
        }
        if (!list.isEmpty()) {
            String s = list.get(list.size() - 1);
            commonSubstrings.add(s);
            i += s.length();
        }
    }
 }

 public static boolean isSubstring(String s1, String s2) {
    boolean isSubstring = true;
    int strLen = s2.length();
    int strToCheckLen = s1.length();
    if (strToCheckLen > strLen) {
        isSubstring = false;
    } else {
        for (int i = 0; i <= (strLen - strToCheckLen); i++) {
            int index = i;
            int startingIndex = i;
            for (int j = 0; j < strToCheckLen; j++) {
                if (!(s1.charAt(j) == s2.charAt(index))) {
                    break;
                } else {
                    index++;
                }
            }
            if ((index - startingIndex) < strToCheckLen) {
                isSubstring = false;
            } else {
                isSubstring = true;
                break;
            }
        }
    }
    return isSubstring;
 }
}

Explanation for my code:

我的代码的解释：

 printCommonSubstrings: Finds all the substrings of S1 and 
                        checks if it is also a substring of 
                        S2.
 isSubstring : As the name suggests, it checks if the given string 
               is a substring of the other string.

Issue: Given the inputs

问题：给定输入

  S1 = “neerajisgreat”;
  S2 = “neerajisnotgreat”
  S3 = “rajeatneerajisnotgreat”

In case of S1 and S2, the output should be: neerajisand greatbut in case of S1 and S3, the output should have been: neerajis, raj, great, eatbut still I am getting neerajisand greatas output. I need to figure this out.

在S1和S2的情况下，输出应该是：neerajis和great，但在S1和S3的情况下，输出应该是： neerajis，raj，great，eat但还是我得到neerajis和great作为输出。我需要弄清楚这一点。

How should I design my code?

我应该如何设计我的代码？

Answer 1

采纳答案by 200_success

You would be better off with a proper algorithm for the task rather than a brute-force approach. Wikipedia describes two common solutions to the longest common substring problem: suffix-treeand dynamic-programming.

为任务使用适当的算法而不是蛮力方法会更好。维基百科描述了最长公共子串问题的两种常见解决方案：后缀树和动态规划。

The dynamic programming solution takes O(n m) time and O(n m) space. This is pretty much a straightforward Java translation of the Wikipedia pseudocode for the longest common substring:

动态规划解决方案需要 O( nm) 时间和 O( nm) 空间。这几乎是最长公共子串的维基百科伪代码的直接 Java 翻译：

public static Set<String> longestCommonSubstrings(String s, String t) {
    int[][] table = new int[s.length()][t.length()];
    int longest = 0;
    Set<String> result = new HashSet<>();

    for (int i = 0; i < s.length(); i++) {
        for (int j = 0; j < t.length(); j++) {
            if (s.charAt(i) != t.charAt(j)) {
                continue;
            }

            table[i][j] = (i == 0 || j == 0) ? 1
                                             : 1 + table[i - 1][j - 1];
            if (table[i][j] > longest) {
                longest = table[i][j];
                result.clear();
            }
            if (table[i][j] == longest) {
                result.add(s.substring(i - longest + 1, i + 1));
            }
        }
    }
    return result;
}

Now, you want all of the common substrings, not just the longest. You can enhance this algorithm to include shorter results. Let's examine the table for the example inputs eatsleepnightxyzand eatsleepabcxyz:

现在，您需要所有公共子字符串，而不仅仅是最长的。您可以增强此算法以包含更短的结果。让我们检查示例输入的表格，eatsleepnightxyz并且eatsleepabcxyz：

  e a t s l e e p a b c x y z
e 1 0 0 0 0 1 1 0 0 0 0 0 0 0
a 0 2 0 0 0 0 0 0 1 0 0 0 0 0
t 0 0 3 0 0 0 0 0 0 0 0 0 0 0
s 0 0 0 4 0 0 0 0 0 0 0 0 0 0
l 0 0 0 0 5 0 0 0 0 0 0 0 0 0
e 1 0 0 0 0 6 1 0 0 0 0 0 0 0
e 1 0 0 0 0 1 7 0 0 0 0 0 0 0
p 0 0 0 0 0 0 0 8 0 0 0 0 0 0
n 0 0 0 0 0 0 0 0 0 0 0 0 0 0
i 0 0 0 0 0 0 0 0 0 0 0 0 0 0
g 0 0 0 0 0 0 0 0 0 0 0 0 0 0
h 0 0 0 0 0 0 0 0 0 0 0 0 0 0
t 0 0 1 0 0 0 0 0 0 0 0 0 0 0
x 0 0 0 0 0 0 0 0 0 0 0 1 0 0
y 0 0 0 0 0 0 0 0 0 0 0 0 2 0
z 0 0 0 0 0 0 0 0 0 0 0 0 0 3

The eatsleepresult is obvious: that's the 12345678diagonal streak at the top-left.
The xyzresult is the 123diagonal at the bottom-right.
The aresult is indicated by the 1near the top (second row, ninth column).
The tresult is indicated by the 1near the bottom left.

该eatsleep结果是显而易见的：那就是12345678在左上角的斜条纹。
该xyz结果是123在右下角的对角线。
该a结果由表示1靠近顶部（第二行，第九列）。
该t结果是由指定的1左下角附近。

What about the other 1s at the left, the top, and next to the 6and 7? Those don't count because they appear within the rectangle formed by the 12345678diagonal — in other words, they are already covered by eatsleep.

那其他1的左侧，顶部和旁边的S6和7？那些不算数，因为它们出现在12345678对角线形成的矩形内——换句话说，它们已经被覆盖了eatsleep。

I recommend doing one pass doing nothing but building the table. Then, make a second pass, iterating backwards from the bottom-right, to gather the result set.

我建议只做一次传递，只做一张桌子。然后，进行第二遍，从右下角向后迭代，以收集结果集。

Answer 2

回答by ktbiz

Typically this type of substring matching is done with the assistance of a separate data structure called a Trie(pronounced try). The specific variant that best suits this problem is a suffix tree. Your first step should be to take your inputs and build a suffix tree. Then you'll need to use the suffix tree to determine the longest common substring, which is a good exercise.

通常，这种类型的子串匹配是在称为Trie（发音为 try）的单独数据结构的帮助下完成的。最适合此问题的特定变体是后缀树。您的第一步应该是接受您的输入并构建后缀树。然后你需要使用后缀树来确定最长的公共子串，这是一个很好的练习。

java 查找给定两个字符串的所有公共子字符串

提问by neerajdorle

采纳答案by 200_success

回答by ktbiz

相关推荐

最近更新

标签

java 查找给定两个字符串的所有公共子字符串

提问by neerajdorle

采纳答案by 200_success

回答by ktbiz

相关推荐

java Apache Maven 无法在 cmd 中工作

java 测试时带有数据源的 Spring Boot

如何将 jndi 查找从 xml 转换为 java 配置

无法通过 selenium webdriver(Java) 在 Chromedriver 中向下滚动

相关推荐

最近更新

标签