java 查找给定两个字符串的所有公共子字符串
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/34805488/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Finding all the common substrings of given two strings
提问by neerajdorle
I have come across a problem statement to find the all the common sub-strings between the given two sub-strings such a way that in every case you have to print the longest sub-string. The problem statement is as follows:
我遇到了一个问题陈述,以找到给定的两个子字符串之间的所有公共子字符串,这样在每种情况下您都必须打印最长的子字符串。问题陈述如下:
Write a program to find the common substrings between the two given strings. However, do not include substrings that are contained within longer common substrings.
For example, given the input strings
eatsleepnightxyz
andeatsleepabcxyz
, the results should be:
eatsleep
(due toeatsleepnightxyz
eatsleepabcxyz
)xyz
(due toeatsleepnightxyz
eatsleepabcxyz
)a
(due toeatsleepnightxyz
eatsleepabcxyz
)t
(due toeatsleepnightxyz
eatsleepabcxyz
)However, the result set should notinclude
e
fromeatsleepnightxyz
eatsleepabcxyz
, because bothe
s are already contained in theeatsleep
mentioned above. Nor should you includeea
,eat
,ats
, etc., as those are also all covered byeatsleep
.In this, you don't have to make use of String utility methods like: contains, indexOf, StringTokenizer, split and replace.
编写一个程序来查找两个给定字符串之间的公共子字符串。但是,不要包括包含在较长公共子字符串中的子字符串。
例如,给定输入字符串
eatsleepnightxyz
和eatsleepabcxyz
,结果应该是:
eatsleep
(由于eatsleepnightxyz
eatsleepabcxyz
)xyz
(由于)eatsleepnightxyz
eatsleepabcxyz
a
(由于)eatsleepnightxyz
eatsleepabcxyz
t
(由于)eatsleepnightxyz
eatsleepabcxyz
但是,结果集应不包括
e
从 ,因为这两个s的已经包含在上面提到的。您也不应该包括, ,等,因为这些也都包含在.eatsleepnightxyz
eatsleepabcxyz
e
eatsleep
ea
eat
ats
eatsleep
在这种情况下,您不必使用 String 实用程序方法,例如:contains、indexOf、StringTokenizer、split 和 replace。
My Algorithm is as follows: I am starting with brute force and will switch to more optimized solution when I improve my basic understanding.
我的算法如下:我从蛮力开始,当我提高我的基本理解时会切换到更优化的解决方案。
For String S1:
Find all the substrings of S1 of all the lengths
While doing so: Check if it is also a substring of
S2.
Attempt to figure out the time complexity of my approach.
尝试弄清楚我的方法的时间复杂度。
Let the two given strings be n1-String and n2-String
让两个给定的字符串是 n1-String 和 n2-String
- The number of substrings of S1 is clearly n1(n1+1)/2.
- But we have got to find the average length a substring of S1.
- Let's say it is m. We'll find m separately.
- Time Complexity to check whether an m-String is a substring of an n-String is O(n*m).
- Now, we are checking for each m-String is a substring of S2, which is an n2-String.
- This, as we have seen above, is an O(n2m) algorithm.
- The time required by the overall algorithm then is
- Tn=(Number of substrings in S1) * (average substring lengthtime for character comparison procedure)
- By performing certain calculations, I came to conclusion that the time complexity is O(n3m2)
- Now, our job is to find m in terms of n1.
- S1 的子串数显然是 n1(n1+1)/2。
- 但是我们必须找到 S1 的子串的平均长度。
- 假设它是m。我们将分别找到 m。
- 检查 m-String 是否是 n-String 的子串的时间复杂度为 O(n*m)。
- 现在,我们正在检查每个 m-String 是否是 S2 的子串,也就是一个 n2-String。
- 正如我们在上面看到的,这是一个 O(n 2m) 算法。
- 整个算法所需的时间为
- Tn=(S1 中的子串数)*(字符比较过程的平均子串长度时间)
- 通过执行某些计算,我得出的结论是时间复杂度为 O(n 3m 2)
- 现在,我们的工作是根据 n1 找到 m。
Attempt to find m in terms of n1.
尝试根据 n1 找到 m。
Tn= (n)(1) + (n-1)(2) + (n-2)(3) + ..... + (2)(n-1) + (1)(n)
where Tnis the sum of lengths of all the substrings.
T n= (n)(1) + (n-1)(2) + (n-2)(3) + ..... + (2)(n-1) + (1)(n)
其中T n是所有子串的长度之和。
Average will be the division of this sum by the total number of Substrings produced.
平均值将是该总和除以产生的子串总数。
This, simply is a summation and division problem whose solution is as follows O(n)
这就是一个求和和除法问题,其解如下 O(n)
Therefore...
所以...
Running time of my algorithm is O(n^5).
我的算法的运行时间是 O(n^5)。
With this in mind I wrote the following code:
考虑到这一点,我编写了以下代码:
package pack.common.substrings;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
public class FindCommon2 {
public static final Set<String> commonSubstrings = new LinkedHashSet<String>();
public static void main(String[] args) {
printCommonSubstrings("neerajisgreat", "neerajisnotgreat");
System.out.println(commonSubstrings);
}
public static void printCommonSubstrings(String s1, String s2) {
for (int i = 0; i < s1.length();) {
List<String> list = new ArrayList<String>();
for (int j = i; j < s1.length(); j++) {
String subStr = s1.substring(i, j + 1);
if (isSubstring(subStr, s2)) {
list.add(subStr);
}
}
if (!list.isEmpty()) {
String s = list.get(list.size() - 1);
commonSubstrings.add(s);
i += s.length();
}
}
}
public static boolean isSubstring(String s1, String s2) {
boolean isSubstring = true;
int strLen = s2.length();
int strToCheckLen = s1.length();
if (strToCheckLen > strLen) {
isSubstring = false;
} else {
for (int i = 0; i <= (strLen - strToCheckLen); i++) {
int index = i;
int startingIndex = i;
for (int j = 0; j < strToCheckLen; j++) {
if (!(s1.charAt(j) == s2.charAt(index))) {
break;
} else {
index++;
}
}
if ((index - startingIndex) < strToCheckLen) {
isSubstring = false;
} else {
isSubstring = true;
break;
}
}
}
return isSubstring;
}
}
Explanation for my code:
我的代码的解释:
printCommonSubstrings: Finds all the substrings of S1 and
checks if it is also a substring of
S2.
isSubstring : As the name suggests, it checks if the given string
is a substring of the other string.
Issue: Given the inputs
问题:给定输入
S1 = “neerajisgreat”;
S2 = “neerajisnotgreat”
S3 = “rajeatneerajisnotgreat”
In case of S1 and S2, the output should be: neerajis
and great
but in case of S1 and S3, the output should have been:
neerajis
, raj
, great
, eat
but still I am getting neerajis
and great
as output. I need to figure this out.
在S1和S2的情况下,输出应该是:neerajis
和great
,但在S1和S3的情况下,输出应该是:
neerajis
,raj
,great
,eat
但还是我得到neerajis
和great
作为输出。我需要弄清楚这一点。
How should I design my code?
我应该如何设计我的代码?
采纳答案by 200_success
You would be better off with a proper algorithm for the task rather than a brute-force approach. Wikipedia describes two common solutions to the longest common substring problem: suffix-treeand dynamic-programming.
为任务使用适当的算法而不是蛮力方法会更好。维基百科描述了最长公共子串问题的两种常见解决方案:后缀树和动态规划。
The dynamic programming solution takes O(n m) time and O(n m) space. This is pretty much a straightforward Java translation of the Wikipedia pseudocode for the longest common substring:
动态规划解决方案需要 O( nm) 时间和 O( nm) 空间。这几乎是最长公共子串的维基百科伪代码的直接 Java 翻译:
public static Set<String> longestCommonSubstrings(String s, String t) {
int[][] table = new int[s.length()][t.length()];
int longest = 0;
Set<String> result = new HashSet<>();
for (int i = 0; i < s.length(); i++) {
for (int j = 0; j < t.length(); j++) {
if (s.charAt(i) != t.charAt(j)) {
continue;
}
table[i][j] = (i == 0 || j == 0) ? 1
: 1 + table[i - 1][j - 1];
if (table[i][j] > longest) {
longest = table[i][j];
result.clear();
}
if (table[i][j] == longest) {
result.add(s.substring(i - longest + 1, i + 1));
}
}
}
return result;
}
Now, you want all of the common substrings, not just the longest. You can enhance this algorithm to include shorter results. Let's examine the table for the example inputs eatsleepnightxyz
and eatsleepabcxyz
:
现在,您需要所有公共子字符串,而不仅仅是最长的。您可以增强此算法以包含更短的结果。让我们检查示例输入的表格,eatsleepnightxyz
并且eatsleepabcxyz
:
e a t s l e e p a b c x y z
e 1 0 0 0 0 1 1 0 0 0 0 0 0 0
a 0 2 0 0 0 0 0 0 1 0 0 0 0 0
t 0 0 3 0 0 0 0 0 0 0 0 0 0 0
s 0 0 0 4 0 0 0 0 0 0 0 0 0 0
l 0 0 0 0 5 0 0 0 0 0 0 0 0 0
e 1 0 0 0 0 6 1 0 0 0 0 0 0 0
e 1 0 0 0 0 1 7 0 0 0 0 0 0 0
p 0 0 0 0 0 0 0 8 0 0 0 0 0 0
n 0 0 0 0 0 0 0 0 0 0 0 0 0 0
i 0 0 0 0 0 0 0 0 0 0 0 0 0 0
g 0 0 0 0 0 0 0 0 0 0 0 0 0 0
h 0 0 0 0 0 0 0 0 0 0 0 0 0 0
t 0 0 1 0 0 0 0 0 0 0 0 0 0 0
x 0 0 0 0 0 0 0 0 0 0 0 1 0 0
y 0 0 0 0 0 0 0 0 0 0 0 0 2 0
z 0 0 0 0 0 0 0 0 0 0 0 0 0 3
- The
eatsleep
result is obvious: that's the12345678
diagonal streak at the top-left. - The
xyz
result is the123
diagonal at the bottom-right. - The
a
result is indicated by the1
near the top (second row, ninth column). - The
t
result is indicated by the1
near the bottom left.
- 该
eatsleep
结果是显而易见的:那就是12345678
在左上角的斜条纹。 - 该
xyz
结果是123
在右下角的对角线。 - 该
a
结果由表示1
靠近顶部(第二行,第九列)。 - 该
t
结果是由指定的1
左下角附近。
What about the other 1
s at the left, the top, and next to the 6
and 7
? Those don't count because they appear within the rectangle formed by the 12345678
diagonal — in other words, they are already covered by eatsleep
.
那其他1
的左侧,顶部和旁边的S6
和7
?那些不算数,因为它们出现在12345678
对角线形成的矩形内——换句话说,它们已经被 覆盖了eatsleep
。
I recommend doing one pass doing nothing but building the table. Then, make a second pass, iterating backwards from the bottom-right, to gather the result set.
我建议只做一次传递,只做一张桌子。然后,进行第二遍,从右下角向后迭代,以收集结果集。
回答by ktbiz
Typically this type of substring matching is done with the assistance of a separate data structure called a Trie(pronounced try). The specific variant that best suits this problem is a suffix tree. Your first step should be to take your inputs and build a suffix tree. Then you'll need to use the suffix tree to determine the longest common substring, which is a good exercise.
通常,这种类型的子串匹配是在称为Trie(发音为 try)的单独数据结构的帮助下完成的。最适合此问题的特定变体是后缀树。您的第一步应该是接受您的输入并构建后缀树。然后你需要使用后缀树来确定最长的公共子串,这是一个很好的练习。