如何使用 C++ 找到最长公共子串

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/10248728/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-27 13:47:49  来源:igfitidea点击:

How to find Longest Common Substring using C++

c++algorithmlcs

提问by David Gomes

I searched online for a C++ Longest Common Substring implementation but failed to find a decent one. I need a LCS algorithm that returns the substring itself, so it's not just LCS.

我在网上搜索了 C++ 最长公共子串的实现,但没有找到合适的。我需要一个返回子串本身的 LCS 算法,所以它不仅仅是 LCS。

I was wondering, though, about how I can do this between multiple strings.

不过,我想知道如何在多个字符串之间执行此操作。

My idea was to check the longest one between 2 strings, and then go check all the others, but this is a very slow process which requires managing many long strings on the memory, making my program quite slow.

我的想法是检查 2 个字符串之间最长的一个,然后检查所有其他字符串,但这是一个非常缓慢的过程,需要在内存中管理许多长字符串,这使得我的程序非常慢。

Any idea of how this can be speeded up for multiple strings? Thank you.

知道如何为多个字符串加速吗?谢谢你。

Important EditOne of the variables I'm given determines the number of strings the longest common substring needs to be in, so I can be given 10 strings, and find the LCS of them all (K=10), or LCS of 4 of them, but I'm not told which 4, I have to find the best 4.

重要编辑我给出的变量之一决定了最长公共子串需要包含的字符串数,所以我可以得到 10 个字符串,并找到它们的 LCS(K=10),或 4 个的 LCS他们,但我没有被告知是哪 4 个,我必须找到最好的 4 个。

采纳答案by Adrian McCarthy

Here is an excellent article on finding all common substringsefficiently, with examples in C. This may be overkill if you need just the longest, but it may be easier to understand than the general articles about suffix trees.

这是一篇关于高效查找所有常见子字符串的优秀文章,并附有 C 语言示例。如果您只需要最长的,这可能有点矫枉过正,但它可能比关于后缀树的一般文章更容易理解。

回答by Lxcypp

The answer is GENERALISED SUFFIX TREE. http://en.wikipedia.org/wiki/Generalised_suffix_tree

答案是通用后缀树。http://en.wikipedia.org/wiki/Generalised_suffix_tree

You can build a generalised suffix tree with multiple string.

您可以构建具有多个字符串的广义后缀树。

Look at this http://en.wikipedia.org/wiki/Longest_common_substring_problem

看看这个http://en.wikipedia.org/wiki/Longest_common_substring_problem

The Suffix tree can be built in O(n) time for each string, k*O(n) in total. K is total number of strings.

后缀树可以在 O(n) 时间内为每个字符串构建,总共 k*O(n)。K 是字符串的总数。

So it's very quick to solve this problem.

所以解决这个问题非常快。

回答by snegi

This is a dynamic programming problem and can be solved in O(mn) time, where m is the length of one string and n is of other.

这是一个动态规划问题,可以在 O(mn) 时间内解决,其中 m 是一个字符串的长度,n 是另一个字符串的长度。

Like any other problem solved using Dynamic Programming, we will divide the problem into subproblem. Lets say if two strings are x1x2x3....xm and y1y2y3...yn

与使用动态规划解决的任何其他问题一样,我们将把问题分解为子问题。假设两个字符串是 x1x2x3....xm 和 y1y2y3...yn

S(i,j) is the longest common string for x1x2x3...xi and y1y2y3....yj, then

S(i,j) 是 x1x2x3...xi 和 y1y2y3....yj 的最长公共字符串,则

S(i,j) = max { length of longest common substring ending at xi/yj, if ( x[i] == y[j] ), S(i-1, j-1), S(i, j-1), S(i-1, j) }

S(i,j) = max { 以 xi/yj 结尾的最长公共子串的长度, if ( x[i] == y[j] ), S(i-1, j-1), S(i, j -1), S(i-1, j) }

Here is working program in Java. I am sure you can convert it to C++.:

这是 Java 中的工作程序。我相信您可以将其转换为 C++。:

public class LongestCommonSubstring {

    public static void main(String[] args) {
        String str1 = "abcdefgijkl";
        String str2 = "mnopabgijkw";
        System.out.println(getLongestCommonSubstring(str1,str2));
    }

    public static String getLongestCommonSubstring(String str1, String str2) {
        //Note this longest[][] is a standard auxialry memory space used in Dynamic
                //programming approach to save results of subproblems. 
                //These results are then used to calculate the results for bigger problems
        int[][] longest = new int[str2.length() + 1][str1.length() + 1];
        int min_index = 0, max_index = 0;

                //When one string is of zero length, then longest common substring length is 0
        for(int idx = 0; idx < str1.length() + 1; idx++) {
            longest[0][idx] = 0;
        }

        for(int idx = 0; idx < str2.length() + 1; idx++) {
            longest[idx][0] = 0;
        }

        for(int i = 0; i <  str2.length(); i++) {
            for(int j = 0; j < str1.length(); j++) {

                int tmp_min = j, tmp_max = j, tmp_offset = 0;

                if(str2.charAt(i) == str1.charAt(j)) {
                    //Find length of longest common substring ending at i/j
                    while(tmp_offset <= i && tmp_offset <= j &&
                            str2.charAt(i - tmp_offset) == str1.charAt(j - tmp_offset)) {

                        tmp_min--;
                        tmp_offset++;

                    }
                }
                //tmp_min will at this moment contain either < i,j value or the index that does not match
                //So increment it to the index that matches.
                tmp_min++;

                //Length of longest common substring ending at i/j
                int length = tmp_max - tmp_min + 1;
                //Find the longest between S(i-1,j), S(i-1,j-1), S(i, j-1)
                int tmp_max_length = Math.max(longest[i][j], Math.max(longest[i+1][j], longest[i][j+1]));

                if(length > tmp_max_length) {
                    min_index = tmp_min;
                    max_index = tmp_max;
                    longest[i+1][j+1] = length;
                } else {
                    longest[i+1][j+1] = tmp_max_length;
                }


            }
        }

        return str1.substring(min_index, max_index >= str1.length() - 1 ? str1.length() - 1 : max_index + 1);
    }
}

回答by Ankesh Anand

There is a very elegant Dynamic Programming solution to this.

对此有一个非常优雅的动态编程解决方案。

Let LCSuff[i][j]be the longest common suffix between X[1..m]and Y[1..n]. We have two cases here:

LCSuff[i][j]成为X[1..m]和之间最长的公共后缀Y[1..n]。我们这里有两种情况:

  • X[i] == Y[j], that means we can extend the longest common suffix between X[i-1]and Y[j-1]. Thus LCSuff[i][j] = LCSuff[i-1][j-1] + 1in this case.

  • X[i] != Y[j], since the last characters themselves are different, X[1..i]and Y[1..j]can't have a common suffix. Hence, LCSuff[i][j] = 0in this case.

  • X[i] == Y[j],这意味着我们可以在X[i-1]和之间扩展最长公共后缀Y[j-1]。因此LCSuff[i][j] = LCSuff[i-1][j-1] + 1在这种情况下。

  • X[i] != Y[j],因为最后一个字符本身是不同的,X[1..i]并且Y[1..j]不能有一个共同的后缀。因此,LCSuff[i][j] = 0在这种情况下。

We now need to check maximal of these longest common suffixes.

我们现在需要检查这些最长公共后缀中的最大值。

So, LCSubstr(X,Y) = max(LCSuff(i,j)), where 1<=i<=mand 1<=j<=n

所以LCSubstr(X,Y) = max(LCSuff(i,j)),哪里1<=i<=m1<=j<=n

The algorithm pretty much writes itself now.

该算法现在几乎可以自行编写。

string LCSubstr(string x, string y){
    int m = x.length(), n=y.length();

    int LCSuff[m][n];

    for(int j=0; j<=n; j++)
        LCSuff[0][j] = 0;
    for(int i=0; i<=m; i++)
        LCSuff[i][0] = 0;

    for(int i=1; i<=m; i++){
        for(int j=1; j<=n; j++){
            if(x[i-1] == y[j-1])
                LCSuff[i][j] = LCSuff[i-1][j-1] + 1;
            else
                LCSuff[i][j] = 0;
        }
    }

    string longest = "";
    for(int i=1; i<=m; i++){
        for(int j=1; j<=n; j++){
            if(LCSuff[i][j] > longest.length())
                longest = x.substr((i-LCSuff[i][j]+1) -1, LCSuff[i][j]);
        }
    }
    return longest;
}

回答by Cormac Guerin

I tried several different solutions for this but they all seemed really slow so I came up with the below, didn't really test much, but it seems to work a bit faster for me.

我为此尝试了几种不同的解决方案,但它们看起来都很慢,所以我想出了下面的方法,并没有真正进行太多测试,但对我来说似乎工作得更快一些。

#include <iostream>

std::string lcs( std::string a, std::string b )
{
    if( a.empty() || b.empty() ) return {} ;

    std::string current_lcs = "";

    for(int i=0; i< a.length(); i++) {
        size_t fpos = b.find(a[i], 0);
        while(fpos != std::string::npos) {
            std::string tmp_lcs = "";
            tmp_lcs += a[i];
            for (int x = fpos+1; x < b.length(); x++) {
                tmp_lcs+=b[x];
                size_t spos = a.find(tmp_lcs, 0);
                if (spos == std::string::npos) {
                    break;
                } else {
                    if (tmp_lcs.length() > current_lcs.length()) {
                        current_lcs = tmp_lcs;
                    }
                }
            }
            fpos = b.find(a[i], fpos+1);
        }
    }
    return current_lcs;
}

int main(int argc, char** argv)
{
    std::cout << lcs(std::string(argv[1]), std::string(argv[2])) << std::endl;
}

回答by Dreamer

Here is a C# version to find the Longest Common Substring using dynamic programming of two arrays (you may refer to: http://codingworkout.blogspot.com/2014/07/longest-common-substring.htmlfor more details)

这是一个 C# 版本,它使用两个数组的动态规划来查找最长公共子串(更多详细信息,您可以参考:http: //codingworkout.blogspot.com/2014/07/longest-common-substring.html

class LCSubstring
        {
            public int Length = 0;
            public List<Tuple<int, int>> indices = new List<Tuple<int, int>>();
        }
        public string[] LongestCommonSubStrings(string A, string B)
        {
            int[][] DP_LCSuffix_Cache = new int[A.Length+1][];
            for (int i = 0; i <= A.Length; i++)
            {
                DP_LCSuffix_Cache[i] = new int[B.Length + 1];
            }
            LCSubstring lcsSubstring = new LCSubstring();
            for (int i = 1; i <= A.Length; i++)
            {
                for (int j = 1; j <= B.Length; j++)
                {
                    //LCSuffix(Xi, Yj) = 0 if X[i] != X[j]
                    //                 = LCSuffix(Xi-1, Yj-1) + 1 if Xi = Yj
                    if (A[i - 1] == B[j - 1])
                    {
                        int lcSuffix = 1 + DP_LCSuffix_Cache[i - 1][j - 1];
                        DP_LCSuffix_Cache[i][j] = lcSuffix;
                        if (lcSuffix > lcsSubstring.Length)
                        {
                            lcsSubstring.Length = lcSuffix;
                            lcsSubstring.indices.Clear();
                            var t = new Tuple<int, int>(i, j);
                            lcsSubstring.indices.Add(t);
                        }
                        else if(lcSuffix == lcsSubstring.Length)
                        {
                            //may be more than one longest common substring
                            lcsSubstring.indices.Add(new Tuple<int, int>(i, j));
                        }
                    }
                    else
                    {
                        DP_LCSuffix_Cache[i][j] = 0;
                    }
                }
            }
            if(lcsSubstring.Length > 0)
            {
                List<string> substrings = new List<string>();
                foreach(Tuple<int, int> indices in lcsSubstring.indices)
                {
                    string s = string.Empty;
                    int i = indices.Item1 - lcsSubstring.Length;
                    int j = indices.Item2 - lcsSubstring.Length;
                    Assert.IsTrue(DP_LCSuffix_Cache[i][j] == 0);
                    for(int l =0; l<lcsSubstring.Length;l++)
                    {
                        s += A[i];
                        Assert.IsTrue(A[i] == B[j]);
                        i++;
                        j++;
                    }
                    Assert.IsTrue(i == indices.Item1);
                    Assert.IsTrue(j == indices.Item2);
                    Assert.IsTrue(DP_LCSuffix_Cache[i][j] == lcsSubstring.Length);
                    substrings.Add(s);
                }
                return substrings.ToArray();
            }
            return new string[0];
        }

Where unit tests are:

单元测试在哪里:

[TestMethod]
        public void LCSubstringTests()
        {
            string A = "ABABC", B = "BABCA";
            string[] substrings = this.LongestCommonSubStrings(A, B);
            Assert.IsTrue(substrings.Length == 1);
            Assert.IsTrue(substrings[0] == "BABC");
            A = "ABCXYZ"; B = "XYZABC";
            substrings = this.LongestCommonSubStrings(A, B);
            Assert.IsTrue(substrings.Length == 2);
            Assert.IsTrue(substrings.Any(s => s == "ABC"));
            Assert.IsTrue(substrings.Any(s => s == "XYZ"));
            A = "ABC"; B = "UVWXYZ";
            string substring = "";
            for(int i =1;i<=10;i++)
            {
                A += i;
                B += i;
                substring += i;
                substrings = this.LongestCommonSubStrings(A, B);
                Assert.IsTrue(substrings.Length == 1);
                Assert.IsTrue(substrings[0] == substring);
            }
        }

回答by Iceman

Find the largest substring from all strings under consideration. From N strings, you'll have N substrings. Choose the largest of those N.

从所有考虑的字符串中找出最大的子字符串。从 N 个字符串中,您将有 N 个子字符串。选择 N 中最大的一个。