Python 查找两个字符串之间的公共子字符串

Question

提问by NorthSide

I'd like to compare 2 strings and keep the matched, splitting off where the comparison fails.

我想比较 2 个字符串并保持匹配，在比较失败的地方分开。

So if I have 2 strings -

所以如果我有 2 个字符串 -

string1 = apples
string2 = appleses

answer = apples

Another example, as the string could have more than one word.

另一个例子，因为字符串可能有多个单词。

string1 = apple pie available
string2 = apple pies

answer = apple pie

I'm sure there is a simple Python way of doing this but I can't work it out, any help and explanation appreciated.

我确信有一种简单的 Python 方法可以做到这一点，但我无法解决，感谢任何帮助和解释。

Answer 1

采纳答案by thefourtheye

Its called Longest Common Substring problem. Here I present a simple, easy to understand but inefficient solution. It will take a long time to produce correct output for large strings, as the complexity of this algorithm is O(N^2).

它被称为最长公共子串问题。在这里，我提出了一个简单易懂但效率低下的解决方案。为大字符串生成正确的输出需要很长时间，因为该算法的复杂度为 O(N^2)。

def longestSubstringFinder(string1, string2):
    answer = ""
    len1, len2 = len(string1), len(string2)
    for i in range(len1):
        match = ""
        for j in range(len2):
            if (i + j < len1 and string1[i + j] == string2[j]):
                match += string2[j]
            else:
                if (len(match) > len(answer)): answer = match
                match = ""
    return answer

print longestSubstringFinder("apple pie available", "apple pies")
print longestSubstringFinder("apples", "appleses")
print longestSubstringFinder("bapples", "cappleses")

Output

输出

apple pie
apples
apples

Answer 2

回答by Eric

def common_start(sa, sb):
    """ returns the longest common substring from the beginning of sa and sb """
    def _iter():
        for a, b in zip(sa, sb):
            if a == b:
                yield a
            else:
                return

    return ''.join(_iter())

>>> common_start("apple pie available", "apple pies")
'apple pie'

Or a slightly stranger way:

或者稍微奇怪的方式：

def stop_iter():
    """An easy way to break out of a generator"""
    raise StopIteration

def common_start(sa, sb):
    return ''.join(a if a == b else stop_iter() for a, b in zip(sa, sb))

Which might be more readable as

哪个可能更具可读性

def terminating(cond):
    """An easy way to break out of a generator"""
    if cond:
        return True
    raise StopIteration

def common_start(sa, sb):
    return ''.join(a for a, b in zip(sa, sb) if terminating(a == b))

Answer 3

回答by Birei

Try:

尝试：

import itertools as it
''.join(el[0] for el in it.takewhile(lambda t: t[0] == t[1], zip(string1, string2)))

It does the comparison from the beginning of both strings.

它从两个字符串的开头进行比较。

Answer 4

回答by modulus

Returns the first longest common substring:

返回第一个最长公共子串：

def compareTwoStrings(string1, string2):
    list1 = list(string1)
    list2 = list(string2)

    match = []
    output = ""
    length = 0

    for i in range(0, len(list1)):

        if list1[i] in list2:
            match.append(list1[i])

            for j in range(i + 1, len(list1)):

                if ''.join(list1[i:j]) in string2:
                    match.append(''.join(list1[i:j]))

                else:
                    continue
        else:
            continue

    for string in match:

        if length < len(list(string)):
            length = len(list(string))
            output = string

        else:
            continue

    return output

Answer 5

回答by Rali Tsanova

This isn't the most efficient way to do it but it's what I could come up with and it works. If anyone can improve it, please do. What it does is it makes a matrix and puts 1 where the characters match. Then it scans the matrix to find the longest diagonal of 1s, keeping track of where it starts and ends. Then it returns the substring of the input string with the start and end positions as arguments.

这不是最有效的方法，但这是我能想出的，而且很有效。如果有人可以改进它，请做。它的作用是创建一个矩阵并将 1 放在字符匹配的位置。然后它扫描矩阵以找到 1s 的最长对角线，跟踪它的起点和终点。然后它以开始和结束位置作为参数返回输入字符串的子字符串。

Note: This only finds one longest common substring. If there's more than one, you could make an array to store the results in and return that Also, it's case sensitive so (Apple pie, apple pie) will return pple pie.

注意：这只会找到一个最长的公共子串。如果有多个，您可以创建一个数组来存储结果并返回该数组此外，它区分大小写，因此（Apple pie，apple pie）将返回 pple pie。

def longestSubstringFinder(str1, str2):
answer = ""

if len(str1) == len(str2):
    if str1==str2:
        return str1
    else:
        longer=str1
        shorter=str2
elif (len(str1) == 0 or len(str2) == 0):
    return ""
elif len(str1)>len(str2):
    longer=str1
    shorter=str2
else:
    longer=str2
    shorter=str1

matrix = numpy.zeros((len(shorter), len(longer)))

for i in range(len(shorter)):
    for j in range(len(longer)):               
        if shorter[i]== longer[j]:
            matrix[i][j]=1

longest=0

start=[-1,-1]
end=[-1,-1]    
for i in range(len(shorter)-1, -1, -1):
    for j in range(len(longer)):
        count=0
        begin = [i,j]
        while matrix[i][j]==1:

            finish=[i,j]
            count=count+1 
            if j==len(longer)-1 or i==len(shorter)-1:
                break
            else:
                j=j+1
                i=i+1

        i = i-count
        if count>longest:
            longest=count
            start=begin
            end=finish
            break

answer=shorter[int(start[0]): int(end[0])+1]
return answer

Answer 6

回答by SergeyR

The same as Evo's, but with arbitrary number of strings to compare:

与Evo 的相同，但可以比较任意数量的字符串：

def common_start(*strings):
    """ Returns the longest common substring
        from the beginning of the `strings`
    """
    def _iter():
        for z in zip(*strings):
            if z.count(z[0]) == len(z):  # check all elements in `z` are the same
                yield z[0]
            else:
                return

    return ''.join(_iter())

Answer 7

回答by wwii

First a helperfunction adapted from the itertools pairwise recipeto produce substrings.

首先是一个改编自itertools 成对配方的辅助函数来生成子字符串。

import itertools
def n_wise(iterable, n = 2):
    '''n = 2 -> (s0,s1), (s1,s2), (s2, s3), ...

    n = 3 -> (s0,s1, s2), (s1,s2, s3), (s2, s3, s4), ...'''
    a = itertools.tee(iterable, n)
    for x, thing in enumerate(a[1:]):
        for _ in range(x+1):
            next(thing, None)
    return zip(*a)

Then a function the iterates over substrings, longest first, and tests for membership. (efficiency not considered)

然后一个函数迭代子串，最长的在前，并测试成员资格。（不考虑效率）

def foo(s1, s2):
    '''Finds the longest matching substring
    '''
    # the longest matching substring can only be as long as the shortest string
    #which string is shortest?
    shortest, longest = sorted([s1, s2], key = len)
    #iterate over substrings, longest substrings first
    for n in range(len(shortest)+1, 2, -1):
        for sub in n_wise(shortest, n):
            sub = ''.join(sub)
            if sub in longest:
                #return the first one found, it should be the longest
                return sub

s = "fdomainster"
t = "exdomainid"
print(foo(s,t))

>>> 
domain
>>>

Answer 8

回答by RickardSjogren

For completeness, difflibin the standard-library provides loads of sequence-comparison utilities. For instance find_longest_matchwhich finds the longest common substring when used on strings. Example use:

为了完整起见，difflib标准库中提供了大量序列比较实用程序。例如find_longest_match，它在字符串上使用时找到最长的公共子字符串。使用示例：

from difflib import SequenceMatcher

string1 = "apple pie available"
string2 = "come have some apple pies"

match = SequenceMatcher(None, string1, string2).find_longest_match(0, len(string1), 0, len(string2))

print(match)  # -> Match(a=0, b=15, size=9)
print(string1[match.a: match.a + match.size])  # -> apple pie
print(string2[match.b: match.b + match.size])  # -> apple pie

Answer 9

回答by user7733798

Fix bugs with the first's answer:

使用第一个答案修复错误：

def longestSubstringFinder(string1, string2):
    answer = ""
    len1, len2 = len(string1), len(string2)
    for i in range(len1):
        for j in range(len2):
            lcs_temp=0
            match=''
            while ((i+lcs_temp < len1) and (j+lcs_temp<len2) and string1[i+lcs_temp] == string2[j+lcs_temp]):
                match += string2[j+lcs_temp]
                lcs_temp+=1
            if (len(match) > len(answer)):
                answer = match
    return answer

print longestSubstringFinder("dd apple pie available", "apple pies")
print longestSubstringFinder("cov_basic_as_cov_x_gt_y_rna_genes_w1000000", "cov_rna15pcs_as_cov_x_gt_y_rna_genes_w1000000")
print longestSubstringFinder("bapples", "cappleses")
print longestSubstringFinder("apples", "apples")

Answer 10

回答by radhikesh93

def matchingString(x,y):
    match=''
    for i in range(0,len(x)):
        for j in range(0,len(y)):
            k=1
            # now applying while condition untill we find a substring match and length of substring is less than length of x and y
            while (i+k <= len(x) and j+k <= len(y) and x[i:i+k]==y[j:j+k]):
                if len(match) <= len(x[i:i+k]):
                   match = x[i:i+k]
                k=k+1
    return match  

print matchingString('apple','ale') #le
print matchingString('apple pie available','apple pies') #apple pie

Python 查找两个字符串之间的公共子字符串

提问by NorthSide

采纳答案by thefourtheye

回答by Eric

回答by Birei

回答by modulus

回答by Rali Tsanova

回答by SergeyR

回答by wwii

回答by RickardSjogren

回答by user7733798

回答by radhikesh93

相关推荐

最近更新

标签

Python 查找两个字符串之间的公共子字符串

提问by NorthSide

采纳答案by thefourtheye

回答by Eric

回答by Birei

回答by modulus

回答by Rali Tsanova

回答by SergeyR

回答by wwii

回答by RickardSjogren

回答by user7733798

回答by radhikesh93

相关推荐

Python 来自 url 的 Pandas read_csv

Python -1 在 numpy reshape 中是什么意思？

\r（回车）如何在 Python 中工作

Python 如何在 scikit-learn 中创建/自定义您自己的评分器功能？

相关推荐

最近更新

标签