如何判断字符串是否在 Python 中重复?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/29481088/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-19 04:38:37  来源:igfitidea点击:

How can I tell if a string repeats itself in Python?

pythonstringpattern-matching

提问by John

I'm looking for a way to test whether or not a given string repeats itself for the entire string or not.

我正在寻找一种方法来测试给定的字符串是否在整个字符串中重复。

Examples:

例子:

[
    '0045662100456621004566210045662100456621',             # '00456621'
    '0072992700729927007299270072992700729927',             # '00729927'
    '001443001443001443001443001443001443001443',           # '001443'
    '037037037037037037037037037037037037037037037',        # '037'
    '047619047619047619047619047619047619047619',           # '047619'
    '002457002457002457002457002457002457002457',           # '002457'
    '001221001221001221001221001221001221001221',           # '001221'
    '001230012300123001230012300123001230012300123',        # '00123'
    '0013947001394700139470013947001394700139470013947',    # '0013947'
    '001001001001001001001001001001001001001001001001001',  # '001'
    '001406469760900140646976090014064697609',              # '0014064697609'
]

are strings which repeat themselves, and

是重复自己的字符串,并且

[
    '004608294930875576036866359447',
    '00469483568075117370892018779342723',
    '004739336492890995260663507109',
    '001508295625942684766214177978883861236802413273',
    '007518796992481203',
    '0071942446043165467625899280575539568345323741',
    '0434782608695652173913',
    '0344827586206896551724137931',
    '002481389578163771712158808933',
    '002932551319648093841642228739',
    '0035587188612099644128113879',
    '003484320557491289198606271777',
    '00115074798619102416570771',
]

are examples of ones that do not.

是不这样做的例子。

The repeating sections of the strings I'm given can be quite long, and the strings themselves can be 500 or more characters, so looping through each character trying to build a pattern then checking the pattern vs the rest of the string seems awful slow. Multiply that by potentially hundreds of strings and I can't see any intuitive solution.

我给出的字符串的重复部分可能很长,并且字符串本身可以是 500 个或更多字符,因此循环遍历每个字符尝试构建模式然后检查模式与字符串的其余部分似乎非常慢。乘以数百个字符串,我看不到任何直观的解决方案。

I've looked into regexes a bit and they seem good for when you know what you're looking for, or at least the length of the pattern you're looking for. Unfortunately, I know neither.

我对正则表达式进行了一些研究,当您知道要查找的内容或至少知道要查找的模式的长度时,它们似乎很有用。不幸的是,我都不知道。

How can I tell if a string is repeating itself and if it is, what the shortest repeating subsequence is?

如何判断一个字符串是否在重复自己,如果是,最短的重复子序列是什么?

采纳答案by David Zhang

Here's a concise solution which avoids regular expressions and slow in-Python loops:

这是一个简洁的解决方案,它避免了正则表达式和缓慢的 Python 循环:

def principal_period(s):
    i = (s+s).find(s, 1, -1)
    return None if i == -1 else s[:i]

See the Community Wiki answerstarted by @davidism for benchmark results. In summary,

有关基准测试结果,请参阅由 @davidism 发起的社区 Wiki 答案。总之,

David Zhang's solution is the clear winner, outperforming all others by at least 5x for the large example set.

David Zhang 的解决方案显然是赢家,对于大型示例集,其性能至少比其他所有解决方案高 5 倍。

(That answer's words, not mine.)

(那个答案的话,不是我的。)

This is based on the observation that a string is periodic if and only if it is equal to a nontrivial rotation of itself. Kudos to @AleksiTorhamo for realizing that we can then recover the principal period from the index of the first occurrence of sin (s+s)[1:-1], and for informing me of the optional startand endarguments of Python's string.find.

这是基于观察到一个字符串是周期性的,当且仅当它等于自身的非平凡旋转。荣誉对@AleksiTorhamo实现,我们可以再从第一次出现的索引收回本金周期s(s+s)[1:-1],并通知我可选的startendPython的论点string.find

回答by Zero Piraeus

Here's a solution using regular expressions.

这是使用正则表达式的解决方案。

import re

REPEATER = re.compile(r"(.+?)+$")

def repeated(s):
    match = REPEATER.match(s)
    return match.group(1) if match else None

Iterating over the examples in the question:

迭代问题中的示例:

examples = [
    '0045662100456621004566210045662100456621',
    '0072992700729927007299270072992700729927',
    '001443001443001443001443001443001443001443',
    '037037037037037037037037037037037037037037037',
    '047619047619047619047619047619047619047619',
    '002457002457002457002457002457002457002457',
    '001221001221001221001221001221001221001221',
    '001230012300123001230012300123001230012300123',
    '0013947001394700139470013947001394700139470013947',
    '001001001001001001001001001001001001001001001001001',
    '001406469760900140646976090014064697609',
    '004608294930875576036866359447',
    '00469483568075117370892018779342723',
    '004739336492890995260663507109',
    '001508295625942684766214177978883861236802413273',
    '007518796992481203',
    '0071942446043165467625899280575539568345323741',
    '0434782608695652173913',
    '0344827586206896551724137931',
    '002481389578163771712158808933',
    '002932551319648093841642228739',
    '0035587188612099644128113879',
    '003484320557491289198606271777',
    '00115074798619102416570771',
]

for e in examples:
    sub = repeated(e)
    if sub:
        print("%r: %r" % (e, sub))
    else:
        print("%r does not repeat." % e)

... produces this output:

...产生这个输出:

'0045662100456621004566210045662100456621': '00456621'
'0072992700729927007299270072992700729927': '00729927'
'001443001443001443001443001443001443001443': '001443'
'037037037037037037037037037037037037037037037': '037'
'047619047619047619047619047619047619047619': '047619'
'002457002457002457002457002457002457002457': '002457'
'001221001221001221001221001221001221001221': '001221'
'001230012300123001230012300123001230012300123': '00123'
'0013947001394700139470013947001394700139470013947': '0013947'
'001001001001001001001001001001001001001001001001001': '001'
'001406469760900140646976090014064697609': '0014064697609'
'004608294930875576036866359447' does not repeat.
'00469483568075117370892018779342723' does not repeat.
'004739336492890995260663507109' does not repeat.
'001508295625942684766214177978883861236802413273' does not repeat.
'007518796992481203' does not repeat.
'0071942446043165467625899280575539568345323741' does not repeat.
'0434782608695652173913' does not repeat.
'0344827586206896551724137931' does not repeat.
'002481389578163771712158808933' does not repeat.
'002932551319648093841642228739' does not repeat.
'0035587188612099644128113879' does not repeat.
'003484320557491289198606271777' does not repeat.
'00115074798619102416570771' does not repeat.

The regular expression (.+?)\1+$is divided into three parts:

正则表达式(.+?)\1+$分为三部分:

  1. (.+?)is a matching group containing at least one (but as few as possible) of any character (because +?is non-greedy).

  2. \1+checks for at least one repetition of the matching group in the first part.

  3. $checks for the end of the string, to ensure that there's no extra, non-repeating content after the repeated substrings (and using re.match()ensures that there's no non-repeating text beforethe repeated substrings).

  1. (.+?)是包含至少一个(但尽可能少)任何字符的匹配组(因为+?是 non-greedy)。

  2. \1+检查第一部分中匹配组的至少一个重复。

  3. $检查字符串的结尾,以确保在重复子字符串之后没有额外的非重复内容(并且使用re.match()确保在重复子字符串之前没有非重复文本)。

In Python 3.4 and later, you could drop the $and use re.fullmatch()instead, or (in any Python at least as far back as 2.3) go the other way and use re.search()with the regex ^(.+?)\1+$, all of which are more down to personal taste than anything else.

在 Python 3.4 及更高版本中,您可以放弃$and use re.fullmatch(),或者(在任何 Python 中至少早在 2.3 中)转而使用re.search()regex ^(.+?)\1+$,所有这些都更取决于个人喜好而不是其他任何东西。

回答by TigerhawkT3

Non-regex solution:

非正则表达式解决方案:

def repeat(string):
    for i in range(1, len(string)//2+1):
        if not len(string)%len(string[0:i]) and string[0:i]*(len(string)//len(string[0:i])) == string:
            return string[0:i]

Faster non-regex solution, thanks to @ThatWeirdo (see comments):

更快的非正则表达式解决方案,感谢@ThatWeirdo(见评论):

def repeat(string):
    l = len(string)
    for i in range(1, len(string)//2+1):
        if l%i: continue
        s = string[0:i]
        if s*(l//i) == string:
            return s

The above solution is very rarely slower than the original by a few percent, but it's usually a good bit faster - sometimes a whole lot faster. It's still not faster than davidism's for longer strings, and zero's regex solution is superior for short strings. It comes out to the fastest (according to davidism's test on github - see his answer) with strings of about 1000-1500 characters. Regardless, it's reliably second-fastest (or better) in all cases I tested. Thanks, ThatWeirdo.

上面的解决方案很少比原始解决方案慢几个百分点,但通常要快一些 - 有时要快得多。对于较长的字符串,它仍然不比 davidism 快,并且对于短字符串,零的正则表达式解决方案更好。它的速度最快(根据 davidism 在 github 上的测试 - 请参阅他的回答),字符串大约为 1000-1500 个字符。无论如何,在我测试的所有情况下,它都是可靠的第二快(或更好)。谢谢,那个怪人。

Test:

测试:

print(repeat('009009009'))
print(repeat('254725472547'))
print(repeat('abcdeabcdeabcdeabcde'))
print(repeat('abcdefg'))
print(repeat('09099099909999'))
print(repeat('02589675192'))

Results:

结果:

009
2547
abcde
None
None
None

回答by Shashank

You can make the observation that for a string to be considered repeating, its length must be divisible by the length of its repeated sequence. Given that, here is a solution that generates divisors of the length from 1to n / 2inclusive, divides the original string into substrings with the length of the divisors, and tests the equality of the result set:

您可以观察到,对于被视为重复的字符串,其长度必须能被其重复序列的长度整除。鉴于此,这里有一个解决方案,它生成长度为 from1n / 2inclusive 的除数,将原始字符串划分为具有除数长度的子字符串,并测试结果集的相等性:

from math import sqrt, floor

def divquot(n):
    if n > 1:
        yield 1, n
    swapped = []
    for d in range(2, int(floor(sqrt(n))) + 1):
        q, r = divmod(n, d)
        if r == 0:
            yield d, q
            swapped.append((q, d))
    while swapped:
        yield swapped.pop()

def repeats(s):
    n = len(s)
    for d, q in divquot(n):
        sl = s[0:d]
        if sl * q == s:
            return sl
    return None

EDIT:In Python 3, the /operator has changed to do float division by default. To get the intdivision from Python 2, you can use the //operator instead. Thank you to @TigerhawkT3 for bringing this to my attention.

编辑:在 Python 3 中,/运算符已更改为默认进行浮点除法。要从intPython 2 中获取除法,您可以改用//运算符。感谢@TigerhawkT3 引起我的注意。

The //operator performs integer division in both Python 2 and Python 3, so I've updated the answer to support both versions. The part where we test to see if all the substrings are equal is now a short-circuiting operation using alland a generator expression.

//运营商执行整数除法两个Python 2和Python 3中,所以我已经更新了答案,以支持这两个版本。我们测试以查看所有子字符串是否相等的部分现在是使用all和生成器表达式的短路操作。

UPDATE:In response to a change in the original question, the code has now been updated to return the smallest repeating substring if it exists and Noneif it does not. @godlygeek has suggested using divmodto reduce the number of iterations on the divisorsgenerator, and the code has been updated to match that as well. It now returns all positive divisors of nin ascending order, exclusive of nitself.

更新:为了响应原始问题的更改,现在已更新代码以返回最小的重复子字符串(如果存在,None如果不存在)。@godlygeek 建议使用divmod来减少divisors生成器上的迭代次数,并且代码也已更新以匹配该次数。它现在n按升序返回所有正除数,不包括n自身。

Further update for high performance:After multiple tests, I've come to the conclusion that simply testing for string equality has the best performance out of any slicing or iterator solution in Python. Thus, I've taken a leaf out of @TigerhawkT3 's book and updated my solution. It's now over 6x as fast as before, noticably faster than Tigerhawk's solution but slower than David's.

高性能的进一步更新:经过多次测试,我得出的结论是,在 Python 中的任何切片或迭代器解决方案中,简单地测试字符串相等性具有最佳性能。因此,我从@TigerhawkT3 的书中摘下了一页并更新了我的解决方案。它现在比以前快 6 倍多,明显比 Tigerhawk 的解决方案快,但比 David 的慢。

回答by Saksham Varma

Here's a straight forward solution, without regexes.

这是一个直接的解决方案,没有正则表达式。

For substrings of sstarting from zeroth index, of lengths 1 through len(s), check if that substring, substris the repeated pattern. This check can be performed by concatenating substrwith itself ratiotimes, such that the length of the string thus formed is equal to the length of s. Hence ratio=len(s)/len(substr).

对于s从第零个索引开始、长度为 1 到 的len(s)子字符串,检查该子字符串substr是否是重复模式。这种检查可以通过substr与自身的ratio时间连接来执行,这样形成的字符串的长度等于 的长度s。因此ratio=len(s)/len(substr)

Return when first such substring is found. This would provide the smallest possible substring, if one exists.

当找到第一个这样的子字符串时返回。如果存在的话,这将提供最小的可能子串。

def check_repeat(s):
    for i in range(1, len(s)):
        substr = s[:i]
        ratio = len(s)/len(substr)
        if substr * ratio == s:
            print 'Repeating on "%s"' % substr
            return
    print 'Non repeating'

>>> check_repeat('254725472547')
Repeating on "2547"
>>> check_repeat('abcdeabcdeabcdeabcde')
Repeating on "abcde"

回答by davidism

First, halve the string as long as it's a "2 part" duplicate. This reduces the search space if there are an even number of repeats. Then, working forwards to find the smallest repeating string, check if splitting the full string by increasingly larger sub-string results in only empty values. Only sub-strings up to length // 2need to be tested since anything over that would have no repeats.

首先,将字符串减半,只要它是“2 部分”重复项。如果重复次数为偶数,则这会减少搜索空间。然后,继续寻找最小的重复字符串,检查通过越来越大的子字符串拆分完整字符串是否只产生空值。只length // 2需要测试最多的子字符串,因为超过的任何内容都不会重复。

def shortest_repeat(orig_value):
    if not orig_value:
        return None

    value = orig_value

    while True:
        len_half = len(value) // 2
        first_half = value[:len_half]

        if first_half != value[len_half:]:
            break

        value = first_half

    len_value = len(value)
    split = value.split

    for i in (i for i in range(1, len_value // 2) if len_value % i == 0):
        if not any(split(value[:i])):
            return value[:i]

    return value if value != orig_value else None

This returns the shortest match or None if there is no match.

如果没有匹配,则返回最短匹配或 None 。

回答by davidism

Here are some benchmarks for the various answers to this question. There were some surprising results, including wildly different performance depending on the string being tested.

以下是针对此问题的各种答案的一些基准。有一些令人惊讶的结果,包括根据被测试字符串的不同表现。

Some functions were modified to work with Python 3 (mainly by replacing /with //to ensure integer division). If you see something wrong, want to add your function, or want to add another test string, ping @ZeroPiraeus in the Python chatroom.

一些函数被修改为与 Python 3 一起使用(主要是通过替换/with//以确保整数除法)。如果您发现有问题,想要添加您的函数,或者想要添加另一个测试字符串,请在Python 聊天室ping @ZeroPiraeus 。

In summary: there's about a 50x difference between the best- and worst-performing solutions for the large set of example data supplied by OP here(via thiscomment). David Zhang's solutionis the clear winner, outperforming all others by around 5x for the large example set.

总而言之:对于 OP此处提供的大量示例数据(通过评论),性能最佳和最差的解决方案之间存在大约 50 倍的差异。David Zhang 的解决方案显然是赢家,对于大型示例集,其性能比其他所有解决方案高出约 5 倍。

A couple of the answers are veryslow in extremely large "no match" cases. Otherwise, the functions seem to be equally matched or clear winners depending on the test.

在非常大的“不匹配”情况下,有几个答案非常缓慢。否则,根据测试,这些功能似乎是同等匹配或明显的赢家。

Here are the results, including plots made using matplotlib and seaborn to show the different distributions:

以下是结果,包括使用 matplotlib 和 seaborn 绘制的图以显示不同的分布:



Corpus 1 (supplied examples - small set)

语料库 1(提供的示例 - 小集)

mean performance:
 0.0003  david_zhang
 0.0009  zero
 0.0013  antti
 0.0013  tigerhawk_2
 0.0015  carpetpython
 0.0029  tigerhawk_1
 0.0031  davidism
 0.0035  saksham
 0.0046  shashank
 0.0052  riad
 0.0056  piotr

median performance:
 0.0003  david_zhang
 0.0008  zero
 0.0013  antti
 0.0013  tigerhawk_2
 0.0014  carpetpython
 0.0027  tigerhawk_1
 0.0031  davidism
 0.0038  saksham
 0.0044  shashank
 0.0054  riad
 0.0058  piotr

Corpus 1 graph

语料库 1 图



Corpus 2 (supplied examples - large set)

语料库 2(提供的示例 - 大集)

mean performance:
 0.0006  david_zhang
 0.0036  tigerhawk_2
 0.0036  antti
 0.0037  zero
 0.0039  carpetpython
 0.0052  shashank
 0.0056  piotr
 0.0066  davidism
 0.0120  tigerhawk_1
 0.0177  riad
 0.0283  saksham

median performance:
 0.0004  david_zhang
 0.0018  zero
 0.0022  tigerhawk_2
 0.0022  antti
 0.0024  carpetpython
 0.0043  davidism
 0.0049  shashank
 0.0055  piotr
 0.0061  tigerhawk_1
 0.0077  riad
 0.0109  saksham

Corpus 1 graph

语料库 1 图



Corpus 3 (edge cases)

语料库 3(边缘情况)

mean performance:
 0.0123  shashank
 0.0375  david_zhang
 0.0376  piotr
 0.0394  carpetpython
 0.0479  antti
 0.0488  tigerhawk_2
 0.2269  tigerhawk_1
 0.2336  davidism
 0.7239  saksham
 3.6265  zero
 6.0111  riad

median performance:
 0.0107  tigerhawk_2
 0.0108  antti
 0.0109  carpetpython
 0.0135  david_zhang
 0.0137  tigerhawk_1
 0.0150  shashank
 0.0229  saksham
 0.0255  piotr
 0.0721  davidism
 0.1080  zero
 1.8539  riad

Corpus 3 graph

语料库3图



The tests and raw results are available here.

测试和原始结果可在此处获得

回答by Antti Haapala

This version tries only those candidate sequence lengths that are factors of the string length; and uses the *operator to build a full-length string from the candidate sequence:

此版本仅尝试作为字符串长度因子的候选序列长度;并使用*运算符从候选序列构建全长字符串:

def get_shortest_repeat(string):
    length = len(string)
    for i in range(1, length // 2 + 1):
        if length % i:  # skip non-factors early
            continue

        candidate = string[:i]
        if string == candidate * (length // i):
            return candidate

    return None

Thanks to TigerhawkT3 for noticing that length // 2without + 1would fail to match the ababcase.

感谢 TigerhawkT3 注意到length // 2without+ 1将无法匹配abab案例。

回答by RiaD

The problem may also be solved in O(n)in worst case with prefix function.

这个问题也可以O(n)在最坏的情况下用前缀函数解决。

Note, it may be slower in general case(UPD: and is much slower) than other solutions which depend on number of divisors of n, but usually find fails sooner, I think one of bad cases for them will be aaa....aab, where there are n - 1 = 2 * 3 * 5 * 7 ... *p_n - 1a's

请注意,在一般情况下(UPD:并且慢得多),它可能比其他取决于 的除数的解决方案慢n,但通常会更快地发现失败,我认为对他们来说不好的情况之一是aaa....aab,那里有n - 1 = 2 * 3 * 5 * 7 ... *p_n - 1a's

First of all you need to calculate prefix function

首先你需要计算前缀函数

def prefix_function(s):
    n = len(s)
    pi = [0] * n
    for i in xrange(1, n):
        j = pi[i - 1]
        while(j > 0 and s[i] != s[j]):
            j = pi[j - 1]
        if (s[i] == s[j]):
            j += 1
        pi[i] = j;
    return pi

then either there's no answer or the shortest period is

那么要么没有答案,要么最短的时间是

k = len(s) - prefix_function(s[-1])

and you just have to check if k != n and n % k == 0(if k != n and n % k == 0then answer is s[:k], else there's no answer

你只需要检查是否k != n and n % k == 0(如果k != n and n % k == 0那么答案是s[:k],否则没有答案

You may check the proof here(in Russian, but online translator will probably do the trick)

您可以在此处查看证明(俄语,但在线翻译可能会成功)

def riad(s):
    n = len(s)
    pi = [0] * n
    for i in xrange(1, n):
        j = pi[i - 1]
        while(j > 0 and s[i] != s[j]):
            j = pi[j - 1]
        if (s[i] == s[j]):
            j += 1
        pi[i] = j;
    k = n - pi[-1]
    return s[:k] if (n != k and n % k == 0) else None

回答by Logic Knight

I started with more than eight solutions to this problem. Some were bases on regex (match, findall, split), some of string slicing and testing, and some with string methods (find, count, split). Each had benefits in code clarity, code size, speed and memory consumption. I was going to post my answer here when I noticed that execution speed was ranked as important, so I did more testing and improvement to arrive at this:

我从超过 8 个解决这个问题的方法开始。有些基于正则表达式(match、findall、split),有些基于字符串切片和测试,有些基于字符串方法(find、count、split)。每个都在代码清晰度、代码大小、速度和内存消耗方面都有好处。当我注意到执行速度被列为重要时,我打算在这里发布我的答案,所以我做了更多的测试和改进来达到这个目的:

def repeating(s):
    size = len(s)
    incr = size % 2 + 1
    for n in xrange(1, size//2+1, incr):
        if size % n == 0:
            if s[:n] * (size//n) == s:
                return s[:n]

This answer seems similar to a few other answers here, but it has a few speed optimisations others have not used:

这个答案似乎与这里的其他一些答案相似,但它有一些其他人没有使用过的速度优化:

  • xrangeis a little faster in this application,
  • if an input string is an odd length, do not check any even length substrings,
  • by using s[:n]directly, we avoid creating a variable in each loop.
  • xrange在此应用程序中速度稍快,
  • 如果输入字符串是奇数长度,则不检查任何偶数长度子字符串,
  • 通过s[:n]直接使用,我们避免在每个循环中创建一个变量。

I would be interested to see how this performs in the standard tests with common hardware. I believe it will be well short of David Zhang's excellent algorithm in most tests, but should be quite fast otherwise.

我很想看看它在使用通用硬件的标准测试中的表现如何。我相信在大多数测试中,它远不及 David Zhang 的优秀算法,但否则应该会很快。

I found this problem to be very counter-intuitive. The solutions I thought would be fast were slow. The solutions that looked slow were fast! It seems that Python's string creation with the multiply operator and string comparisons are highly optimised.

我发现这个问题非常违反直觉。我认为很快的解决方案很慢。看起来很慢的解决方案很快!似乎 Python 使用乘法运算符和字符串比较创建字符串是高度优化的。