如何判断字符串是否在 Python 中重复?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/29481088/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
How can I tell if a string repeats itself in Python?
提问by John
I'm looking for a way to test whether or not a given string repeats itself for the entire string or not.
我正在寻找一种方法来测试给定的字符串是否在整个字符串中重复。
Examples:
例子:
[
'0045662100456621004566210045662100456621', # '00456621'
'0072992700729927007299270072992700729927', # '00729927'
'001443001443001443001443001443001443001443', # '001443'
'037037037037037037037037037037037037037037037', # '037'
'047619047619047619047619047619047619047619', # '047619'
'002457002457002457002457002457002457002457', # '002457'
'001221001221001221001221001221001221001221', # '001221'
'001230012300123001230012300123001230012300123', # '00123'
'0013947001394700139470013947001394700139470013947', # '0013947'
'001001001001001001001001001001001001001001001001001', # '001'
'001406469760900140646976090014064697609', # '0014064697609'
]
are strings which repeat themselves, and
是重复自己的字符串,并且
[
'004608294930875576036866359447',
'00469483568075117370892018779342723',
'004739336492890995260663507109',
'001508295625942684766214177978883861236802413273',
'007518796992481203',
'0071942446043165467625899280575539568345323741',
'0434782608695652173913',
'0344827586206896551724137931',
'002481389578163771712158808933',
'002932551319648093841642228739',
'0035587188612099644128113879',
'003484320557491289198606271777',
'00115074798619102416570771',
]
are examples of ones that do not.
是不这样做的例子。
The repeating sections of the strings I'm given can be quite long, and the strings themselves can be 500 or more characters, so looping through each character trying to build a pattern then checking the pattern vs the rest of the string seems awful slow. Multiply that by potentially hundreds of strings and I can't see any intuitive solution.
我给出的字符串的重复部分可能很长,并且字符串本身可以是 500 个或更多字符,因此循环遍历每个字符尝试构建模式然后检查模式与字符串的其余部分似乎非常慢。乘以数百个字符串,我看不到任何直观的解决方案。
I've looked into regexes a bit and they seem good for when you know what you're looking for, or at least the length of the pattern you're looking for. Unfortunately, I know neither.
我对正则表达式进行了一些研究,当您知道要查找的内容或至少知道要查找的模式的长度时,它们似乎很有用。不幸的是,我都不知道。
How can I tell if a string is repeating itself and if it is, what the shortest repeating subsequence is?
如何判断一个字符串是否在重复自己,如果是,最短的重复子序列是什么?
采纳答案by David Zhang
Here's a concise solution which avoids regular expressions and slow in-Python loops:
这是一个简洁的解决方案,它避免了正则表达式和缓慢的 Python 循环:
def principal_period(s):
i = (s+s).find(s, 1, -1)
return None if i == -1 else s[:i]
See the Community Wiki answerstarted by @davidism for benchmark results. In summary,
有关基准测试结果,请参阅由 @davidism 发起的社区 Wiki 答案。总之,
David Zhang's solution is the clear winner, outperforming all others by at least 5x for the large example set.
David Zhang 的解决方案显然是赢家,对于大型示例集,其性能至少比其他所有解决方案高 5 倍。
(That answer's words, not mine.)
(那个答案的话,不是我的。)
This is based on the observation that a string is periodic if and only if it is equal to a nontrivial rotation of itself. Kudos to @AleksiTorhamo for realizing that we can then recover the principal period from the index of the first occurrence of s
in (s+s)[1:-1]
, and for informing me of the optional start
and end
arguments of Python's string.find
.
这是基于观察到一个字符串是周期性的,当且仅当它等于自身的非平凡旋转。荣誉对@AleksiTorhamo实现,我们可以再从第一次出现的索引收回本金周期s
中(s+s)[1:-1]
,并通知我可选的start
和end
Python的论点string.find
。
回答by Zero Piraeus
Here's a solution using regular expressions.
这是使用正则表达式的解决方案。
import re
REPEATER = re.compile(r"(.+?)+$")
def repeated(s):
match = REPEATER.match(s)
return match.group(1) if match else None
Iterating over the examples in the question:
迭代问题中的示例:
examples = [
'0045662100456621004566210045662100456621',
'0072992700729927007299270072992700729927',
'001443001443001443001443001443001443001443',
'037037037037037037037037037037037037037037037',
'047619047619047619047619047619047619047619',
'002457002457002457002457002457002457002457',
'001221001221001221001221001221001221001221',
'001230012300123001230012300123001230012300123',
'0013947001394700139470013947001394700139470013947',
'001001001001001001001001001001001001001001001001001',
'001406469760900140646976090014064697609',
'004608294930875576036866359447',
'00469483568075117370892018779342723',
'004739336492890995260663507109',
'001508295625942684766214177978883861236802413273',
'007518796992481203',
'0071942446043165467625899280575539568345323741',
'0434782608695652173913',
'0344827586206896551724137931',
'002481389578163771712158808933',
'002932551319648093841642228739',
'0035587188612099644128113879',
'003484320557491289198606271777',
'00115074798619102416570771',
]
for e in examples:
sub = repeated(e)
if sub:
print("%r: %r" % (e, sub))
else:
print("%r does not repeat." % e)
... produces this output:
...产生这个输出:
'0045662100456621004566210045662100456621': '00456621'
'0072992700729927007299270072992700729927': '00729927'
'001443001443001443001443001443001443001443': '001443'
'037037037037037037037037037037037037037037037': '037'
'047619047619047619047619047619047619047619': '047619'
'002457002457002457002457002457002457002457': '002457'
'001221001221001221001221001221001221001221': '001221'
'001230012300123001230012300123001230012300123': '00123'
'0013947001394700139470013947001394700139470013947': '0013947'
'001001001001001001001001001001001001001001001001001': '001'
'001406469760900140646976090014064697609': '0014064697609'
'004608294930875576036866359447' does not repeat.
'00469483568075117370892018779342723' does not repeat.
'004739336492890995260663507109' does not repeat.
'001508295625942684766214177978883861236802413273' does not repeat.
'007518796992481203' does not repeat.
'0071942446043165467625899280575539568345323741' does not repeat.
'0434782608695652173913' does not repeat.
'0344827586206896551724137931' does not repeat.
'002481389578163771712158808933' does not repeat.
'002932551319648093841642228739' does not repeat.
'0035587188612099644128113879' does not repeat.
'003484320557491289198606271777' does not repeat.
'00115074798619102416570771' does not repeat.
The regular expression (.+?)\1+$
is divided into three parts:
正则表达式(.+?)\1+$
分为三部分:
(.+?)
is a matching group containing at least one (but as few as possible) of any character (because+?
is non-greedy).\1+
checks for at least one repetition of the matching group in the first part.$
checks for the end of the string, to ensure that there's no extra, non-repeating content after the repeated substrings (and usingre.match()
ensures that there's no non-repeating text beforethe repeated substrings).
(.+?)
是包含至少一个(但尽可能少)任何字符的匹配组(因为+?
是 non-greedy)。\1+
检查第一部分中匹配组的至少一个重复。$
检查字符串的结尾,以确保在重复子字符串之后没有额外的非重复内容(并且使用re.match()
确保在重复子字符串之前没有非重复文本)。
In Python 3.4 and later, you could drop the $
and use re.fullmatch()
instead, or (in any Python at least as far back as 2.3) go the other way and use re.search()
with the regex ^(.+?)\1+$
, all of which are more down to personal taste than anything else.
在 Python 3.4 及更高版本中,您可以放弃$
and use re.fullmatch()
,或者(在任何 Python 中至少早在 2.3 中)转而使用re.search()
regex ^(.+?)\1+$
,所有这些都更取决于个人喜好而不是其他任何东西。
回答by TigerhawkT3
Non-regex solution:
非正则表达式解决方案:
def repeat(string):
for i in range(1, len(string)//2+1):
if not len(string)%len(string[0:i]) and string[0:i]*(len(string)//len(string[0:i])) == string:
return string[0:i]
Faster non-regex solution, thanks to @ThatWeirdo (see comments):
更快的非正则表达式解决方案,感谢@ThatWeirdo(见评论):
def repeat(string):
l = len(string)
for i in range(1, len(string)//2+1):
if l%i: continue
s = string[0:i]
if s*(l//i) == string:
return s
The above solution is very rarely slower than the original by a few percent, but it's usually a good bit faster - sometimes a whole lot faster. It's still not faster than davidism's for longer strings, and zero's regex solution is superior for short strings. It comes out to the fastest (according to davidism's test on github - see his answer) with strings of about 1000-1500 characters. Regardless, it's reliably second-fastest (or better) in all cases I tested. Thanks, ThatWeirdo.
上面的解决方案很少比原始解决方案慢几个百分点,但通常要快一些 - 有时要快得多。对于较长的字符串,它仍然不比 davidism 快,并且对于短字符串,零的正则表达式解决方案更好。它的速度最快(根据 davidism 在 github 上的测试 - 请参阅他的回答),字符串大约为 1000-1500 个字符。无论如何,在我测试的所有情况下,它都是可靠的第二快(或更好)。谢谢,那个怪人。
Test:
测试:
print(repeat('009009009'))
print(repeat('254725472547'))
print(repeat('abcdeabcdeabcdeabcde'))
print(repeat('abcdefg'))
print(repeat('09099099909999'))
print(repeat('02589675192'))
Results:
结果:
009
2547
abcde
None
None
None
回答by Shashank
You can make the observation that for a string to be considered repeating, its length must be divisible by the length of its repeated sequence. Given that, here is a solution that generates divisors of the length from 1
to n / 2
inclusive, divides the original string into substrings with the length of the divisors, and tests the equality of the result set:
您可以观察到,对于被视为重复的字符串,其长度必须能被其重复序列的长度整除。鉴于此,这里有一个解决方案,它生成长度为 from1
到n / 2
inclusive 的除数,将原始字符串划分为具有除数长度的子字符串,并测试结果集的相等性:
from math import sqrt, floor
def divquot(n):
if n > 1:
yield 1, n
swapped = []
for d in range(2, int(floor(sqrt(n))) + 1):
q, r = divmod(n, d)
if r == 0:
yield d, q
swapped.append((q, d))
while swapped:
yield swapped.pop()
def repeats(s):
n = len(s)
for d, q in divquot(n):
sl = s[0:d]
if sl * q == s:
return sl
return None
EDIT:In Python 3, the /
operator has changed to do float division by default. To get the int
division from Python 2, you can use the //
operator instead. Thank you to @TigerhawkT3 for bringing this to my attention.
编辑:在 Python 3 中,/
运算符已更改为默认进行浮点除法。要从int
Python 2 中获取除法,您可以改用//
运算符。感谢@TigerhawkT3 引起我的注意。
The //
operator performs integer division in both Python 2 and Python 3, so I've updated the answer to support both versions. The part where we test to see if all the substrings are equal is now a short-circuiting operation using all
and a generator expression.
该//
运营商执行整数除法两个Python 2和Python 3中,所以我已经更新了答案,以支持这两个版本。我们测试以查看所有子字符串是否相等的部分现在是使用all
和生成器表达式的短路操作。
UPDATE:In response to a change in the original question, the code has now been updated to return the smallest repeating substring if it exists and None
if it does not. @godlygeek has suggested using divmod
to reduce the number of iterations on the divisors
generator, and the code has been updated to match that as well. It now returns all positive divisors of n
in ascending order, exclusive of n
itself.
更新:为了响应原始问题的更改,现在已更新代码以返回最小的重复子字符串(如果存在,None
如果不存在)。@godlygeek 建议使用divmod
来减少divisors
生成器上的迭代次数,并且代码也已更新以匹配该次数。它现在n
按升序返回所有正除数,不包括n
自身。
Further update for high performance:After multiple tests, I've come to the conclusion that simply testing for string equality has the best performance out of any slicing or iterator solution in Python. Thus, I've taken a leaf out of @TigerhawkT3 's book and updated my solution. It's now over 6x as fast as before, noticably faster than Tigerhawk's solution but slower than David's.
高性能的进一步更新:经过多次测试,我得出的结论是,在 Python 中的任何切片或迭代器解决方案中,简单地测试字符串相等性具有最佳性能。因此,我从@TigerhawkT3 的书中摘下了一页并更新了我的解决方案。它现在比以前快 6 倍多,明显比 Tigerhawk 的解决方案快,但比 David 的慢。
回答by Saksham Varma
Here's a straight forward solution, without regexes.
这是一个直接的解决方案,没有正则表达式。
For substrings of s
starting from zeroth index, of lengths 1 through len(s)
, check if that substring, substr
is the repeated pattern. This check can be performed by concatenating substr
with itself ratio
times, such that the length of the string thus formed is equal to the length of s
. Hence ratio=len(s)/len(substr)
.
对于s
从第零个索引开始、长度为 1 到 的len(s)
子字符串,检查该子字符串substr
是否是重复模式。这种检查可以通过substr
与自身的ratio
时间连接来执行,这样形成的字符串的长度等于 的长度s
。因此ratio=len(s)/len(substr)
。
Return when first such substring is found. This would provide the smallest possible substring, if one exists.
当找到第一个这样的子字符串时返回。如果存在的话,这将提供最小的可能子串。
def check_repeat(s):
for i in range(1, len(s)):
substr = s[:i]
ratio = len(s)/len(substr)
if substr * ratio == s:
print 'Repeating on "%s"' % substr
return
print 'Non repeating'
>>> check_repeat('254725472547')
Repeating on "2547"
>>> check_repeat('abcdeabcdeabcdeabcde')
Repeating on "abcde"
回答by davidism
First, halve the string as long as it's a "2 part" duplicate. This reduces the search space if there are an even number of repeats. Then, working forwards to find the smallest repeating string, check if splitting the full string by increasingly larger sub-string results in only empty values. Only sub-strings up to length // 2
need to be tested since anything over that would have no repeats.
首先,将字符串减半,只要它是“2 部分”重复项。如果重复次数为偶数,则这会减少搜索空间。然后,继续寻找最小的重复字符串,检查通过越来越大的子字符串拆分完整字符串是否只产生空值。只length // 2
需要测试最多的子字符串,因为超过的任何内容都不会重复。
def shortest_repeat(orig_value):
if not orig_value:
return None
value = orig_value
while True:
len_half = len(value) // 2
first_half = value[:len_half]
if first_half != value[len_half:]:
break
value = first_half
len_value = len(value)
split = value.split
for i in (i for i in range(1, len_value // 2) if len_value % i == 0):
if not any(split(value[:i])):
return value[:i]
return value if value != orig_value else None
This returns the shortest match or None if there is no match.
如果没有匹配,则返回最短匹配或 None 。
回答by davidism
Here are some benchmarks for the various answers to this question. There were some surprising results, including wildly different performance depending on the string being tested.
以下是针对此问题的各种答案的一些基准。有一些令人惊讶的结果,包括根据被测试字符串的不同表现。
Some functions were modified to work with Python 3 (mainly by replacing /
with //
to ensure integer division). If you see something wrong, want to add your function, or want to add another test string, ping @ZeroPiraeus in the Python chatroom.
一些函数被修改为与 Python 3 一起使用(主要是通过替换/
with//
以确保整数除法)。如果您发现有问题,想要添加您的函数,或者想要添加另一个测试字符串,请在Python 聊天室ping @ZeroPiraeus 。
In summary: there's about a 50x difference between the best- and worst-performing solutions for the large set of example data supplied by OP here(via thiscomment). David Zhang's solutionis the clear winner, outperforming all others by around 5x for the large example set.
总而言之:对于 OP此处提供的大量示例数据(通过此评论),性能最佳和最差的解决方案之间存在大约 50 倍的差异。David Zhang 的解决方案显然是赢家,对于大型示例集,其性能比其他所有解决方案高出约 5 倍。
A couple of the answers are veryslow in extremely large "no match" cases. Otherwise, the functions seem to be equally matched or clear winners depending on the test.
在非常大的“不匹配”情况下,有几个答案非常缓慢。否则,根据测试,这些功能似乎是同等匹配或明显的赢家。
Here are the results, including plots made using matplotlib and seaborn to show the different distributions:
以下是结果,包括使用 matplotlib 和 seaborn 绘制的图以显示不同的分布:
Corpus 1 (supplied examples - small set)
语料库 1(提供的示例 - 小集)
mean performance:
0.0003 david_zhang
0.0009 zero
0.0013 antti
0.0013 tigerhawk_2
0.0015 carpetpython
0.0029 tigerhawk_1
0.0031 davidism
0.0035 saksham
0.0046 shashank
0.0052 riad
0.0056 piotr
median performance:
0.0003 david_zhang
0.0008 zero
0.0013 antti
0.0013 tigerhawk_2
0.0014 carpetpython
0.0027 tigerhawk_1
0.0031 davidism
0.0038 saksham
0.0044 shashank
0.0054 riad
0.0058 piotr
Corpus 2 (supplied examples - large set)
语料库 2(提供的示例 - 大集)
mean performance:
0.0006 david_zhang
0.0036 tigerhawk_2
0.0036 antti
0.0037 zero
0.0039 carpetpython
0.0052 shashank
0.0056 piotr
0.0066 davidism
0.0120 tigerhawk_1
0.0177 riad
0.0283 saksham
median performance:
0.0004 david_zhang
0.0018 zero
0.0022 tigerhawk_2
0.0022 antti
0.0024 carpetpython
0.0043 davidism
0.0049 shashank
0.0055 piotr
0.0061 tigerhawk_1
0.0077 riad
0.0109 saksham
Corpus 3 (edge cases)
语料库 3(边缘情况)
mean performance:
0.0123 shashank
0.0375 david_zhang
0.0376 piotr
0.0394 carpetpython
0.0479 antti
0.0488 tigerhawk_2
0.2269 tigerhawk_1
0.2336 davidism
0.7239 saksham
3.6265 zero
6.0111 riad
median performance:
0.0107 tigerhawk_2
0.0108 antti
0.0109 carpetpython
0.0135 david_zhang
0.0137 tigerhawk_1
0.0150 shashank
0.0229 saksham
0.0255 piotr
0.0721 davidism
0.1080 zero
1.8539 riad
The tests and raw results are available here.
测试和原始结果可在此处获得。
回答by Antti Haapala
This version tries only those candidate sequence lengths that are factors of the string length; and uses the *
operator to build a full-length string from the candidate sequence:
此版本仅尝试作为字符串长度因子的候选序列长度;并使用*
运算符从候选序列构建全长字符串:
def get_shortest_repeat(string):
length = len(string)
for i in range(1, length // 2 + 1):
if length % i: # skip non-factors early
continue
candidate = string[:i]
if string == candidate * (length // i):
return candidate
return None
Thanks to TigerhawkT3 for noticing that length // 2
without + 1
would fail to match the abab
case.
感谢 TigerhawkT3 注意到length // 2
without+ 1
将无法匹配abab
案例。
回答by RiaD
The problem may also be solved in O(n)
in worst case with prefix function.
这个问题也可以O(n)
在最坏的情况下用前缀函数解决。
Note, it may be slower in general case(UPD: and is much slower) than other solutions which depend on number of divisors of n
, but usually find fails sooner, I think one of bad cases for them will be aaa....aab
, where there are n - 1 = 2 * 3 * 5 * 7 ... *p_n - 1
a
's
请注意,在一般情况下(UPD:并且慢得多),它可能比其他取决于 的除数的解决方案慢n
,但通常会更快地发现失败,我认为对他们来说不好的情况之一是aaa....aab
,那里有n - 1 = 2 * 3 * 5 * 7 ... *p_n - 1
a
's
First of all you need to calculate prefix function
首先你需要计算前缀函数
def prefix_function(s):
n = len(s)
pi = [0] * n
for i in xrange(1, n):
j = pi[i - 1]
while(j > 0 and s[i] != s[j]):
j = pi[j - 1]
if (s[i] == s[j]):
j += 1
pi[i] = j;
return pi
then either there's no answer or the shortest period is
那么要么没有答案,要么最短的时间是
k = len(s) - prefix_function(s[-1])
and you just have to check if k != n and n % k == 0
(if k != n and n % k == 0
then answer is s[:k]
, else there's no answer
你只需要检查是否k != n and n % k == 0
(如果k != n and n % k == 0
那么答案是s[:k]
,否则没有答案
You may check the proof here(in Russian, but online translator will probably do the trick)
您可以在此处查看证明(俄语,但在线翻译可能会成功)
def riad(s):
n = len(s)
pi = [0] * n
for i in xrange(1, n):
j = pi[i - 1]
while(j > 0 and s[i] != s[j]):
j = pi[j - 1]
if (s[i] == s[j]):
j += 1
pi[i] = j;
k = n - pi[-1]
return s[:k] if (n != k and n % k == 0) else None
回答by Logic Knight
I started with more than eight solutions to this problem. Some were bases on regex (match, findall, split), some of string slicing and testing, and some with string methods (find, count, split). Each had benefits in code clarity, code size, speed and memory consumption. I was going to post my answer here when I noticed that execution speed was ranked as important, so I did more testing and improvement to arrive at this:
我从超过 8 个解决这个问题的方法开始。有些基于正则表达式(match、findall、split),有些基于字符串切片和测试,有些基于字符串方法(find、count、split)。每个都在代码清晰度、代码大小、速度和内存消耗方面都有好处。当我注意到执行速度被列为重要时,我打算在这里发布我的答案,所以我做了更多的测试和改进来达到这个目的:
def repeating(s):
size = len(s)
incr = size % 2 + 1
for n in xrange(1, size//2+1, incr):
if size % n == 0:
if s[:n] * (size//n) == s:
return s[:n]
This answer seems similar to a few other answers here, but it has a few speed optimisations others have not used:
这个答案似乎与这里的其他一些答案相似,但它有一些其他人没有使用过的速度优化:
xrange
is a little faster in this application,- if an input string is an odd length, do not check any even length substrings,
- by using
s[:n]
directly, we avoid creating a variable in each loop.
xrange
在此应用程序中速度稍快,- 如果输入字符串是奇数长度,则不检查任何偶数长度子字符串,
- 通过
s[:n]
直接使用,我们避免在每个循环中创建一个变量。
I would be interested to see how this performs in the standard tests with common hardware. I believe it will be well short of David Zhang's excellent algorithm in most tests, but should be quite fast otherwise.
我很想看看它在使用通用硬件的标准测试中的表现如何。我相信在大多数测试中,它远不及 David Zhang 的优秀算法,但否则应该会很快。
I found this problem to be very counter-intuitive. The solutions I thought would be fast were slow. The solutions that looked slow were fast! It seems that Python's string creation with the multiply operator and string comparisons are highly optimised.
我发现这个问题非常违反直觉。我认为很快的解决方案很慢。看起来很慢的解决方案很快!似乎 Python 使用乘法运算符和字符串比较创建字符串是高度优化的。