Python 在字符串中查找第一个非重复字符的最佳方法
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/15495731/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Best way to find first non repeating character in a string
提问by Sandy
What would be the best space and time efficient solution to find the first non repeating character for a string like aabccbdcbe?
为字符串找到第一个非重复字符的最佳空间和时间效率解决方案是aabccbdcbe什么?
The answer here is d. So the point that strikes me is that it can be done in two ways:
这里的答案是 d。所以让我印象深刻的一点是它可以通过两种方式完成:
- For every index i loop i-1 times and check if that character occurs ever again. But this is not efficient: growth of this method is O(N^2) where N is the length of the string.
- Another possible good way could be if I could form a tree or any other ds such that I could order the character based on the weights (the occurrence count). This could take me just one loop of length N through the string to form the structure. That is just O(N) + O(time to build tree or any ds).
- 对于每个索引,我循环 i-1 次并检查该字符是否再次出现。但这效率不高:这种方法的增长是 O(N^2),其中 N 是字符串的长度。
- 另一种可能的好方法是,如果我可以形成一棵树或任何其他 ds,以便我可以根据权重(出现次数)对角色进行排序。这可能只需要我通过字符串形成一个长度为 N 的循环来形成结构。那只是 O(N) + O(构建树或任何 ds 的时间)。
采纳答案by Aaron Dufour
Here's a very straightforward O(n)solution:
这是一个非常简单的O(n)解决方案:
def fn(s):
order = []
counts = {}
for x in s:
if x in counts:
counts[x] += 1
else:
counts[x] = 1
order.append(x)
for x in order:
if counts[x] == 1:
return x
return None
We loop through the string once. When we come across a new character, we store it in countswith a value of 1, and append it to order. When we come across a character we've seen before, we increment its value in counts. Finally, we loop through orderuntil we find a character with a value of 1in countsand return it.
我们遍历字符串一次。当我们遇到一个新字符时,我们将它counts的值存储在中1,并将其附加到 中order。当我们遇到一个我们以前见过的字符时,我们将它的值增加到counts. 最后,我们循环order直到找到一个值为1in的字符counts并返回它。
回答by TyrantWave
A list comprehension will give you the characters in the order they appear if they only appear once:
如果字符只出现一次,列表解析将按照它们出现的顺序为您提供字符:
In [61]: s = 'aabccbdcbe'
In [62]: [a for a in s if s.count(a) == 1]
Out[62]: ['d', 'e']
Then just return the first entry of this:
然后只返回这个的第一个条目:
In [63]: [a for a in s if s.count(a) == 1][0]
Out[63]: 'd'
If you just need the first entry, a generator would work as well:
如果您只需要第一个条目,生成器也可以工作:
In [69]: (a for a in s if s.count(a) == 1).next()
Out[69]: 'd'
回答by Oin
So from the definition of the problem, it's clear that you need an O(n) solution, which means only going through the list once. All of the solutions which use a form of count are wrong, since they go through the list again in that operation. So you need to keep track of the counts yourself.
所以从问题的定义来看,很明显你需要一个 O(n) 的解决方案,这意味着只遍历列表一次。所有使用计数形式的解决方案都是错误的,因为它们在该操作中再次遍历列表。所以你需要自己跟踪计数。
If you only had characters in that string, then you don't need to worry about storage and you could just use the character as a key in a dict. The values in that dict will be the index of the character in the string s. At the end we have to see which one was the first by calculating the minimum of the dictionary's values. This is an O(n) operation on a (possibly) shorter list than the first.
如果该字符串中只有字符,则无需担心存储问题,只需将该字符用作 dict 中的键即可。该字典中的值将是字符串 s 中字符的索引。最后,我们必须通过计算字典值的最小值来查看哪个是第一个。这是一个(可能)比第一个更短的列表上的 O(n) 操作。
The total will still be O(c*n) therefore O(n).
总数仍为 O(c*n),因此为 O(n)。
from operator import itemgetter
seen = set()
only_appear_once = dict()
for i, x in enumerate(s):
if x in seen and x in only_appear_once:
only_appear_once.pop(x)
else:
seen.add(x)
only_appear_once[x] = i
first_count_of_one = only_appear_once[min(only_appear_once.values(), key=itemgetter(1))]
回答by Roman Fursenko
I think that removing of the repeating characters from the string may significantly reduce the number of operations. For example:
我认为从字符串中删除重复字符可能会显着减少操作次数。例如:
s = "aabccbdcbe"
while s != "":
slen0 = len(s)
ch = s[0]
s = s.replace(ch, "")
slen1 = len(s)
if slen1 == slen0-1:
print ch
break;
else:
print "No answer"
回答by dmg
Here is an approach with goodset of chars and badset of chars (which appear more than one time):
这是一good组字符和一bad组字符(出现多次)的方法:
import timeit
import collections
import operator
import random
s = [chr(i) for i in range(ord('a'), ord('z')) for j in range(100)] + ['z']
random.shuffle(s)
s = ''.join(s)
def good_bad_sets(s):
setbad = set()
setgood = set()
for char in s:
if(char not in setbad):
if(char in setgood):
setgood.remove(char)
setbad.add(char)
else:
setgood.add(char)
return s[min([s.index(char) for char in setgood])] if len(s) > 0 else None
def app_once(s):
seen = set()
only_appear_once = set()
for i in s:
if i in seen:
only_appear_once.discard(i)
else:
seen.add(i)
only_appear_once.add(i)
return s[min([s.index(char) for char in only_appear_once])] if len(only_appear_once) > 0 else None
print('Good bad sets: %ss' % timeit.Timer(lambda : good_bad_sets(s)).timeit(100))
print('Oin\'s approach: %ss' % timeit.Timer(lambda : app_once(s)).timeit(100))
print('LC: %ss' % timeit.Timer(lambda : [a for a in s if s.count(a) == 1][0]).timeit(100))
I've compared it to the LC approach and somewhere around 50 chars the goodand badset approach becomes faster. Comparison of this approach vs. Oin'svs LC:
我将它与 LC 方法进行了比较,并且在大约 50 个字符的地方,good并且badset 方法变得更快。这种方法与Oin与 LC的比较:
Good bad sets: 0.0419239997864s
Oin's approach: 0.0803039073944s
LC: 0.647999048233s
回答by Janne Karila
collections.Countercounts efficiently(*) and collections.OrderedDictremembers the order in which items were first seen. Let's use multiple inheritance to combine the benefits:
collections.Counter有效计数(*)并collections.OrderedDict记住第一次看到项目的顺序。让我们使用多重继承来结合好处:
from collections import Counter, OrderedDict
class OrderedCounter(Counter, OrderedDict):
pass
def first_unique(iterable):
c = OrderedCounter(iterable)
for item, count in c.iteritems():
if count == 1:
return item
print first_unique('aabccbdcbe')
#d
print first_unique('abccbdcbe')
#a
Counteruses its superclass dictto store the counts. Defining class OrderedCounter(Counter, OrderedDict)inserts OrderedDictbetween Counterand dictin method resolution order, adding the ability to remember insertion order.
Counter使用其超类dict来存储计数。在方法解析顺序之间和中定义class OrderedCounter(Counter, OrderedDict)插入,添加记住插入顺序的能力。OrderedDictCounterdict
(*) This is O(n) and efficient in that sense, but not the fastest solution, as benchmarks show.
(*) 从这个意义上说,这是 O(n) 并且高效,但不是最快的解决方案,如基准测试所示。
回答by eyquem
The speed of the search depends on several factors:
搜索速度取决于几个因素:
- the length of the string
- the position before which there is not a one-time-occuring character
- the size of the string after this position
- the number of different characters occuring in the string
- 字符串的长度
- 之前没有一次性出现的字符的位置
- 此位置后字符串的大小
- 字符串中出现的不同字符的数量
.
.
In the following code, I firstly define a string s
with the help of random.choice()and the a group of one-time-occurring characters named unik,
from two strings s1and s2that I concatenate : s1 + s2
where:
在下面的代码,我首先定义了一个字符串s
的帮助下random.choice(),并命名一批一次性出现的人物unik,
两个字符串s1和s2我连击:s1 + s2
其中:
s1is a string of lengthnwoin which there is NOT ANY one-time-occurring characters2is a string of lengthnwiin which THERE IS one-time-occurring characters
s1是一个长度的字符串,nwo其中没有任何一次性出现的字符s2是一个长度的字符串,nwi其中有一次出现的字符
.
.
#### creation of s from s1 and s2 #########
from random import choice
def without(u,n):
letters = list('abcdefghijklmnopqrstuvwxyz')
for i in xrange(n):
c = choice(letters)
if c not in unik:
yield c
def with_un(u,n):
letters = list('abcdefghijklmnopqrstuvwxyz')
ecr = []
for i in xrange(n):
c = choice(letters)
#ecr.append('%d %s len(letters) == %d' % (i,c,len(letters)))
yield c
if c in unik:
letters.remove(c)
#print '\n'.join(ecr)
unik = 'ekprw'
nwo,nwi = 0,500
s1 = ''.join(c for c in without(unik,nwo))
s2 = ''.join(c for c in with_un(unik,nwi))
s = s1 + s2
if s1:
print '%-27ss2 : %d chars' % ('s1 : %d chars' % len(s1),len(s2))
for el in 'ekprw':
print ('s1.count(%s) == %-12ds2.count(%s) == %d'
% (el,s1.count(el),el,s2.count(el)))
others = [c for c in 'abcdefghijklmnopqrstuvwxyz' if c not in unik]
print 's1.count(others)>1 %s' % all(s1.count(c)>1 for c in others)
else:
print "s1 == '' len(s2) == %d" % len(s2)
for el in 'ekprw':
print (' - s2.count(%s) == %d'
% (el,s2.count(el)))
print 'len of s == %d\n' % len(s)
Then there is the benchmarking.
Varying the numbers nwoand nwiwe see the influence on the speed:
然后是基准测试。
改变数字nwo,nwi我们看到对速度的影响:
### benchmark of three solutions #################
from time import clock
# Janne Karila
from collections import Counter, OrderedDict
class OrderedCounter(Counter, OrderedDict):
pass
te = clock()
c = OrderedCounter(s)
rjk = (item for item, count in c.iteritems() if count == 1).next()
tf = clock()-te
print 'Janne Karila %.5f found: %s' % (tf,rjk)
# eyquem
te = clock()
candidates = set(s)
li = []
for x in s:
if x in candidates:
li.append(x)
candidates.remove(x)
elif x in li:
li.remove(x)
rey = li[0]
tf = clock()-te
print 'eyquem %.5f found: %s' % (tf,rey)
# TyrantWave
te = clock()
rty = (a for a in s if s.count(a) == 1).next()
tf = clock()-te
print 'TyrantWave %.5f found: %s' % (tf,rty)
.
.
Some results
一些结果
With s1of null length, nwo = 0 and nwi = 50:
对于s1空长度,nwo = 0 和 nwi = 50:
s1 == '' len(s2) == 50
- s2.count(e) == 1
- s2.count(k) == 1
- s2.count(p) == 1
- s2.count(r) == 1
- s2.count(w) == 1
len of s == 50
Janne Karila 0.00077 found: e
eyquem 0.00013 found: e
TyrantWave 0.00005 found: e
TyrantWave's solutions is the faster because the first one-occurring-char is found rapidly in the first positions of the string
TyrantWave 的解决方案更快,因为在字符串的第一个位置快速找到第一个出现的字符
.
.
With nwo = 300 and nwi = 50
(hereafter 401 chars for s1because occurrences of one-time-occurring chars weren't retained during construct of s1, see function without() )
使用 nwo = 300 和 nwi = 50
(以下为 401 个字符,s1因为在 的构造期间未保留一次性出现的字符的出现s1,请参阅函数 without() )
s1 : 245 chars s2 : 50 chars
s1.count(e) == 0 s2.count(e) == 1
s1.count(k) == 0 s2.count(k) == 1
s1.count(p) == 0 s2.count(p) == 1
s1.count(r) == 0 s2.count(r) == 1
s1.count(w) == 0 s2.count(w) == 1
s1.count(others)>1 True
len of s == 295
Janne Karila 0.00167 found: e
eyquem 0.00030 found: e
TyrantWave 0.00042 found: e
This time TyrantWave's solution is longer than mine because it has to count occurrences of all the characters in the first part of sthat is to say in s1in which there are no one-time-occurring characters (they are in the second part s2)
However, to obtain a more short time with my solution, nwoneeds to be notably greater than nwi
这次 TyrantWave 的解决方案比我的要长,因为它必须计算第一部分中所有字符的出现次数,s也就是说s1其中没有一次性出现的字符(它们在第二部分中s2)
但是,为了使用我的解决方案获得更短的时间,nwo需要明显大于nwi
.
.
With nwo = 300 and nwi = 5000
nwo = 300 和 nwi = 5000
s1 : 240 chars s2 : 5000 chars
s1.count(e) == 0 s2.count(e) == 1
s1.count(k) == 0 s2.count(k) == 1
s1.count(p) == 0 s2.count(p) == 1
s1.count(r) == 0 s2.count(r) == 1
s1.count(w) == 0 s2.count(w) == 1
s1.count(others)>1 True
len of s == 5240
Janne Karila 0.01510 found: p
eyquem 0.00534 found: p
TyrantWave 0.00294 found: p
If length of s2is raised, then TyrantWave's solution is better again.
如果s2提高了 的长度,那么 TyrantWave 的解决方案再次更好。
.
.
Conclude what you want
总结你想要的
.
.
EDIT
编辑
Terrific idea of Roman !
I added the solution of Roman in my benchmarking, and it won !
罗马的好主意!
我在我的基准测试中添加了 Roman 的解决方案,它赢了!
I also did some tiny modifications that improve his solution.
我还做了一些微小的修改来改进他的解决方案。
# Roman Fursenko
srf = s[:]
te = clock()
while srf != "":
slen0 = len(srf)
ch = srf[0]
srf = srf.replace(ch, "")
slen1 = len(srf)
if slen1 == slen0-1:
rrf = ch
break
else:
rrf = "No answer"
tf = clock()-te
print 'Roman Fursenko %.6f found: %s' % (tf,rrf)
# Roman Fursenko improved
srf = s[:]
te = clock()
while not(srf is ""):
slen0 = len(srf)
srf = srf.replace(srf[0], "")
if len(srf) == slen0-1:
rrf = ch
break
else:
rrf = "No answer"
tf = clock()-te
print 'Roman improved %.6f found: %s' % (tf,rrf)
print '\nindex of %s in the string : %d' % (rty,s.index(rrf))
.
.
The results are:
结果是:
.
.
s1 == '' len(s2) == 50
- s2.count(e) == 1
- s2.count(k) == 1
- s2.count(p) == 1
- s2.count(r) == 1
- s2.count(w) == 1
len of s == 50
Janne Karila 0.0032538 found: r
eyquem 0.0001249 found: r
TyrantWave 0.0000534 found: r
Roman Fursenko 0.0000299 found: r
Roman improved 0.0000263 found: r
index of r in the string : 1
s1 == '' len(s2) == 50
- s2.count(e) == 1
- s2.count(k) == 0
- s2.count(p) == 1
- s2.count(r) == 1
- s2.count(w) == 1
len of s == 50
Janne Karila 0.0008183 found: a
eyquem 0.0001285 found: a
TyrantWave 0.0000550 found: a
Roman Fursenko 0.0000433 found: a
Roman improved 0.0000391 found: a
index of a in the string : 4
>
>
s1 : 240 chars s2 : 50 chars
s1.count(e) == 0 s2.count(e) == 1
s1.count(k) == 0 s2.count(k) == 0
s1.count(p) == 0 s2.count(p) == 1
s1.count(r) == 0 s2.count(r) == 1
s1.count(w) == 0 s2.count(w) == 1
s1.count(others)>1 True
len of s == 290
Janne Karila 0.0016390 found: e
eyquem 0.0002956 found: e
TyrantWave 0.0004112 found: e
Roman Fursenko 0.0001428 found: e
Roman improved 0.0001277 found: e
index of e in the string : 242
s1 : 241 chars s2 : 5000 chars
s1.count(e) == 0 s2.count(e) == 1
s1.count(k) == 0 s2.count(k) == 1
s1.count(p) == 0 s2.count(p) == 1
s1.count(r) == 0 s2.count(r) == 1
s1.count(w) == 0 s2.count(w) == 1
s1.count(others)>1 True
len of s == 5241
Janne Karila 0.0148231 found: r
eyquem 0.0053283 found: r
TyrantWave 0.0030166 found: r
Roman Fursenko 0.0007414 found: r
Roman improved 0.0007230 found: r
index of r in the string : 250
I've learned something thanks to the code of Roman:s.replace()creates a new string and I thought that, because of that, it was a slow method.
But, I don't know for which reason, it is a really fast method.
由于 Roman 的代码,我学到了一些东西:s.replace()创建一个新字符串,因此我认为这是一种缓慢的方法。
但是,我不知道出于什么原因,这是一种非常快速的方法。
.
.
EDIT 2
编辑 2
The Oin's solution is worst:
Oin 的解决方案是最糟糕的:
# Oin
from operator import itemgetter
seen = set()
only_appear_once = dict()
te = clock()
for i, x in enumerate(s):
if x in seen and x in only_appear_once:
only_appear_once.pop(x)
else:
seen.add(x)
only_appear_once[x] = i
fco = min(only_appear_once.items(),key=itemgetter(1))[0]
tf = clock()-te
print 'Oin %.7f found: %s' % (tf,fco)
results
结果
s1 == '' len(s2) == 50
Oin 0.0007124 found: e
Janne Karila 0.0008057 found: e
eyquem 0.0001252 found: e
TyrantWave 0.0000712 found: e
Roman Fursenko 0.0000335 found: e
Roman improved 0.0000335 found: e
index of e in the string : 2
s1 : 237 chars s2 : 50 chars
Oin 0.0029783 found: k
Janne Karila 0.0014714 found: k
eyquem 0.0002889 found: k
TyrantWave 0.0005598 found: k
Roman Fursenko 0.0001458 found: k
Roman improved 0.0001372 found: k
index of k in the string : 246
s1 : 236 chars s2 : 5000 chars
Oin 0.0801739 found: e
Janne Karila 0.0155715 found: e
eyquem 0.0044623 found: e
TyrantWave 0.0027548 found: e
Roman Fursenko 0.0007255 found: e
Roman improved 0.0007199 found: e
index of e in the string : 244

