Python 在字符串中查找第一个非重复字符的最佳方法

Question

提问by Sandy

What would be the best space and time efficient solution to find the first non repeating character for a string like aabccbdcbe?

为字符串找到第一个非重复字符的最佳空间和时间效率解决方案是aabccbdcbe什么？

The answer here is d. So the point that strikes me is that it can be done in two ways:

这里的答案是 d。所以让我印象深刻的一点是它可以通过两种方式完成：

For every index i loop i-1 times and check if that character occurs ever again. But this is not efficient: growth of this method is O(N^2) where N is the length of the string.
Another possible good way could be if I could form a tree or any other ds such that I could order the character based on the weights (the occurrence count). This could take me just one loop of length N through the string to form the structure. That is just O(N) + O(time to build tree or any ds).

对于每个索引，我循环 i-1 次并检查该字符是否再次出现。但这效率不高：这种方法的增长是 O(N^2)，其中 N 是字符串的长度。
另一种可能的好方法是，如果我可以形成一棵树或任何其他 ds，以便我可以根据权重（出现次数）对角色进行排序。这可能只需要我通过字符串形成一个长度为 N 的循环来形成结构。那只是 O(N) + O（构建树或任何 ds 的时间）。

Answer 1

采纳答案by Aaron Dufour

Here's a very straightforward O(n)solution:

这是一个非常简单的O(n)解决方案：

def fn(s):
  order = []
  counts = {}
  for x in s:
    if x in counts:
      counts[x] += 1
    else:
      counts[x] = 1 
      order.append(x)
  for x in order:
    if counts[x] == 1:
      return x
  return None

We loop through the string once. When we come across a new character, we store it in countswith a value of 1, and append it to order. When we come across a character we've seen before, we increment its value in counts. Finally, we loop through orderuntil we find a character with a value of 1in countsand return it.

我们遍历字符串一次。当我们遇到一个新字符时，我们将它counts的值存储在中1，并将其附加到中order。当我们遇到一个我们以前见过的字符时，我们将它的值增加到counts. 最后，我们循环order直到找到一个值为1in的字符counts并返回它。

Answer 2

回答by TyrantWave

A list comprehension will give you the characters in the order they appear if they only appear once:

如果字符只出现一次，列表解析将按照它们出现的顺序为您提供字符：

In [61]: s = 'aabccbdcbe'

In [62]: [a for a in s if s.count(a) == 1]
Out[62]: ['d', 'e']

Then just return the first entry of this:

然后只返回这个的第一个条目：

In [63]: [a for a in s if s.count(a) == 1][0]
Out[63]: 'd'

If you just need the first entry, a generator would work as well:

如果您只需要第一个条目，生成器也可以工作：

In [69]: (a for a in s if s.count(a) == 1).next()
Out[69]: 'd'

Answer 3

回答by Oin

So from the definition of the problem, it's clear that you need an O(n) solution, which means only going through the list once. All of the solutions which use a form of count are wrong, since they go through the list again in that operation. So you need to keep track of the counts yourself.

所以从问题的定义来看，很明显你需要一个 O(n) 的解决方案，这意味着只遍历列表一次。所有使用计数形式的解决方案都是错误的，因为它们在该操作中再次遍历列表。所以你需要自己跟踪计数。

If you only had characters in that string, then you don't need to worry about storage and you could just use the character as a key in a dict. The values in that dict will be the index of the character in the string s. At the end we have to see which one was the first by calculating the minimum of the dictionary's values. This is an O(n) operation on a (possibly) shorter list than the first.

如果该字符串中只有字符，则无需担心存储问题，只需将该字符用作 dict 中的键即可。该字典中的值将是字符串 s 中字符的索引。最后，我们必须通过计算字典值的最小值来查看哪个是第一个。这是一个（可能）比第一个更短的列表上的 O(n) 操作。

The total will still be O(c*n) therefore O(n).

总数仍为 O(c*n)，因此为 O(n)。

from operator import itemgetter

seen = set()
only_appear_once = dict()

for i, x in enumerate(s):
  if x in seen and x in only_appear_once:
    only_appear_once.pop(x)
  else:
    seen.add(x)
    only_appear_once[x] = i

first_count_of_one = only_appear_once[min(only_appear_once.values(), key=itemgetter(1))]

Answer 4

回答by Roman Fursenko

I think that removing of the repeating characters from the string may significantly reduce the number of operations. For example:

我认为从字符串中删除重复字符可能会显着减少操作次数。例如：

s = "aabccbdcbe"
while s != "":
    slen0 = len(s)
    ch = s[0]
    s = s.replace(ch, "")
    slen1 = len(s)
    if slen1 == slen0-1:
        print ch
        break;
else:
    print "No answer"

Answer 5

回答by dmg

Here is an approach with goodset of chars and badset of chars (which appear more than one time):

这是一good组字符和一bad组字符（出现多次）的方法：

import timeit
import collections
import operator
import random

s = [chr(i) for i in range(ord('a'), ord('z')) for j in range(100)] + ['z']

random.shuffle(s)
s = ''.join(s)

def good_bad_sets(s):
    setbad = set()
    setgood = set()
    for char in s:
        if(char not in setbad):
            if(char in setgood):
                setgood.remove(char)
                setbad.add(char)
            else:
                setgood.add(char)
    return s[min([s.index(char) for char in setgood])] if len(s) > 0 else None

def app_once(s):
    seen = set()
    only_appear_once = set()
    for i in s:
      if i in seen:
        only_appear_once.discard(i)
      else:
        seen.add(i)
        only_appear_once.add(i)
    return s[min([s.index(char) for char in only_appear_once])] if len(only_appear_once) > 0 else None

print('Good bad sets: %ss' % timeit.Timer(lambda : good_bad_sets(s)).timeit(100))
print('Oin\'s approach: %ss' % timeit.Timer(lambda : app_once(s)).timeit(100))
print('LC: %ss' % timeit.Timer(lambda : [a for a in s if s.count(a) == 1][0]).timeit(100))

I've compared it to the LC approach and somewhere around 50 chars the goodand badset approach becomes faster. Comparison of this approach vs. Oin'svs LC:

我将它与 LC 方法进行了比较，并且在大约 50 个字符的地方，good并且badset 方法变得更快。这种方法与Oin与 LC的比较：

Good bad sets: 0.0419239997864s
Oin's approach: 0.0803039073944s
LC: 0.647999048233s

Answer 6

回答by Janne Karila

collections.Countercounts efficiently(*) and collections.OrderedDictremembers the order in which items were first seen. Let's use multiple inheritance to combine the benefits:

collections.Counter有效计数（*）并collections.OrderedDict记住第一次看到项目的顺序。让我们使用多重继承来结合好处：

from collections import Counter, OrderedDict

class OrderedCounter(Counter, OrderedDict):
    pass

def first_unique(iterable):
    c = OrderedCounter(iterable)
    for item, count in c.iteritems():
        if count == 1:
            return item

print first_unique('aabccbdcbe')
#d            
print first_unique('abccbdcbe')            
#a

Counteruses its superclass dictto store the counts. Defining class OrderedCounter(Counter, OrderedDict)inserts OrderedDictbetween Counterand dictin method resolution order, adding the ability to remember insertion order.

Counter使用其超类dict来存储计数。在方法解析顺序之间和中定义class OrderedCounter(Counter, OrderedDict)插入，添加记住插入顺序的能力。OrderedDictCounterdict

(*) This is O(n) and efficient in that sense, but not the fastest solution, as benchmarks show.

(*) 从这个意义上说，这是 O(n) 并且高效，但不是最快的解决方案，如基准测试所示。

Answer 7

回答by eyquem

The speed of the search depends on several factors:

搜索速度取决于几个因素：

the length of the string
the position before which there is not a one-time-occuring character
the size of the string after this position
the number of different characters occuring in the string

字符串的长度
之前没有一次性出现的字符的位置
此位置后字符串的大小
字符串中出现的不同字符的数量

.

In the following code, I firstly define a string s
with the help of random.choice()and the a group of one-time-occurring characters named unik,
from two strings s1and s2that I concatenate : s1 + s2
where:

在下面的代码，我首先定义了一个字符串s
的帮助下random.choice()，并命名一批一次性出现的人物unik，
两个字符串s1和s2我连击：s1 + s2
其中：

s1is a string of length nwoin which there is NOT ANY one-time-occurring character
s2is a string of length nwiin which THERE IS one-time-occurring characters

s1是一个长度的字符串，nwo其中没有任何一次性出现的字符
s2是一个长度的字符串，nwi其中有一次出现的字符

.

#### creation of s from s1 and s2 #########

from random import choice

def without(u,n):
    letters = list('abcdefghijklmnopqrstuvwxyz')
    for i in xrange(n):
        c = choice(letters)
        if c not in unik:
            yield c

def with_un(u,n):
    letters = list('abcdefghijklmnopqrstuvwxyz')
    ecr = []
    for i in xrange(n):
        c = choice(letters)
        #ecr.append('%d %s  len(letters) == %d' % (i,c,len(letters)))
        yield c
        if c in unik:
            letters.remove(c)
    #print '\n'.join(ecr)

unik = 'ekprw'
nwo,nwi = 0,500
s1 = ''.join(c for c in without(unik,nwo))
s2 = ''.join(c for c in with_un(unik,nwi))
s = s1 + s2

if s1:
    print '%-27ss2 : %d chars' % ('s1 : %d chars' % len(s1),len(s2))
    for el in 'ekprw':
        print ('s1.count(%s) == %-12ds2.count(%s) == %d'
               % (el,s1.count(el),el,s2.count(el)))
    others = [c for c in 'abcdefghijklmnopqrstuvwxyz' if c not in unik]
    print 's1.count(others)>1 %s' % all(s1.count(c)>1 for c in others)
else:
    print "s1 == ''     len(s2) == %d" % len(s2)
    for el in 'ekprw':
        print ('   -         s2.count(%s) == %d'
               % (el,s2.count(el)))
print 'len of s  == %d\n' % len(s)

Then there is the benchmarking.
Varying the numbers nwoand nwiwe see the influence on the speed:

然后是基准测试。
改变数字nwo，nwi我们看到对速度的影响：

### benchmark of three solutions #################

from time import clock


# Janne Karila
from collections import Counter, OrderedDict
class OrderedCounter(Counter, OrderedDict):
    pass
te = clock()
c = OrderedCounter(s)
rjk = (item for item, count in c.iteritems() if count == 1).next()
tf = clock()-te
print 'Janne Karila  %.5f    found: %s' % (tf,rjk)

# eyquem
te = clock()
candidates = set(s)
li = []
for x in s:
    if x in candidates:
        li.append(x)
        candidates.remove(x)
    elif x in li:
        li.remove(x)
rey = li[0]
tf = clock()-te
print 'eyquem        %.5f    found: %s' % (tf,rey)

# TyrantWave
te = clock()
rty = (a for a in s if s.count(a) == 1).next()
tf = clock()-te
print 'TyrantWave    %.5f    found: %s' % (tf,rty)

.

Some results

一些结果

With s1of null length, nwo = 0 and nwi = 50:

对于s1空长度，nwo = 0 和 nwi = 50：

s1 == ''     len(s2) == 50
   -         s2.count(e) == 1
   -         s2.count(k) == 1
   -         s2.count(p) == 1
   -         s2.count(r) == 1
   -         s2.count(w) == 1
len of s  == 50

Janne Karila  0.00077    found: e
eyquem        0.00013    found: e
TyrantWave    0.00005    found: e

TyrantWave's solutions is the faster because the first one-occurring-char is found rapidly in the first positions of the string

TyrantWave 的解决方案更快，因为在字符串的第一个位置快速找到第一个出现的字符

.

With nwo = 300 and nwi = 50
(hereafter 401 chars for s1because occurrences of one-time-occurring chars weren't retained during construct of s1, see function without() )

使用 nwo = 300 和 nwi = 50
（以下为 401 个字符，s1因为在的构造期间未保留一次性出现的字符的出现s1，请参阅函数 without() ）

s1 : 245 chars             s2 : 50 chars
s1.count(e) == 0           s2.count(e) == 1
s1.count(k) == 0           s2.count(k) == 1
s1.count(p) == 0           s2.count(p) == 1
s1.count(r) == 0           s2.count(r) == 1
s1.count(w) == 0           s2.count(w) == 1
s1.count(others)>1 True
len of s  == 295

Janne Karila  0.00167    found: e
eyquem        0.00030    found: e
TyrantWave    0.00042    found: e

This time TyrantWave's solution is longer than mine because it has to count occurrences of all the characters in the first part of sthat is to say in s1in which there are no one-time-occurring characters (they are in the second part s2)
However, to obtain a more short time with my solution, nwoneeds to be notably greater than nwi

这次 TyrantWave 的解决方案比我的要长，因为它必须计算第一部分中所有字符的出现次数，s也就是说s1其中没有一次性出现的字符（它们在第二部分中s2）
但是，为了使用我的解决方案获得更短的时间，nwo需要明显大于nwi

.

With nwo = 300 and nwi = 5000

nwo = 300 和 nwi = 5000

s1 : 240 chars             s2 : 5000 chars
s1.count(e) == 0           s2.count(e) == 1
s1.count(k) == 0           s2.count(k) == 1
s1.count(p) == 0           s2.count(p) == 1
s1.count(r) == 0           s2.count(r) == 1
s1.count(w) == 0           s2.count(w) == 1
s1.count(others)>1 True
len of s  == 5240

Janne Karila  0.01510    found: p
eyquem        0.00534    found: p
TyrantWave    0.00294    found: p

If length of s2is raised, then TyrantWave's solution is better again.

如果s2提高了的长度，那么 TyrantWave 的解决方案再次更好。

.

Conclude what you want

总结你想要的

.

EDIT

编辑

Terrific idea of Roman !
I added the solution of Roman in my benchmarking, and it won !

罗马的好主意！
我在我的基准测试中添加了 Roman 的解决方案，它赢了！

I also did some tiny modifications that improve his solution.

我还做了一些微小的修改来改进他的解决方案。

# Roman Fursenko
srf = s[:]
te = clock()
while srf != "":
    slen0 = len(srf)
    ch = srf[0]
    srf = srf.replace(ch, "")
    slen1 = len(srf)
    if slen1 == slen0-1:
        rrf = ch
        break
else:
    rrf = "No answer"
tf = clock()-te
print 'Roman Fursenko %.6f    found: %s' % (tf,rrf)

# Roman Fursenko improved
srf = s[:]
te = clock()
while not(srf is ""):
    slen0 = len(srf)
    srf = srf.replace(srf[0], "")
    if len(srf) == slen0-1:
        rrf = ch
        break
else:
    rrf = "No answer"
tf = clock()-te
print 'Roman improved %.6f    found: %s' % (tf,rrf)

print '\nindex of %s in the string :  %d' % (rty,s.index(rrf))

.

The results are:

结果是：

.

s1 == ''     len(s2) == 50
   -         s2.count(e) == 1
   -         s2.count(k) == 1
   -         s2.count(p) == 1
   -         s2.count(r) == 1
   -         s2.count(w) == 1
len of s  == 50

Janne Karila   0.0032538    found: r
eyquem         0.0001249    found: r
TyrantWave     0.0000534    found: r
Roman Fursenko 0.0000299    found: r
Roman improved 0.0000263    found: r

index of r in the string :  1

s1 == ''     len(s2) == 50
   -         s2.count(e) == 1
   -         s2.count(k) == 0
   -         s2.count(p) == 1
   -         s2.count(r) == 1
   -         s2.count(w) == 1
len of s  == 50

Janne Karila   0.0008183    found: a
eyquem         0.0001285    found: a
TyrantWave     0.0000550    found: a
Roman Fursenko 0.0000433    found: a
Roman improved 0.0000391    found: a

index of a in the string :  4

>

s1 : 240 chars             s2 : 50 chars
s1.count(e) == 0           s2.count(e) == 1
s1.count(k) == 0           s2.count(k) == 0
s1.count(p) == 0           s2.count(p) == 1
s1.count(r) == 0           s2.count(r) == 1
s1.count(w) == 0           s2.count(w) == 1
s1.count(others)>1 True
len of s  == 290

Janne Karila   0.0016390    found: e
eyquem         0.0002956    found: e
TyrantWave     0.0004112    found: e
Roman Fursenko 0.0001428    found: e
Roman improved 0.0001277    found: e

index of e in the string :  242

s1 : 241 chars             s2 : 5000 chars
s1.count(e) == 0           s2.count(e) == 1
s1.count(k) == 0           s2.count(k) == 1
s1.count(p) == 0           s2.count(p) == 1
s1.count(r) == 0           s2.count(r) == 1
s1.count(w) == 0           s2.count(w) == 1
s1.count(others)>1 True
len of s  == 5241

Janne Karila   0.0148231    found: r
eyquem         0.0053283    found: r
TyrantWave     0.0030166    found: r
Roman Fursenko 0.0007414    found: r
Roman improved 0.0007230    found: r

index of r in the string :  250

I've learned something thanks to the code of Roman:
s.replace()creates a new string and I thought that, because of that, it was a slow method.
But, I don't know for which reason, it is a really fast method.

由于 Roman 的代码，我学到了一些东西：
s.replace()创建一个新字符串，因此我认为这是一种缓慢的方法。
但是，我不知道出于什么原因，这是一种非常快速的方法。

.

EDIT 2

编辑 2

The Oin's solution is worst:

Oin 的解决方案是最糟糕的：

# Oin
from operator import itemgetter
seen = set()
only_appear_once = dict()
te = clock()
for i, x in enumerate(s):
  if x in seen and x in only_appear_once:
    only_appear_once.pop(x)
  else:
    seen.add(x)
    only_appear_once[x] = i
  fco = min(only_appear_once.items(),key=itemgetter(1))[0]
tf = clock()-te
print 'Oin            %.7f    found: %s' % (tf,fco)

results

结果

s1 == ''     len(s2) == 50
Oin            0.0007124    found: e
Janne Karila   0.0008057    found: e
eyquem         0.0001252    found: e
TyrantWave     0.0000712    found: e
Roman Fursenko 0.0000335    found: e
Roman improved 0.0000335    found: e

index of e in the string :  2


s1 : 237 chars             s2 : 50 chars
Oin            0.0029783    found: k
Janne Karila   0.0014714    found: k
eyquem         0.0002889    found: k
TyrantWave     0.0005598    found: k
Roman Fursenko 0.0001458    found: k
Roman improved 0.0001372    found: k

index of k in the string :  246


s1 : 236 chars             s2 : 5000 chars
Oin            0.0801739    found: e
Janne Karila   0.0155715    found: e
eyquem         0.0044623    found: e
TyrantWave     0.0027548    found: e
Roman Fursenko 0.0007255    found: e
Roman improved 0.0007199    found: e

index of e in the string :  244

Python 在字符串中查找第一个非重复字符的最佳方法

提问by Sandy

采纳答案by Aaron Dufour

回答by TyrantWave

回答by Oin

回答by Roman Fursenko

回答by dmg

回答by Janne Karila

回答by eyquem

EDIT

编辑

EDIT 2

编辑 2

相关推荐

最近更新

标签

Python 在字符串中查找第一个非重复字符的最佳方法

提问by Sandy

采纳答案by Aaron Dufour

回答by TyrantWave

回答by Oin

回答by Roman Fursenko

回答by dmg

回答by Janne Karila

回答by eyquem

EDIT

编辑

EDIT 2

编辑 2

相关推荐

Python 在烧瓶中重定向时发出 POST 请求

Python QLayout: 试图将 QLayout "" 添加到已经有布局的 QWidget ""

Python matplotlib 绘图集 x_ticks

Python 洗牌 vs 置换 numpy

相关推荐

最近更新

标签