Python Fastest way to check if a string contains specific characters in any of the items in a list

Question

提问by Muhammad Taha

What is the fastest way to check if a string contains some characters from any items of a list?

Currently, I'm using this method:

lestring = "Text123"

lelist = ["Text", "foo", "bar"]

for x in lelist:
    if lestring.count(x):
        print 'Yep. "%s" contains characters from "%s" item.' % (lestring, x)

Is there any way to do it without iteration (which will make it faster I suppose.)?

Answer 1

采纳答案by Abhijit

You can try list comprehension with membership check

>>> lestring = "Text123"
>>> lelist = ["Text", "foo", "bar"]
>>> [e for e in lelist if e in lestring]
['Text']

Compared to your implementation, though LC has an implicit loop but its faster as there is no explicit function call as in your case with count

Compared to Joe's implementation, yours is way faster, as the filter function would require to call two functions in a loop, lambdaand count

>>> def joe(lelist, lestring):
    return ''.join(random.sample(x + 'b'*len(x), len(x)))

>>> def uz(lelist, lestring):
    for x in lelist:
        if lestring.count(x):
            return 'Yep. "%s" contains characters from "%s" item.' % (lestring, x)


>>> def ab(lelist, lestring):
    return [e for e in lelist if e in lestring]

>>> t_ab = timeit.Timer("ab(lelist, lestring)", setup="from __main__ import lelist, lestring, ab")
>>> t_uz = timeit.Timer("uz(lelist, lestring)", setup="from __main__ import lelist, lestring, uz")
>>> t_joe = timeit.Timer("joe(lelist, lestring)", setup="from __main__ import lelist, lestring, joe")
>>> t_ab.timeit(100000)
0.09391469893125759
>>> t_uz.timeit(100000)
0.1528471407273173
>>> t_joe.timeit(100000)
1.4272649857800843

Jamie's commented solution is slower for shorter string's. Here is the test result

>>> def jamie(lelist, lestring):
    return next(itertools.chain((e for e in lelist if e in lestring), (None,))) is not None

>>> t_jamie = timeit.Timer("jamie(lelist, lestring)", setup="from __main__ import lelist, lestring, jamie")
>>> t_jamie.timeit(100000)
0.22237164127909637

If you need Boolean values, for shorter strings, just modify the above LC expression

[e in lestring for e in lelist if e in lestring]

Or for longer strings, you can do the following

>>> next(e in lestring for e in lelist if e in lestring)
True

or

>>> any(e in lestring for e in lelist)

Answer 2

回答by Joe

filter(lambda x: lestring.count(x), lelist)

That will return all the strings that you're trying to find as a list.

Answer 3

回答by theodox

if the test is to see if there are any characters in common (not words or segments), create a set out of the letters in the list and then check the letters agains the string:

char_list = set(''.join(list_of_words))
test_set = set(string_to_teat)
common_chars = char_list.intersection(test_set)

However I'm assuming you're looking for as little as one character in common...

Answer 4

回答by moomima

The esmrelibrary does the trick. In your case, the simpler, esm(part of esmre) is what you want.

https://pypi.python.org/pypi/esmre/

https://code.google.com/p/esmre/

They have good documentation and examples: Taken from their examples:

>>> import esm
>>> index = esm.Index()
>>> index.enter("he")
>>> index.enter("she")
>>> index.enter("his")
>>> index.enter("hers")
>>> index.fix()
>>> index.query("this here is history")
[((1, 4), 'his'), ((5, 7), 'he'), ((13, 16), 'his')]
>>> index.query("Those are his sheep!")
[((10, 13), 'his'), ((14, 17), 'she'), ((15, 17), 'he')]
>>>

I ran some performance tests:

import random, timeit, string, esm

def uz(lelist, lestring):
    for x in lelist:
        if lestring.count(x):
            return 'Yep. "%s" contains characters from "%s" item.' % (lestring, x)



def ab(lelist, lestring):
    return [e for e in lelist if e in lestring]


def use_esm(index, lestring):
    return index.query(lestring)

for TEXT_LEN in [5, 50, 1000]:
    for SEARCH_LEN in [5, 20]:
        for N in [5, 50, 1000, 10000]:
            if TEXT_LEN < SEARCH_LEN:
                continue

            print 'TEXT_LEN:', TEXT_LEN, 'SEARCH_LEN:', SEARCH_LEN, 'N:', N

            lestring = ''.join((random.choice(string.ascii_uppercase + string.digits) for _ in range(TEXT_LEN)))
            lelist = [''.join((random.choice(string.ascii_uppercase + string.digits) for _ in range(SEARCH_LEN))) for _
                      in range(N)]

            index = esm.Index()
            for i in lelist:
                index.enter(i)
            index.fix()

            t_ab = timeit.Timer("ab(lelist, lestring)", setup="from __main__ import lelist, lestring, ab")
            t_uz = timeit.Timer("uz(lelist, lestring)", setup="from __main__ import lelist, lestring, uz")
            t_esm = timeit.Timer("use_esm(index, lestring)", setup="from __main__ import index, lestring, use_esm")

            ab_time = t_ab.timeit(1000)
            uz_time = t_uz.timeit(1000)
            esm_time = t_esm.timeit(1000)

            min_time = min(ab_time, uz_time, esm_time)
            print '  ab%s: %f' % ('*' if ab_time == min_time else '', ab_time)
            print '  uz%s: %f' % ('*' if uz_time == min_time else '', uz_time)
            print '  esm%s %f:' % ('*' if esm_time == min_time else '', esm_time)

And got that results depends mostly on the number of items that one is looking for (in my case, 'N'):

TEXT_LEN: 1000 SEARCH_LEN: 20 N: 5
  ab*: 0.001733
  uz: 0.002512
  esm 0.126853:

TEXT_LEN: 1000 SEARCH_LEN: 20 N: 50
  ab*: 0.017564
  uz: 0.023701
  esm 0.079925:

TEXT_LEN: 1000 SEARCH_LEN: 20 N: 1000
  ab: 0.370371
  uz: 0.489523
  esm* 0.133783:

TEXT_LEN: 1000 SEARCH_LEN: 20 N: 10000
  ab: 3.678790
  uz: 4.883575
  esm* 0.259605:

Python Fastest way to check if a string contains specific characters in any of the items in a list

提问by Muhammad Taha

采纳答案by Abhijit

回答by Joe

回答by theodox

回答by moomima

相关推荐

最近更新

标签

Python Fastest way to check if a string contains specific characters in any of the items in a list

提问by Muhammad Taha

采纳答案by Abhijit

回答by Joe

回答by theodox

回答by moomima

相关推荐

Python 从 csv 文件中删除换行符

将一串逗号分隔的整数解析为整数列表的“pythonic”方法？

Python 熊猫系列的重塑？

在 Python 中解压列表？

相关推荐

最近更新

标签