What is the most efficient way to find one of several substrings in Python?

Question

Asked by Roee Adler

I have a list of possible substrings, e.g. ['cat', 'fish', 'dog']. In practice, the list contains hundreds of entries.

I'm processing a string, and what I'm looking for is to find the index of the first appearance of any of these substrings.

To clarify, for '012cat' the result is 3, and for '0123dog789cat' the result is 4.

I also need to know which substring was found (e.g. its index in the substring list or the text itself), or at least the length of the substring matched.

There are obvious brute-force ways to achieve this; I wondered if there's an elegant Python/regex solution.

Answer 1

Answered by Tom

I would assume a regex is better than checking for each substring individually because, conceptually, the regular expression is modeled as a DFA, so as the input is consumed all matches are tested for at the same time (resulting in one scan of the input string).

So, here is an example:

import re

def work():
  to_find = re.compile("cat|fish|dog")
  search_str = "blah fish cat dog haha"
  match_obj = to_find.search(search_str)
  the_index = match_obj.start()  # produces 5, the index of fish
  which_word_matched = match_obj.group()  # "fish"
  # Note, if no match, match_obj is None

UPDATE: Some care should be taken when combining words into a single pattern of alternatives. The following code builds a regex, but escapes any regex special characters and sorts the words so that longer words get a chance to match before any shorter prefixes of the same word:

def wordlist_to_regex(words):
    escaped = map(re.escape, words)
    combined = '|'.join(sorted(escaped, key=len, reverse=True))
    return re.compile(combined)

>>> r = wordlist_to_regex(words)  # words: the list of search terms, not shown here
>>> r.search('smash atomic particles').span()
(6, 10)
>>> r.search('visit usenet:comp.lang.python today').span()
(13, 29)
>>> r.search('a north\south division').span()
(2, 13)
>>> r.search('012cat').span()
(3, 6)
>>> r.search('0123dog789cat').span()
(4, 7)
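
The question also asks which entry of the list matched. The following is a minimal sketch of one way to report that, not part of the original answer; `find_first` and the sample word list are hypothetical names chosen for illustration, and the list is assumed to be non-empty.

```python
import re

def find_first(text, substrings):
    """Return (position, list_index, matched_text) for the earliest match, or None."""
    # Escape regex metacharacters and try longer alternatives first.
    ordered = sorted(substrings, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(s) for s in ordered))
    match = pattern.search(text)
    if match is None:
        return None
    found = match.group()
    return match.start(), substrings.index(found), found

words = ['cat', 'fish', 'dog']  # hypothetical sample list
print(find_first('012cat', words))         # (3, 0, 'cat')
print(find_first('0123dog789cat', words))  # (4, 2, 'dog')
print(find_first('no pets here', words))   # None
```

This sketch recompiles the pattern on every call to stay short; in real use the compiled pattern would be built once and reused.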

END UPDATE

Note that you will want to build the regex (i.e. call re.compile()) as rarely as possible. The best case is that you know ahead of time what your searches are (or you compute them once or infrequently) and save the result of re.compile somewhere. My example is just a simple nonsense function so you can see how the regex is used. There are some more regex docs here:

http://docs.python.org/library/re.html
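
The compile-once advice above could be sketched roughly as follows. `SubstringFinder` is a hypothetical wrapper invented for this illustration, not something from the `re` module; it only uses `re.compile`, `re.escape`, and `Pattern.search`.

```python
import re

class SubstringFinder:
    """Compile the alternation once, then reuse it for many searches."""

    def __init__(self, substrings):
        # Longest first, so a word is tried before any of its shorter prefixes.
        ordered = sorted(substrings, key=len, reverse=True)
        self._pattern = re.compile('|'.join(map(re.escape, ordered)))

    def first(self, text):
        """Return (position, matched_text) for the earliest match, or None."""
        match = self._pattern.search(text)
        return None if match is None else (match.start(), match.group())

finder = SubstringFinder(['cat', 'fish', 'dog'])  # compiled here, once
for line in ['012cat', '0123dog789cat', 'nothing']:
    print(finder.first(line))
# (3, 'cat')
# (4, 'dog')
# None
```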

Hope this helps.

UPDATE: I am unsure how Python implements regular expressions, but to answer Rax's question about whether re.compile() has limitations (for example, how many words you can "|" together to match at once) and how long compiling takes: neither seems to be an issue. I tried out the code below, which is good enough to convince me. (I could have made it better by adding timing and reporting results, and by putting the words into a set to ensure there are no duplicates, but both improvements seem like overkill.) The code ran basically instantaneously and convinced me that I can search for 2000 words (of length 10) and that any of them will match appropriately. Here is the code:

import random
import re
import string
import sys

def main(args):
    # Build 2000 random 10-character words from letters and digits
    words = []
    letters_and_digits = "%s%s" % (string.ascii_letters, string.digits)
    for i in range(2000):
        chars = []
        for j in range(10):
            chars.append(random.choice(letters_and_digits))
        words.append(("%s"*10) % tuple(chars))
    search_for = re.compile("|".join(words))
    first, middle, last = words[0], words[len(words) // 2], words[-1]
    # The last word comes first in the string, so it should match at index 0
    search_string = "%s, %s, %s" % (last, middle, first)
    match_obj = search_for.search(search_string)
    if match_obj is None:
        print("Ahhhg")
        return
    index = match_obj.start()
    which = match_obj.group()
    if index != 0:
        print("ahhhg")
        return
    if words[-1] != which:
        print("ahhg")
        return

    print("success!!! Generated 2000 random words, compiled re, and was able to perform matches.")

if __name__ == "__main__":
    main(sys.argv)

UPDATE: It should be noted that the order of the alternatives ORed together in the regex matters. Have a look at the following test, inspired by TZOTZIOY:

>>> search_str = "01catdog"
>>> test1 = re.compile("cat|catdog")
>>> match1 = test1.search(search_str)
>>> match1.group()
'cat'
>>> match1.start()
2
>>> test2 = re.compile("catdog|cat")  # reverse order
>>> match2 = test2.search(search_str)
>>> match2.group()
'catdog'
>>> match2.start()
2

This suggests the order matters :-/. I am not sure what this means for Rax's application, but at least the behavior is known.
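
In both runs above the match starts at index 2; the order of the alternatives only decides which one wins at the same starting position. The sketch below, using a hypothetical helper `build`, illustrates that sorting the words longest-first makes the result independent of the list order, while an earlier position still beats an earlier alternative.

```python
import re

def build(words):
    # Longest alternatives first, so 'catdog' is tried before its prefix 'cat'.
    ordered = sorted(words, key=len, reverse=True)
    return re.compile('|'.join(re.escape(w) for w in ordered))

text = "01catdog"
for words in (['cat', 'catdog'], ['catdog', 'cat']):
    match = build(words).search(text)
    print(words, '->', match.group(), match.start())  # 'catdog' at 2 both times

# Position takes priority over alternation order:
# 'cat' is listed second but starts earlier in the text, so it is returned.
print(re.compile('dog|cat').search(text).group())
```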

UPDATE: I posted this question about the implementation of regular expressions in Python, which will hopefully give us some insight into the issues found with this question.

Answer 2

Answered by Unknown

import re

subs = ['cat', 'fish', 'dog']
sentences = ['0123dog789cat']

# Compile the alternation once; re.escape guards against regex metacharacters
pattern = re.compile("|".join(map(re.escape, subs)))

def search():
    for sentence in sentences:
        result = pattern.search(sentence)
        if result is not None:
            return (result.group(), result.start())

print(search())  # ('dog', 4)

Answer 3

Answered by Nick Presta

I just want to point out the time difference between DisplacedAussie's answer and Tom's answer. Both were fast when used once, so you shouldn't have any noticeable wait for either, but when you time them:

import random
import re
import string

words = []
letters_and_digits = "%s%s" % (string.ascii_letters, string.digits)
for i in range(2000):
    chars = []
    for j in range(10):
        chars.append(random.choice(letters_and_digits))
    words.append(("%s"*10) % tuple(chars))
search_for = re.compile("|".join(words))
first, middle, last = words[0], words[len(words) // 2], words[-1]
search_string = "%s, %s, %s" % (last, middle, first)

def _search(search_for, search_string):
    match_obj = search_for.search(search_string)
    # Note, if no match, match_obj is None
    if match_obj is not None:
        return (match_obj.start(), match_obj.group())

def _map(search_for, search_string):
    # Recover the word list from the compiled pattern
    candidates = search_for.pattern.split("|")
    found = list(map(lambda x: (search_string.index(x), x), filter(lambda x: x in search_string, candidates)))
    if found:
        return min(found, key=lambda x: x[0])


if __name__ == '__main__':
    from timeit import Timer

    t = Timer("_search(search_for, search_string)", "from __main__ import _search, search_for, search_string")
    print(_search(search_for, search_string))
    print(t.timeit())

    t = Timer("_map(search_for, search_string)", "from __main__ import _map, search_for, search_string")
    print(_map(search_for, search_string))
    print(t.timeit())

Outputs:

(0, '841EzpjttV')
14.3660159111
(0, '841EzpjttV')
# I couldn't wait this long

I would go with Tom's answer, for both readability and speed.
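
For anyone who wants to repeat the comparison without the long wait: `Timer.timeit()` runs the statement 1,000,000 times by default, which is why the second measurement never finished. The sketch below is one possible variant with a much smaller, explicit run count. The names `regex_search`, `scan_each`, and `runs`, and the seeded random data, are assumptions made for this illustration; the absolute timings will vary by machine.

```python
import random
import re
import string
from timeit import timeit

random.seed(0)  # reproducible hypothetical test data
alphabet = string.ascii_letters + string.digits
words = [''.join(random.choice(alphabet) for _ in range(10)) for _ in range(2000)]
pattern = re.compile('|'.join(words))
text = '%s, %s, %s' % (words[-1], words[len(words) // 2], words[0])

def regex_search():
    # One search with the precompiled alternation
    m = pattern.search(text)
    return (m.start(), m.group()) if m else None

def scan_each():
    # One membership test (and one index lookup) per candidate word
    found = [(text.index(w), w) for w in words if w in text]
    return min(found) if found else None

assert regex_search() == scan_each()  # both approaches agree on the result

runs = 200  # far fewer than timeit's default of 1,000,000
print('regex, seconds per call:', timeit(regex_search, number=runs) / runs)
print('scan,  seconds per call:', timeit(scan_each, number=runs) / runs)
```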

Answer 4

Answered by Wesley

This is a vague, theoretical answer with no code provided, but I hope it can point you in the right direction.

First, you will need a more efficient lookup for your substring list. I would recommend some sort of tree structure. Start with a root, then add an 'a' node if any substrings start with 'a', add a 'b' node if any substrings start with 'b', and so on. For each of these nodes, keep adding subnodes.

For example, if you have a substring with the word "ant", you should have a root node, a child node 'a', a grandchild node 'n', and a great grandchild node 't'.

Nodes should be easy enough to make.

class Node(object):
    def __init__(self, name):
        self.name = name
        # Per-instance list; a class-level list would be shared by every node
        self.children = []

where name is a character.

Iterate through your strings letter by letter. Keep track of which letter you're on. At each letter, try to use the next few letters to traverse the tree. If you're successful, your letter number will be the position of the substring, and your traversal order will indicate the substring that was found.
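
The traversal described above might look something like the following sketch. It is an illustration under stated assumptions rather than the answer's own code: it uses nested dicts instead of the `Node` class, and `END`, `build_trie`, and `find_first_in_trie` are hypothetical names.

```python
END = object()  # sentinel key marking the end of a stored substring

def build_trie(substrings):
    """Nested dicts keyed by character; node[END] holds the complete substring."""
    root = {}
    for word in substrings:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word
    return root

def find_first_in_trie(text, trie):
    """Return (position, substring) for the earliest match, or None."""
    for start in range(len(text)):
        node = trie
        longest = None
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break  # no stored substring continues with this character
            if END in node:
                longest = node[END]  # keep walking in case a longer entry matches
        if longest is not None:
            return start, longest
    return None

trie = build_trie(['cat', 'fish', 'dog'])
print(find_first_in_trie('0123dog789cat', trie))  # (4, 'dog')
```

Adding a new substring only touches the nodes along its path, which is the case where this structure may be more convenient than rebuilding a regex.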

Clarifying edit: DFAs should be much faster than this method, so I should endorse Tom's answer. I'm only keeping this answer up in case your substring list changes often, in which case using a tree might be faster.

Answer 5

Answered by Anonymous

First of all, I would suggest sorting the initial list in ascending order of length, because scanning for a shorter substring is faster than scanning for a longer one.

Answer 6

Answered by DisplacedAussie

How about this one?

>>> substrings = ['cat', 'fish', 'dog']
>>> _string = '0123dog789cat'
>>> found = list(map(lambda x: (_string.index(x), x), filter(lambda x: x in _string, substrings)))
>>> found
[(10, 'cat'), (4, 'dog')]
>>> if found:
...     min(found, key=lambda x: x[0])
...
(4, 'dog')

Obviously, you could return something other than a tuple.

This works by:

Filtering the list of substrings down to those that are in the string
Building a list of tuples containing the index of the substring, and the substring
If a substring has been found, find the minimum value based on the index
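
The steps above can also be folded into a single loop. This is a sketch under the assumption that one `str.find` call per substring is acceptable; `find_first_bruteforce` is a hypothetical name, and using `find` avoids scanning the string twice (once for `in`, once for `index`).

```python
def find_first_bruteforce(text, substrings):
    """Return (position, substring) for the earliest match, or None."""
    best = None
    for sub in substrings:
        pos = text.find(sub)  # -1 when absent; one scan per substring
        if pos != -1 and (best is None or pos < best[0]):
            best = (pos, sub)
    return best

print(find_first_bruteforce('0123dog789cat', ['cat', 'fish', 'dog']))  # (4, 'dog')
```

When two substrings start at the same position, the one listed first is kept.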
