Algorithm to find multiple string matches

Note: the content below is from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must keep the same license and attribute it to the original authors (not me): StackOverflow

Original URL: http://stackoverflow.com/questions/3260962/
Asked by Dwight Kelly
I'm looking for suggestions for an efficient algorithm for finding all matches in a large body of text. The terms to search for will be contained in a list that can have 1000+ entries, and each search term may be one or more words.
Obviously I could make multiple passes through the text, comparing against each search term in turn. Not too efficient.
I've thought about ordering the search terms and combining common sub-segments. That way I could eliminate large numbers of terms quickly. The language is C++ and I can use Boost.
An example of the search terms could be a list of Fortune 500 company names.
Ideas?
Answered by Dr. belisarius
Don't reinvent the wheel
This problem has been intensively researched. Curiously, the best algorithms for searching ONE pattern/string do not extrapolate easily to multi-string matching.
The "grep"family implement the multi-string search in a very efficient way. If you can use them as external programs, do it.
该“grep的”家庭实现了非常高效的方式多字符串搜索。如果您可以将它们用作外部程序,那就去做吧。
In case you really need to implement the algorithm, I think the fastest way is to reproduce what agrep does (agrep excels at multi-string matching!). Here are the source and executable files.
And here you will find a paper describing the algorithms used, the theoretical background, and a lot of information and pointers about string matching.
A note of caution: multiple-string matching has been heavily researched by people like Knuth, Boyer, Moore, Baeza-Yates, and others. If you need a really fast algorithm, don't hesitate to stand on their broad shoulders. Don't reinvent the wheel.
Answered by Pedro Gimeno
As in the case of single patterns, there are several algorithms for multiple-pattern matching, and you will have to find the one that fits your purpose best. The paper A fast algorithm for multi-pattern searching (archived copy) does a review of most of them, including Aho-Corasick (which is kind of the multi-pattern version of the Knuth-Morris-Pratt algorithm, with linear complexity) and Commentz-Walter (a combination of Boyer-Moore and Aho-Corasick), and introduces a new one, which uses ideas from Boyer-Moore for the task of matching multiple patterns.
An alternative, hash-based algorithm not mentioned in that paper is the Rabin-Karp algorithm, which has a worse worst-case complexity than the other algorithms but compensates by reducing the linear factor via hashing. Which one is better ultimately depends on your use case. You may need to implement several of them and compare them in your application if you want to choose the fastest one.
Answered by Pedro Gimeno
Assuming that the large body of text is static English text and you need to match whole words, you can try the following (you should really clarify in your question what exactly counts as a 'match', what kind of text you are looking at, etc.).
First preprocess the whole document into a trie or a DAWG (directed acyclic word graph).

The trie/DAWG has the following property:
Given a trie/DAWG and a search term of length K, you can look up the data associated with the word (or tell that there is no match) in O(K) time.
Using a DAWG could save you more space than a trie. Tries exploit the fact that many words share a common prefix, and DAWGs exploit the common-prefix as well as the common-suffix property.
In the trie, also maintain the exact list of positions of each word. For example, if the text is

That is that and so it is.

then the node for the last t in that will have the list {1,3} associated, and the node for the s in is will have the list {2,7} associated.
Now when you get a single-word search term, you can walk the trie and easily get the list of matches for that word.
If you get a multiple-word search term, you can do the following.
Walk the trie with the first word in the search term. Get the list of matches and insert the positions into a hash table H1.
Now walk the trie with the second word in the search term. Get its list of matches. For each match position x, check whether x-1 exists in the hash table H1. If so, add x to a new hash table H2.
Walk the trie with the third word and get its list of matches. For each match position y, check whether y-1 exists in H2; if so, add y to a new hash table H3.
Continue in this way for the remaining words.
At the end you get the list of matches for the whole search phrase, which gives the positions of the last word of the phrase.
You could potentially optimize the phrase-matching step by keeping each word's position list sorted and doing a binary search: e.g. for each key k in H2, binary-search for k+1 in the sorted position list of the third search word, and add k+1 to H3 if you find it, and so on.
Answered by polygenelubricants
An optimal solution for this problem is to use a suffix tree (or a suffix array). It's essentially a trie of all suffixes of a string. For a text of length N, this can be built in O(N).
Then all k occurrences of a string of length m can be found optimally in O(m + k).
Suffix trees can also be used to efficiently find, e.g., the longest palindrome, the longest common substring, the longest repeated substring, etc.
This is the typical data structure to use when analyzing DNA strings, which can be millions or billions of bases long.
See also
- Wikipedia/Suffix tree
- Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology (Dan Gusfield).
Answered by eruciform
So you have lots of search terms and want to see if any of them are in the document?
Purely algorithmically, you could sort all your possibilities in alphabetical order, join them with pipes, and use them as a regular expression, if the regex engine will look at /ant|ape/ and properly short-circuit the a in "ape" if it didn't find it in "ant". If not, you could do a "precompile" of the regex and "squish" the results down to their minimum overlap, i.e. in the above case /a(nt|pe)/, and so on, recursively for each letter.
However, doing the above is pretty much like putting all your search strings into a 26-ary tree (26 characters, more if you also have digits). Push your strings onto the tree, using one level of depth per character.
You can do this with your search terms to make a hyper-fast "does this word match anything in my list of search terms?" check if the number of search terms is large.
You could theoretically do the reverse as well: pack your document into the tree and then run the search terms against it, if your document is static and the search terms change a lot.
It depends on how much optimization you need...
Answered by gillyb
Are the search terms single words that you are looking for, or can they be full sentences too?
If it's only words, then I would suggest building a Red-Black Tree from all the words, and then searching for each word in the tree.
If they could be sentences, then it could get a lot more complex... (?)