Note: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original: http://stackoverflow.com/questions/49263/
Approximate string matching algorithms
Asked by Paulius
Here at work, we often need to find the string in a list that most closely matches some other input string. Currently we use the Needleman-Wunsch algorithm. It often returns a lot of false positives (if we set the minimum score too low), sometimes fails to find a match when it should (when the minimum score is too high), and most of the time we need to check the results by hand. We thought we should try some alternatives.
Do you have any experience with such algorithms? Do you know how they compare to one another?
I'd really appreciate some advice.
PS: We're coding in C#, but that shouldn't matter - I'm asking about the algorithms in general.
Oh, I'm sorry, I forgot to mention that.
No, we're not using it to match duplicate data. We have a list of strings that we are looking for - we call it the search list. We then process texts from various sources (RSS feeds, web sites, forums, etc.), extract parts of those texts (there is an entire set of rules for that, but it's irrelevant here), and match those parts against the search list. If a string matches one of the strings in the search list, we do some further processing of it (which is also irrelevant here).
We cannot perform a normal comparison because the strings extracted from outside sources usually include some extra words, etc.
Anyway, it's not for duplicate detection.
Accepted answer by Thomas Kammeyer
OK, Needleman-Wunsch (NW) is a classic end-to-end ("global") aligner from the bioinformatics literature. It was long available as "align" and "align0" in the FASTA package. The difference was that the "0" version wasn't as biased about avoiding end gaps, which often made it easier to favor high-quality internal matches. Smith-Waterman, as I suspect you're aware, is a local aligner and is the original basis of BLAST. FASTA had its own, slightly different local aligner as well. All of these are essentially heuristic methods for estimating Levenshtein distance relative to a scoring metric for individual character pairs (in bioinformatics, often given by Dayhoff/"PAM", Henikoff & Henikoff, or other matrices, usually replaced with something simpler and more reasonably reflective of replacements in linguistic word morphology when applied to natural language).
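As a rough illustration of what such a global aligner computes, here is a minimal Needleman-Wunsch scoring sketch in Python (the question is language-agnostic; the match/mismatch/gap values are arbitrary placeholders, not the scores any of the tools above use):

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global (end-to-end) alignment score via dynamic programming."""
    m, n = len(a), len(b)
    # dp[i][j] = best score aligning a[:i] against b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * gap                         # a[:i] aligned against nothing
    for j in range(1, n + 1):
        dp[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + s,   # match or substitution
                           dp[i - 1][j] + gap,     # gap in b
                           dp[i][j - 1] + gap)     # gap in a
    return dp[m][n]
```

Tuning the gap penalty at the boundary rows/columns is exactly where an "align0"-style variant would differ: it charges less (or nothing) for gaps at the ends.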
Let's not be precious about labels: Levenshtein distance, as referenced in practice at least, is basically edit distance, and you have to estimate it because it's not feasible to compute it in general, and it's expensive to compute exactly even in interesting special cases: the water gets deep quickly there, and thus we have heuristic methods of long and good repute.
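For reference, the exact edit distance between two concrete strings is computable with the classic dynamic program, quadratic in the string lengths, which is precisely why heuristics matter at scale. A minimal Python sketch:

```python
def levenshtein(a, b):
    """Exact edit distance: insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a                          # keep the rolling row short
    prev = list(range(len(b) + 1))           # distances from the empty prefix
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]
```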
Now as to your own problem: several years ago, I had to check the accuracy of short DNA reads against reference sequence known to be correct, and I came up with something I called "anchored alignments".
The idea is to take your reference string set and "digest" it by finding all locations where a given N-character substring occurs. Choose N so that the table you build is not too big, but also so that substrings of length N are not too common. For small alphabets like DNA bases, it's possible to come up with a perfect hash on strings of N characters, make a table, and chain the matches in a linked list from each bin. The list entries must identify the sequence and start position of the substring that maps to the bin in whose list they occur. These are "anchors" in the list of strings to be searched at which an NW alignment is likely to be useful.
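A minimal sketch of the digest step (using a plain dict rather than a perfect hash; the names are mine, not from any particular tool):

```python
from collections import defaultdict

def build_anchor_index(references, n):
    """Map every length-n substring of each reference string to the
    (reference id, start position) pairs where it occurs."""
    index = defaultdict(list)
    for ref_id, ref in enumerate(references):
        for pos in range(len(ref) - n + 1):
            index[ref[pos:pos + n]].append((ref_id, pos))
    return index
```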
When processing a query string, you take the N characters starting at some offset K in the query string, hash them, look up their bin, and if the list for that bin is nonempty then you go through all the list records and perform alignments between the query string and the search string referenced in the record. When doing these alignments, you line up the query string and the search string at the anchor and extract a substring of the search string that is the same length as the query string and which contains that anchor at the same offset, K.
If you choose a long enough anchor length N and a reasonable set of offset values K (they can be spread across the query string or restricted to low offsets), you should get a subset of possible alignments and will often get clearer winners. Typically you will want to use the less end-biased, align0-like NW aligner.
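The query-time half of the scheme might look like the following sketch (hypothetical helper names; the final scoring call to an NW aligner is left out, and windows that would start before the search string are simply clipped):

```python
def candidate_windows(query, index, references, n, offsets):
    """Yield (ref_id, window) pairs worth aligning against the query:
    each window lines the shared n-gram anchor up at the same offset k."""
    seen = set()
    for k in offsets:
        gram = query[k:k + n]
        if len(gram) < n:
            continue                         # offset too close to the end
        for ref_id, pos in index.get(gram, []):
            start = max(pos - k, 0)          # put the anchor at offset k
            window = references[ref_id][start:start + len(query)]
            if (ref_id, window) not in seen:
                seen.add((ref_id, window))
                yield ref_id, window
```

Each yielded window would then be scored against the query with the NW aligner, keeping only pairs above your threshold.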
This method tries to boost NW a bit by restricting its input, and this yields a performance gain because you do fewer alignments and they are more often between similar sequences. Another good thing to do with your NW aligner is to allow it to give up after some amount or length of gapping occurs, to cut costs, especially if you know you're not going to see or be interested in middling-quality matches.
Finally, this method was used on a system with small alphabets, with K restricted to the first 100 or so positions in the query string and with search strings much larger than the queries (the DNA reads were around 1000 bases and the search strings were on the order of 10000, so I was looking for approximate substring matches justified by an estimate of edit distance specifically). Adapting this methodology to natural language will require some careful thought: you lose on alphabet size, but you gain if your query strings and search strings are of similar length.
Either way, allowing more than one anchor from different ends of the query string to be used simultaneously might help further filter the data fed to NW. If you do this, be prepared to possibly send overlapping strings, each containing one of the two anchors, to the aligner and then reconcile the alignments... or possibly further modify NW to emphasize keeping your anchors mostly intact during an alignment, using penalty modification during the algorithm's execution.
Hope this is helpful or at least interesting.
Answered by Grey Panther
Related to the Levenshtein distance: you might wish to normalize it by dividing the result by the length of the longer string, so that you always get a number between 0 and 1 and so that you can compare the distances of pairs of strings in a meaningful way (the expression L(A, B) > L(A, C), for example, is meaningless unless you normalize the distance).
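A sketch of that normalization, with a compact edit-distance computation inlined so the snippet stands alone:

```python
def normalized_levenshtein(a, b):
    """Edit distance scaled into [0, 1] by the longer string's length."""
    if not a and not b:
        return 0.0                           # two empty strings: distance 0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1] / max(len(a), len(b))
```

With this, comparisons like normalized_levenshtein(A, B) > normalized_levenshtein(A, C) are meaningful regardless of the strings' lengths.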
Answered by Biri
We are using the Levenshtein distance method to check for duplicate customers in our database. It works quite well.
Answered by Yuval F
Alternative algorithms to look at are agrep (Wikipedia entry on agrep) and the FASTA and BLAST biological sequence matching algorithms. These are special cases of approximate string matching, also covered in the Stony Brook algorithm repository. If you can specify the ways the strings differ from each other, you could probably focus on a tailored algorithm. For example, aspell uses some variant of "sounds-like" (soundex-metaphone) distance in combination with a "keyboard" distance to accommodate bad spellers and bad typists alike.
Answered by alex
Answered by R. Shilling
In order to minimize mismatches due to slight variations or errors in spelling, I've used the Metaphone algorithm, then Levenshtein distance (scaled to 0-100 as a percentage match) on the Metaphone encodings, for a measure of closeness. That seems to have worked fairly well.
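Metaphone itself has a long rule table, so as a hedged sketch the simpler Soundex code stands in for the phonetic step below; the percentage scaling works the same way on Metaphone output (with variable-length Metaphone codes you would use Levenshtein distance instead of the Hamming distance that fixed-length Soundex codes allow):

```python
def soundex(word):
    """Classic 4-character Soundex code (a simpler stand-in for Metaphone)."""
    digits = {c: d for group, d in
              [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
               ("l", "4"), ("mn", "5"), ("r", "6")]
              for c in group}
    word = word.lower()
    code = word[0].upper()
    prev = digits.get(word[0], "")
    for ch in word[1:]:
        d = digits.get(ch, "")
        if d and d != prev:
            code += d
        if ch not in "hw":          # h and w do not separate double codes
            prev = d
    return (code + "000")[:4]

def phonetic_similarity(a, b):
    """0-100 match percentage over the phonetic codes of two words."""
    ca, cb = soundex(a), soundex(b)
    dist = sum(x != y for x, y in zip(ca, cb))   # codes are fixed length 4
    return 100 * (1 - dist / 4)
```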
Answered by mortonjt
To expand on Cd-MaN's answer, it sounds like you're facing a normalization problem. It isn't obvious how to handle scores across alignments of varying lengths.
Given what you are interested in, you may want to obtain p-values for your alignments. If you are using Needleman-Wunsch, you can obtain these p-values using Karlin-Altschul statistics: http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
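The Karlin-Altschul E-value for an alignment score S has the form E = K·m·n·e^(−λS). A sketch, with K and λ as illustrative placeholders (in practice they are fitted to the scoring system, as the linked tutorial describes):

```python
import math

def karlin_altschul_evalue(score, m, n, K=0.13, lam=0.32):
    """Expected number of chance alignments scoring >= score, for query
    length m and database length n. K and lam here are placeholders, not
    values fitted to any real scoring matrix."""
    return K * m * n * math.exp(-lam * score)
```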
BLAST can perform local alignments and evaluate them using these statistics. If you are concerned about speed, this would be a good tool to use.
Another option is to use HMMER. HMMER uses profile hidden Markov models to align sequences. Personally, I think this is a more powerful approach, since it also provides positional information. http://hmmer.janelia.org/