Fuzzy search algorithm (approximate string matching algorithm)

Note: this page is a reproduction of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverflow

Original URL: http://stackoverflow.com/questions/32337135/
Asked by Yahya Uddin
I wish to create a fuzzy search algorithm. However, after hours of research I am really struggling.
I want to create an algorithm that performs a fuzzy search on a list of names of schools.
This is what I have looked at so far:
Most of my research keeps pointing to "string metrics" on Google and Stackoverflow, such as:
- Levenshtein distance
- Damerau-Levenshtein distance
- Needleman–Wunsch algorithm
However, this just gives a score of how similar 2 strings are. The only way I can think of implementing it as a search algorithm is to perform a linear search, executing the string metric algorithm for each string and returning the strings with scores above a certain threshold. (Originally I had my strings stored in a trie, but this obviously won't help me here!)
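For concreteness, here is a minimal sketch of that brute-force approach in Python, using the classic dynamic-programming Levenshtein distance (the school names are invented for illustration):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]

def fuzzy_scan(query: str, names: list[str], max_distance: int = 2) -> list[str]:
    """The naive linear search described above: one metric call per name."""
    return [n for n in names if levenshtein(query.lower(), n.lower()) <= max_distance]

schools = ["Smith College", "Smythe College", "Smote College"]  # invented sample data
print(fuzzy_scan("Smyth College", schools))
# all three match at threshold 2 -- which hints at the weakness
# discussed in the accepted answer below
```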
Although this is not such a bad idea for small lists, it would be problematic for lists with, let's say, 100,000 names, where the user performs many queries.
Another algorithm I looked at is the spell-checker method, where you just do a search for all potential misspellings. However, this is also highly inefficient, as it requires more than 75,000 words for a word of length 7 with an error count of just 2.
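To see where that blow-up comes from, here is a minimal sketch in the style of Peter Norvig's well-known spelling corrector: it enumerates every string one edit away from a word, and applying it twice yields the distance-2 candidate set the figure above refers to.

```python
import string

def edits1(word: str) -> set[str]:
    """All strings one deletion, transposition, replacement, or insertion away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

word = "college"  # length 7
one = edits1(word)
two = {e2 for e1 in one for e2 in edits1(e1)}
print(len(one), len(two))  # a few hundred at distance 1, tens of thousands at distance 2
```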
What I need:
Can someone please suggest a good, efficient fuzzy search algorithm, with:
- Name of the algorithm
- How it works or a link to how it works
- Pros and cons and when it's best used (optional)
I understand that all algorithms will have their pros and cons and there is no best algorithm.
Answered by Jim Mischel
Considering that you're trying to do a fuzzy search on a list of school names, I don't think you want to go for traditional string similarity like Levenshtein distance. My assumption is that you're taking a user's input (either keyboard input or spoken over the phone), and you want to quickly find the matching school.
Distance metrics tell you how similar two strings are based on substitutions, deletions, and insertions. But those algorithms don't really tell you anything about how similar the strings are as words in a human language.
Consider, for example, the words "smith," "smythe," and "smote". I can go from "smythe" to "smith" in two steps:
smythe -> smithe -> smith
And from "smote" to "smith" in two steps:
smote -> smite -> smith
So the two have the same distance as strings, but as words, they're significantly different. If somebody told you (spoken language) that he was looking for "Symthe College," you'd almost certainly say, "Oh, I think you mean Smith." But if somebody said "Smote College," you wouldn't have any idea what he was talking about.
What you need is a phonetic algorithm like Soundex or Metaphone. Basically, those algorithms break a word down into phonemes and create a representation of how the word is pronounced in spoken language. You can then compare the result against a known list of words to find a match.
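As a concrete illustration, here is a minimal sketch of the classic American Soundex code (in practice you would more likely use a library implementation such as the Python jellyfish package; Metaphone makes finer distinctions than this):

```python
SOUNDEX_MAP = {c: d for d, letters in {
    "1": "bfpv", "2": "cgjkqsxz", "3": "dt",
    "4": "l", "5": "mn", "6": "r",
}.items() for c in letters}

def soundex(word: str) -> str:
    """Classic American Soundex: first letter plus up to three digit codes."""
    word = word.lower()
    code = word[0].upper()
    prev = SOUNDEX_MAP.get(word[0], "")
    for ch in word[1:]:
        digit = SOUNDEX_MAP.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "hw":  # 'h' and 'w' do not reset the previous code
            prev = digit
    return (code + "000")[:4]

print(soundex("Smith"), soundex("Smythe"), soundex("Smote"))
# S530 S530 S530 -- Soundex is too coarse to separate "smote" here;
# Metaphone-style algorithms, which model sounds like "th", do better
```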
Such a system would be much faster than using a distance metric. Consider that with a distance metric, you need to compare the user's input with every word in your list to obtain the distance. That is computationally expensive, and the results, as I demonstrated with "smith" and "smote", can be laughably bad.
Using a phonetic algorithm, you create the phoneme representation of each of your known words and place it in a dictionary (a hash map or possibly a trie). That's a one-time startup cost. Then, whenever the user inputs a search term, you create the phoneme representation of his input and look it up in your dictionary. That is a lot faster and produces much better results.
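Continuing the sketch above, the one-time index build and the per-query lookup might look like this (reusing the soundex function from the previous sketch; the school list is invented):

```python
from collections import defaultdict

SCHOOLS = ["Smith College", "Smythe College", "Johnson Academy"]  # invented data

# One-time startup cost: index every known name by its phonetic code.
index: dict[str, list[str]] = defaultdict(list)
for name in SCHOOLS:
    index[" ".join(soundex(w) for w in name.split())].append(name)

def phonetic_lookup(query: str) -> list[str]:
    """A single hash lookup per query instead of a full linear scan."""
    return index.get(" ".join(soundex(w) for w in query.split()), [])

print(phonetic_lookup("Symthe College"))  # ['Smith College', 'Smythe College']
```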
Consider also that when people misspell proper names, they almost always get the first letter right, and more often than not, pronouncing the misspelling sounds like the actual word they were trying to spell. If that's the case, then the phonetic algorithms are definitely the way to go.
Answered by alfasin
You're confusing fuzzy search algorithms with implementation: a fuzzy search of a word may return 400 results of all the words that have a Levenshtein distance of, say, 2. But to the user you have to display only the top 5-10.
Implementation-wise, you'll pre-process all the words in the dictionary and save the results into a DB. The popular words (and their fuzzy-likes) will be saved into a cache layer, so you won't have to hit the DB for every request.
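A minimal sketch of that layering, with functools.lru_cache standing in for the cache layer and a hypothetical query_db function standing in for the precomputed table in the DB:

```python
from functools import lru_cache

FAKE_DB = {"smiht": ["smith", "smite"]}  # toy stand-in for precomputed fuzzy matches

def query_db(term: str) -> list[str]:
    # Hypothetical; in reality something like:
    #   SELECT candidate FROM fuzzy_matches WHERE term = ? ORDER BY rank LIMIT 10
    return FAKE_DB.get(term, [])

@lru_cache(maxsize=10_000)  # cache layer: popular terms never hit the DB twice
def fuzzy_lookup(term: str) -> list[str]:
    return query_db(term.lower())
```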
You may add an AI layer that adds the most common spelling mistakes to the DB. And so on.
Answered by Srekel
I wrote an article about how I implemented a fuzzy search:
https://medium.com/@Srekel/implementing-a-fuzzy-search-algorithm-for-the-debuginator-cacc349e6c55
The implementation is on GitHub and is in the public domain, so feel free to have a look.
https://github.com/Srekel/the-debuginator/blob/master/the_debuginator.h#L1856
The basics of it: split all strings you'll be searching into parts. So if you have paths, then "C:\documents\lol.txt" is maybe "C", "documents", "lol", "txt".
Ensure you lowercase these strings so the matching is case-insensitive. (Maybe only do it if the search string is all-lowercase.)
Then match your search string against this. In my case I want to match it regardless of order, so "loldoc" would still match the above path even though "lol" comes after "doc".
The matching needs to have some scoring to be good. The most important part I think is consecutive matching, so the more characters directly after one another that match, the better. So "doc" is better than "dcm".
Then you'll likely want to give extra score for a match that's at the start of a part. So you get more points for "doc" than "ocu".
In my case I also give more points for matching the end of a part.
And finally, you may want to consider giving extra points for matching the last part(s). This makes it so that matching the file name/ending scores higher than the folders leading up to it.
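A minimal sketch of those scoring heuristics follows. The bonus weights are invented for illustration (not the values the-debuginator actually uses), and it only scores the query against one part at a time; the real code also handles queries spanning parts and order-independent matching.

```python
def score_part(query: str, part: str, is_last: bool) -> int:
    """Greedy in-order character match of `query` inside a single `part`."""
    score, qi, prev_hit = 0, 0, -2
    for i, ch in enumerate(part):
        if qi < len(query) and ch == query[qi]:
            score += 1
            if i == prev_hit + 1:
                score += 2  # consecutive-match bonus
            if i == 0:
                score += 3  # start-of-part bonus
            if i == len(part) - 1:
                score += 1  # end-of-part bonus
            if is_last:
                score += 1  # last-part bonus (file name beats folders)
            prev_hit = i
            qi += 1
    return score if qi == len(query) else 0  # the whole query must be consumed

def fuzzy_score(query: str, parts: list[str]) -> int:
    """Best score of the query against any one part of the candidate."""
    query = query.lower()
    return max(score_part(query, p.lower(), i == len(parts) - 1)
               for i, p in enumerate(parts))

parts = ["C", "documents", "lol", "txt"]
print(fuzzy_score("doc", parts), fuzzy_score("dcm", parts), fuzzy_score("ocu", parts))
# 10 6 7 -- "doc" beats "dcm" (consecutive) and "ocu" (start-of-part)
```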
Answered by asiby
A simple algorithm for "a kind of fuzzy search"
To be honest, in some cases, fuzzy search is mostly useless and I think that a simpler algorithm can improve the search result while providing the feeling that we are still performing a fuzzy search.
Here is my use case: Filtering down a list of countries using "Fuzzy search".
The list I was working with had two countries starting with Z: Zambia and Zimbabwe.
I was using Fusejs.
In this case, when entering the needle "zam", the result set had 19 matches, with the most relevant one for any human (Zambia) at the bottom of the list. And most of the other countries in the result did not even have the letter z in their name.
This was for a mobile app where you can pick a country from a list. It was supposed to be much like when you have to pick a contact from the phone's contacts. You can filter the contact list by entering some term in the search box.
IMHO, this kind of limited content to search from should not be treated in a way that will have people asking "what the heck?!?".
One might suggest to sort by most relevant match. But that's out of the question in this case because the user will then always have to visually find the "Item of Interest" in the reduced list. Keep in mind that this is supposed to be a filtering tool, not a search engine "à la Google". So the result should be sorted in a predictable way. And before filtering, the sorting was alphabetical. So the filtered list should just be an alphabetically sorted subset of the original list.
So I came up with the following algorithm ...
- Grab the needle ... in this case: zam
- Insert the .* pattern at the beginning and end of the needle
- Insert the .* pattern between each letter of the needle
- Perform a regex search in the haystack using the new needle, which is now .*z.*a.*m.* (a code sketch follows below)
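A minimal sketch of that transformation (re.escape is added so a needle containing regex metacharacters can't break the pattern; the country list is abbreviated):

```python
import re

def fuzzy_filter(needle: str, haystack: list[str]) -> list[str]:
    """Keep entries whose characters contain the needle's letters in order.
    The result preserves the haystack's original (alphabetical) order."""
    pattern = re.compile(".*" + ".*".join(map(re.escape, needle.lower())) + ".*")
    return [item for item in haystack if pattern.search(item.lower())]

countries = ["Mozambique", "Tanzania", "Zambia", "Zimbabwe"]
print(fuzzy_filter("zam", countries))  # ['Mozambique', 'Zambia']
```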
In this case, the user will get a much more expected result: everything that somehow has the letters z, a, and m appearing in that order. All the letters in the needle will be present in the matches, in the same order.
This will also match country names like Mozambique ... which is perfect.
I just think that sometimes, we should not try to kill a fly with a bazooka.
Answered by Tristano Ajmone
Can someone please suggest a good, efficient fuzzy search algorithm, with:
In this repository I've collected a simple (but fast) algorithm to perform fuzzy-search the way it's done in editors like Sublime Text, VSCode, etc. (i.e. with just a few keystrokes you get filtered results of entries that match the typed chars in a fuzzy way):
The algorithm was written by Forrest Smith, and it's just called "fts_fuzzy_match". In the repository you'll find two variations of the same algorithm implemented in over 10 different languages.
The original article can be found here: