如何使用 PHP 对 MYSQL 中的公司名称进行模糊匹配以进行自动完成?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/369755/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
How do I do a fuzzy match of company names in MYSQL with PHP for auto-complete?
提问by AFG
My users will import through cut and paste a large string that will contain company names.
我的用户将通过剪切和粘贴导入包含公司名称的大字符串。
I have an existing and growing MYSQL database of companies names, each with a unique company_id.
我有一个现有的和不断增长的公司名称 MYSQL 数据库,每个数据库都有一个唯一的 company_id。
I want to be able to parse through the string and assign to each of the user-inputed company names a fuzzy match.
我希望能够解析字符串并为每个用户输入的公司名称分配一个模糊匹配。
Right now, just doing a straight-up string match, is also slow. ** Will Soundex indexing be faster? How can I give the user some options as they are typing? **
现在,仅进行直接的字符串匹配也很慢。** Soundex 索引会更快吗?我怎样才能在用户打字时给他们一些选项?**
For example, someone writes:
例如,有人写道:
Microsoft -> Microsoft Bare Essentials -> Bare Escentuals Polycom, Inc. -> Polycom
I have found the following threads that seem similar to this question, but the poster has not approved and I'm not sure if their use-case is applicable:
我发现以下线程似乎与此问题相似,但海报尚未批准,我不确定它们的用例是否适用:
How to find best fuzzy match for a string in a large string database
回答by Tomalak
You can start with using SOUNDEX()
, this will probably do for what you need (I picture an auto-suggestion box of already-existing alternatives for what the user is typing).
您可以从使用开始SOUNDEX()
,这可能会满足您的需要(我想象了一个自动建议框,其中包含用户正在输入的内容的现有替代方案)。
The drawbacks of SOUNDEX()
are:
的缺点SOUNDEX()
是:
- its inability to differentiate longer strings. Only the first few characters are taken into account, longer strings that diverge at the end generate the same SOUNDEX value
- the fact the the first letter must be the same or you won't find a match easily. SQL Server has DIFFERENCE() function to tell you how much two SOUNDEX values are apart, but I think MySQL has nothing of that kind built in.
- for MySQL, at least according to the docs, SOUNDEX is broken for unicode input
- 它无法区分更长的字符串。只考虑前几个字符,末尾发散的较长字符串生成相同的 SOUNDEX 值
- 第一个字母必须相同,否则您将无法轻松找到匹配项。SQL Server 有 DIFFERENCE() 函数来告诉你两个 SOUNDEX 值相差多少,但我认为 MySQL 没有内置这种类型。
- 对于 MySQL,至少根据文档,对于 unicode 输入,SOUNDEX 已损坏
Example:
例子:
SELECT SOUNDEX('Microsoft')
SELECT SOUNDEX('Microsift')
SELECT SOUNDEX('Microsift Corporation')
SELECT SOUNDEX('Microsift Subsidary')
/* all of these return 'M262' */
For more advanced needs, I think you need to look at the Levenshtein distance(also called "edit distance") of two strings and work with a threshold. This is the more complex (=slower) solution, but it allows for greater flexibility.
对于更高级的需求,我认为您需要查看两个字符串的Levenshtein 距离(也称为“编辑距离”)并使用阈值。这是更复杂(=更慢)的解决方案,但它允许更大的灵活性。
Main drawback is, that you need both strings to calculate the distance between them. With SOUNDEX you can store a pre-calculated SOUNDEX in your table and compare/sort/group/filter on that. With the Levenshtein distance, you might find that the difference between "Microsoft" and "Nzcrosoft" is only 2, but it will take a lot more time to come to that result.
主要缺点是,您需要两个字符串来计算它们之间的距离。使用 SOUNDEX,您可以在表中存储预先计算的 SOUNDEX,并对其进行比较/排序/分组/过滤。使用 Levenshtein 距离,您可能会发现“Microsoft”和“Nzcrosoft”之间的差异仅为 2,但要获得该结果需要更多时间。
In any case, an example Levenshtein distance function for MySQL can be found at codejanitor.com: Levenshtein Distance as a MySQL Stored Function (Feb. 10th, 2007).
在任何情况下,都可以在codejanitor.com上找到用于 MySQL 的 Levenshtein 距离函数示例:Levenshtein 距离作为 MySQL 存储函数(2007 年 2 月 10 日)。
回答by Cheese Daneish
SOUNDEX is an OK algorithm for this, but there have been recent advances on this topic. Another algorithm was created called the Metaphone, and it was later revised to a Double Metaphone algorithm. I have personally used the java apache commons implementation of double metaphone and it is customizable and accurate.
SOUNDEX 是一个不错的算法,但最近在这个主题上取得了进展。创建了另一种称为 Metaphone 的算法,后来修改为 Double Metaphone 算法。我个人使用了双元音的 java apache commons 实现,它是可定制且准确的。
They have implementations in lots of other languages on the wikipedia page for it, too. This question has been answered, but should you find any of the identified problems with the SOUNDEX appearing in your application, it's nice to know there are options. Sometimes it can generate the same code for two really different words. Double metaphone was created to help take care of that problem.
他们在维基百科页面上也有许多其他语言的实现。此问题已得到解答,但如果您发现任何已识别的 SOUNDEX 问题出现在您的应用程序中,很高兴知道有一些选项。有时它可以为两个完全不同的词生成相同的代码。双元音是为了帮助解决这个问题而创建的。
Stolen from wikipedia: http://en.wikipedia.org/wiki/Soundex
从维基百科被盗:http: //en.wikipedia.org/wiki/Soundex
As a response to deficiencies in the Soundex algorithm, Lawrence Philips developed the Metaphone algorithm for the same purpose. Philips later developed an improvement to Metaphone, which he called Double-Metaphone. Double-Metaphone includes a much larger encoding rule set than its predecessor, handles a subset of non-Latin characters, and returns a primary and a secondary encoding to account for different pronunciations of a single word in English.
作为对 Soundex 算法缺陷的回应,Lawrence Philips 出于同样的目的开发了 Metaphone 算法。飞利浦后来对 Metaphone 进行了改进,他将其称为 Double-Metaphone。Double-Metaphone 包含比其前身大得多的编码规则集,处理非拉丁字符的子集,并返回主要和次要编码以说明英语中单个单词的不同发音。
At the bottom of the double metaphone page, they have the implementations of it for all kinds of programming languages: http://en.wikipedia.org/wiki/Double-Metaphone
在双元音页面的底部,他们有各种编程语言的实现:http: //en.wikipedia.org/wiki/Double-Metaphone
Python & MySQL implementation: https://github.com/AtomBoy/double-metaphone
Python & MySQL 实现:https: //github.com/AtomBoy/double-metaphone
回答by Derren
Firstly, I would like to add that you should be very careful when using any form of Phonetic/Fuzzy Matching Algorithm, as this kind of logic is exactly that, Fuzzy or to put it more simply; potentially inaccurate. Especially true when used for matching company names.
首先,我想补充一点,在使用任何形式的语音/模糊匹配算法时都应该非常小心,因为这种逻辑正是这样,Fuzzy或者更简单地说;可能不准确。当用于匹配公司名称时尤其如此。
A good approach is to seek corroboration from other data, such as address information, postal codes, tel numbers, Geo Coordinates etc. This will help confirm the probability of your data being accurately matched.
一个好的方法是从其他数据中寻求佐证,例如地址信息、邮政编码、电话号码、地理坐标等。这将有助于确认您的数据被准确匹配的可能性。
There are a whole range of issues related to B2B Data Matching too many to be addressed here, I have written more about Company Name Matchingin my blog (also an updated article), but in summary the key issues are:
有很多与 B2B 数据匹配相关的问题在这里无法解决,我在我的博客中写了更多关于公司名称匹配的文章(也是一篇更新的文章),但总而言之,关键问题是:
- Looking at the whole string is unhelpful as the most important part of a Company Name is not necessarily at the beginning of the Company Name. i.e. ‘The Proctor and Gamble Company' or ‘United States Federal Reserve ‘
- Abbreviations are common place in Company Names i.e. HP, GM, GE, P&G, D&B etc..
- Some companies deliberately spell their names incorrectly as part of their branding and to differentiate themselves from other companies.
- 查看整个字符串是没有帮助的,因为公司名称最重要的部分不一定位于公司名称的开头。即“宝洁公司”或“美国联邦储备银行”
- 缩写在公司名称中很常见,例如 HP、GM、GE、P&G、D&B 等。
- 一些公司故意将其名称拼错作为其品牌的一部分,以将自己与其他公司区分开来。
Matching exact data is easy, but matching non-exact data can be much more time consuming and I would suggest that you should consider how you will be validating the non-exact matches to ensure these are of acceptable quality.
匹配精确数据很容易,但匹配非精确数据可能更耗时,我建议您应该考虑如何验证非精确匹配以确保这些数据具有可接受的质量。
Before we built Match2Lists.com, we used to spend an unhealthy amount of time validating fuzzy matches. In Match2Lists we incorporated a powerful Visualisation tool enabling us to review non-exact matches, this proved to be a real game changer in terms of match validation, reducing our costs and enabling us to deliver results much more quickly.
在我们构建 Match2Lists.com 之前,我们过去常常花费大量时间来验证模糊匹配。在 Match2Lists 中,我们整合了一个强大的可视化工具,使我们能够非精确匹配,事实证明,这在匹配验证方面是真正的游戏规则改变者,降低了我们的成本并使我们能够更快地交付结果。
Best of Luck!!
祝你好运!!
回答by dkretz
Here's a link to the php discussion of the soundex functionsin mysql and php. I'd start from there, then expand into your other not-so-well-defined requirements.
这是mysql 和 php 中soundex 函数的 php 讨论的链接。我会从那里开始,然后扩展到您的其他不太明确的要求。
Your reference references the Levenshtein methodology for matching. Two problems. 1. It's more appropriate for measuring the difference between two known words, not for searching. 2. It discusses a solution designed more to detect things like proofing errors (using "Levenshtien" for "Levenshtein") rather than spelling errors (where the user doesn't know how to spell, say "Levenshtein" and types in "Levinstein". I usually associate it with looking for a phrase in a book rather than a key value in a database.
您的参考文献引用了 Levenshtein 匹配方法。两个问题。1.更适合测量两个已知词的差异,而不是搜索。2. 它讨论了一种解决方案,旨在更多地检测校对错误(使用“Levenshtien”表示“Levenshtein”)而不是拼写错误(用户不知道如何拼写,说“Levenshtein”并输入“Levinstein”) . 我通常将它与在书中查找短语而不是在数据库中查找键值相关联。
EDIT: In response to comment--
编辑:回应评论——
- Can you at least get the users to put the company names into multiple text boxes; 2. or use an unambigous name delimiter (say backslash); 3. leave out articles ("The") and generic abbreviations (or you can filter for these); 4. Squoosh the spaces out and match for that also (so Micro Soft => microsoft, Bare Essentials => bareessentials); 5. Filter out punctuation; 6. Do "OR" searches on words ("bare" OR "essentials") - people will inevitably leave one or the other out sometimes.
- 你能不能至少让用户把公司名称放到多个文本框中;2. 或使用明确的名称分隔符(例如反斜杠);3. 省略冠词(“The”)和通用缩写(或者您可以过滤这些);4. 挤压空间并与之匹配(所以微软 => microsoft,Bare Essentials => bareessentials);5.过滤掉标点;6. 对单词进行“OR”搜索(“bare”或“essentials”)——人们有时不可避免地会遗漏一个或另一个。
Test like mad and use the feedback loop from users.
疯狂地测试并使用用户的反馈循环。
回答by dkretz
the best function for fuzzy matching is levenshtein. it's traditionally used by spell checkers, so that might be the way to go. there's a UDF for it available here: http://joshdrew.com/
模糊匹配的最佳函数是 levenshtein。它传统上由拼写检查器使用,所以这可能是要走的路。这里有一个 UDF 可用:http: //joshdrew.com/
the downside to using levenshtein is that it won't scale very well. a better idea might be to dump the whole table in to a spell checker custom dictionary file and do the suggestion from your application tier instead of the database tier.
使用 levenshtein 的缺点是它不会很好地扩展。更好的主意可能是将整个表转储到拼写检查器自定义字典文件中,并从应用程序层而不是数据库层执行建议。
回答by Rodney P. Barbati
This answer results in indexed lookup of almost any entity using input of 2 or 3 characters or more.
此答案会导致使用 2 或 3 个字符或更多字符的输入对几乎任何实体进行索引查找。
Basically, create a new table with 2 columns, word and key. Run a process on the original table containing the column to be fuzzy searched. This process will extract every individual word from the original column and write these words to the word table along with the original key. During this process, commonly occurring words like 'the','and', etc should be discarded.
基本上,创建一个包含 2 列、word 和 key 的新表。对包含要模糊搜索的列的原始表运行一个过程。此过程将从原始列中提取每个单独的单词,并将这些单词与原始键一起写入单词表。在此过程中,应丢弃诸如“the”、“and”等常见词。
We then create several indices on the word table, as follows...
然后我们在词表上创建几个索引,如下...
- A normal, lowercase index on word + key
- An index on the 2nd through 5th character + key
An index on the 3rd through 6th character + key
Alternately, create a SOUNDEX() index on the word column.
- word + key 上的普通小写索引
- 第 2 到第 5 个字符的索引 + 键
第 3 到第 6 个字符的索引 + 键
或者,在单词列上创建 SOUNDEX() 索引。
Once this is in place, we take any user input and search using normal word = input or LIKE input%. We never do a LIKE %input as we are always looking for a match on any of the first 3 characters, which are all indexed.
一旦这到位,我们接受任何用户输入并使用普通的 word = input 或 LIKE input% 进行搜索。我们从不做 LIKE %input ,因为我们总是在前 3 个字符中的任何一个上寻找匹配项,这些字符都已编入索引。
If your original table is massive, you could partition the word table by chunks of the alphabet to ensure the user's input is being narrowed down to candidate rows immediately.
如果您的原始表很大,您可以按字母表的块对词表进行分区,以确保立即将用户的输入范围缩小到候选行。