Original source: http://stackoverflow.com/questions/473522/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Word comparison algorithm
Asked by disc0dancer
I am building a CSV import tool for the project I'm working on. The client needs to be able to enter the data in Excel, export it as CSV and upload it to the database. For example, I have this CSV record:
1, John Doe, ACME Comapny (the typo is on purpose)
Of course, the companies are kept in a separate table and linked with a foreign key, so I need to find the correct company ID before inserting. I plan to do this by comparing the company names in the database with the company names in the CSV. The comparison should return 0 if the strings are exactly the same, and a value that grows as the strings become more different, but strcmp doesn't cut it here because:
"Acme Company" and "Acme Comapny" should have a very small difference index, but "Acme Company" and "Cmea Mpnyaco" should have a very big one. "Acme Company" and "Acme Comp." should also have a small difference index, even though the character count is different. Also, "Acme Company" and "Company Acme" should return 0.
So if the client makes a typo while entering data, I could prompt him to choose the name he most probably wanted to insert.
Is there a known algorithm to do this, or maybe we can invent one :) ?
Answer by MattK
You might want to check out the Levenshtein Distance algorithm as a starting point. It will rate the "distance" between two words.
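As a rough illustration of that "distance", here is a minimal sketch that runs PHP's built-in levenshtein() over the pairs from the question. Choosing PHP here is an assumption based on the asker's later reply. The transposed-letter typo scores 2 (two substitutions), while the scrambled and reordered pairs score much higher, so plain edit distance alone does not give 0 for "Company Acme".

```php
<?php
// Pairs taken from the question; levenshtein() is PHP's built-in edit distance.
$pairs = [
    ['Acme Company', 'Acme Comapny'],  // two letters transposed
    ['Acme Company', 'Cmea Mpnyaco'],  // scrambled
    ['Acme Company', 'Company Acme'],  // same words, different order
];

foreach ($pairs as [$a, $b]) {
    // Print each pair with its edit distance.
    printf("%-14s vs %-14s => %d\n", $a, $b, levenshtein($a, $b));
}
```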
This SO thread on implementing a Google-style "Do you mean...?" system may provide some ideas as well.
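A "Do you mean...?" prompt could be sketched roughly as follows: score every known company name against the typed one and offer the closest few. The function name suggestCompanies, the sample company list and the case-insensitive comparison are all assumptions made for illustration; they do not come from the linked thread.

```php
<?php
// Hypothetical helper: rank known company names by edit distance to the input.
function suggestCompanies(string $input, array $known, int $limit = 3): array
{
    $scores = [];
    foreach ($known as $name) {
        // Compare case-insensitively; a score of 0 means an exact match.
        $scores[$name] = levenshtein(strtolower($input), strtolower($name));
    }
    asort($scores); // smallest distance first, keys (names) preserved
    return array_slice($scores, 0, $limit, true);
}

// Made-up sample data.
$known = ['Acme Company', 'Apex Compounds', 'Zenith Holdings'];
print_r(suggestCompanies('ACME Comapny', $known));
```

The result maps each suggested name to its distance, so the caller can auto-accept a distance of 0 and only prompt the user otherwise.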
Answer by Phantom Watson
I don't know what language you're coding in, but if it's PHP, you should consider the following algorithms:
levenshtein(): Returns the minimal number of characters you have to replace, insert or delete to transform one string into another.
soundex(): Returns the four-character soundex key of a word, which should be the same as the key for any similar-sounding word.
metaphone(): Similar to soundex, and possibly more effective for you. It's more accurate than soundex() as it knows the basic rules of English pronunciation. The metaphone generated keys are of variable length.
similar_text(): Similar to levenshtein(), but it counts matching characters and can also return the similarity as a percentage (through its optional third argument, which is passed by reference).
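To get a feel for how these four functions behave, here is a small sketch that applies each of them to one correctly spelled word and one typo. The sample words are just the ones from the question; nothing here is tuned for real data.

```php
<?php
$a = 'Company';
$b = 'Comapny'; // typo from the question

// Edit distance: number of single-character insertions, deletions, replacements.
echo 'levenshtein: ', levenshtein($a, $b), "\n";

// Phonetic keys: similar-sounding words should map to the same key.
echo 'soundex:     ', soundex($a), ' / ', soundex($b), "\n";
echo 'metaphone:   ', metaphone($a), ' / ', metaphone($b), "\n";

// similar_text() returns the number of matching characters and
// writes the percentage into its third argument (by reference).
$matched = similar_text($a, $b, $percent);
printf("similar_text: %d matching characters, %.1f%%\n", $matched, $percent);
```

For a typo check on a single word, comparing the soundex() or metaphone() keys for equality is a cheap first filter, and levenshtein() or similar_text() can then rank the remaining candidates.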
Answer by Rafał Dowgird
I have actually implemented a similar system. I used the Levenshtein distance (as other posters already suggested), with some modifications. The problem with unmodified edit distance (applied to whole strings) is that it is sensitive to word reordering, so "Acme Digital Incorporated World Company" will match poorly against "Digital Incorporated World Company Acme" and such reorderings were quite common in my data.
I modified it so that if the edit distance of whole strings was too big, the algorithm fell back to matching words against each other to find a good word-to-word match (quadratic cost, but there was a cutoff if there were too many words, so it worked OK).
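The fallback described above might look roughly like the sketch below. This is not the answerer's actual code: the function name nameDistance, the threshold of 3, the word-count cutoff of 8 and the greedy best-match strategy are all assumptions chosen to keep the example short.

```php
<?php
// Hypothetical sketch: whole-string edit distance first, word-to-word fallback.
function nameDistance(string $a, string $b, int $threshold = 3, int $maxWords = 8): int
{
    $whole = levenshtein($a, $b);
    if ($whole <= $threshold) {
        return $whole; // close enough, no fallback needed
    }

    $wordsA = preg_split('/\s+/', trim($a), -1, PREG_SPLIT_NO_EMPTY);
    $wordsB = preg_split('/\s+/', trim($b), -1, PREG_SPLIT_NO_EMPTY);

    // Cutoff: the word matching below is quadratic, so skip it for long names.
    if (count($wordsA) > $maxWords || count($wordsB) > $maxWords) {
        return $whole;
    }

    $total = 0;
    foreach ($wordsA as $wordA) {
        if (!$wordsB) {
            $total += strlen($wordA); // nothing left to match against
            continue;
        }
        $bestKey = null;
        $best = PHP_INT_MAX;
        foreach ($wordsB as $key => $wordB) {
            $d = levenshtein($wordA, $wordB);
            if ($d < $best) {
                $best = $d;
                $bestKey = $key;
            }
        }
        $total += $best;
        unset($wordsB[$bestKey]); // each word may be matched only once
    }
    foreach ($wordsB as $leftover) {
        $total += strlen($leftover); // unmatched words count in full
    }

    return min($whole, $total);
}

echo nameDistance(
    'Acme Digital Incorporated World Company',
    'Digital Incorporated World Company Acme'
), "\n";
```

Greedy matching is not guaranteed to find the best overall pairing of words, but it keeps the cost quadratic, which is in line with the trade-off the answer describes.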
Answer by Neil Aitken
I've had some success with the Levenshtein Distance algorithm; there is also Soundex.
What language are you implementing this in? We may be able to point to specific examples.
Answer by plinth
I've taken SoundEx, Levenshtein, PHP similarity, and double metaphone and packaged them up in C# in one set of extension methods on String.
Answer by Loki
There are multiple algorithms to do just that, and most databases even include one by default. It is actually quite a common concern.
If it's just about English words, SQL Server, for example, includes SOUNDEX, which can be used to compare words by how they sound.
http://msdn.microsoft.com/en-us/library/aa259235%28SQL.80%29.aspx
Answer by disc0dancer
I'm implementing it in PHP, and I am now writing a piece of code that will break up two strings into words and compare each word from the first string with the words of the second string using levenshtein, accepting the lowest possible values. I'll post it when I'm done.
Thanks a lot.
Update: Here's what I've come up with:
function myLevenshtein( $str1, $str2 )
{
    // prepare the words: trim, collapse runs of whitespace, split on spaces
    $words1 = explode( " ", preg_replace( "/\s+/", " ", trim($str1) ) );
    $words2 = explode( " ", preg_replace( "/\s+/", " ", trim($str2) ) );

    $found = array(); // array that keeps the best matched words so we don't check them again
    $score = 0; // total score

    // In my case, strings that have a different number of words can be good matches too
    // For example, Acme Company and International Acme Company Ltd. are the same thing
    // I will just add the word count difference to the total score, and weigh it more later if needed
    // abs() keeps the penalty positive regardless of which string is longer
    $wordDiff = abs( count( $words1 ) - count( $words2 ) );

    foreach( $words1 as $word1 )
    {
        $minlevWord = null;
        $minlev = PHP_INT_MAX;

        foreach( $words2 as $word2 )
        {
            // skip words that were already matched
            if( in_array( $word2, $found ) )
                continue;

            $lev = levenshtein( $word1, $word2 );
            if( $lev < $minlev )
            {
                $minlev = $lev;
                $minlevWord = $word2;
            }
        }

        // no unmatched word is left in the second string:
        // stop here, the word count difference covers the rest
        if( $minlevWord === null )
            break;

        $score += $minlev;
        array_push( $found, $minlevWord );
    }

    return $score + $wordDiff;
}