C#: How to measure the similarity between two strings?
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/1034622/
How can I measure the similarity between 2 strings?
Asked by Zanoni
Given two strings text1 and text2:
public SOMEUSABLERETURNTYPE Compare(string text1, string text2)
{
// DO SOMETHING HERE TO COMPARE
}
Examples:
First String: StackOverflow
Second String: StaqOverflow
Return: Similarity is 91%
The return can be in % or something like that.
First String: The simple text test
Second String: The complex text test
Return: The values can be considered equal
Any ideas? What is the best way to do this?
Accepted answer by Jon Skeet
There are various ways of doing this. Have a look at the Wikipedia "String similarity measures" page for links to other pages with algorithms.
I don't think any of those algorithms take sounds into consideration, however, so "staq overflow" would be rated as similar to "stack overflow" as "staw overflow" is, despite the first being closer in pronunciation.
I've just found another page which gives rather more options. In particular, the Soundex algorithm (Wikipedia) may be closer to what you're after.
Answer by LiraNuna
Levenshtein distance is probably what you're looking for.
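As a rough illustration of that suggestion, here is a minimal sketch of Levenshtein distance in C#, plus one possible way to turn it into a 0..1 score. The class and method names (StringSimilarity, LevenshteinDistance, Similarity) are made up for this example, and dividing by the longer string's length is only one assumed normalisation, not something the answers prescribe.

```csharp
using System;

// Hypothetical helper class; the names are illustrative only.
public static class StringSimilarity
{
    // Dynamic-programming Levenshtein distance: the minimum number of
    // single-character insertions, deletions and substitutions needed
    // to turn a into b. Only two rows of the table are kept in memory.
    public static int LevenshteinDistance(string a, string b)
    {
        if (a == null) throw new ArgumentNullException(nameof(a));
        if (b == null) throw new ArgumentNullException(nameof(b));

        var previous = new int[b.Length + 1];
        var current = new int[b.Length + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.Length; j++) previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1,   // insertion
                             previous[j] + 1),     // deletion
                    previous[j - 1] + cost);       // substitution or match
            }
            (previous, current) = (current, previous); // reuse the rows
        }
        return previous[b.Length];
    }

    // Assumed normalisation: 1.0 means identical, 0.0 means nothing in common.
    public static double Similarity(string a, string b)
    {
        int maxLength = Math.Max(a.Length, b.Length);
        if (maxLength == 0) return 1.0; // two empty strings
        return 1.0 - (double)LevenshteinDistance(a, b) / maxLength;
    }
}
```

With this normalisation, StringSimilarity.Similarity("StackOverflow", "StaqOverflow") comes out at about 0.85 (distance 2 over length 13), so it does not reproduce the 91% from the question, which presumably comes from a different measure. It also illustrates the point about pronunciation made in the accepted answer: "staq overflow" and "staw overflow" are both at distance 2 from "stack overflow".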
Answer by schnaader
You might look for string "distances", for example the Levenshtein distance.
Answer by John Sheehan
Jeff Atwood wrote about looking for a similar solution for determining the authorship of wiki posts, which may help you narrow your search.
Answer by bdk
To deal with sound-alikes, you may want to look into encoding the strings with a phonetic algorithm such as Double Metaphone or Soundex. I don't know whether computing Levenshtein distances on phonetically encoded strings would be beneficial, but it might be a possibility. Alternatively, you could use a heuristic like this: convert each word in the string to its encoded form, remove any words that occur in both strings and replace them with a single representation, then compute the Levenshtein distance.
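The encoding step of that idea might look roughly like the sketch below, which applies the commonly described American Soundex rules to each word of a phrase. The class and method names (Phonetic, Soundex, EncodePhrase) are hypothetical, and other Soundex variants (including some database implementations) treat H and W differently, so take this as one assumed variant rather than a reference implementation.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper class; the names are illustrative only.
public static class Phonetic
{
    // Soundex digit for each letter A..Z.
    // '0' marks vowels (and Y); '-' marks H and W, which are skipped.
    private const string Codes = "0123012-02245501262301-202";

    // Returns a four-character code (letter + three digits),
    // or an empty string if the word contains no ASCII letters.
    public static string Soundex(string word)
    {
        if (word == null) throw new ArgumentNullException(nameof(word));

        var result = new StringBuilder(4);
        char lastCode = '0';
        foreach (char raw in word)
        {
            char c = char.ToUpperInvariant(raw);
            if (c < 'A' || c > 'Z') continue;      // ignore non-letters

            char code = Codes[c - 'A'];
            if (result.Length == 0)
            {
                result.Append(c);                  // keep the first letter as-is
                lastCode = code;
            }
            else if (code == '-')
            {
                continue;                          // H/W do not separate equal codes
            }
            else if (code == '0')
            {
                lastCode = '0';                    // a vowel separates equal codes
            }
            else if (code != lastCode)
            {
                result.Append(code);
                lastCode = code;
                if (result.Length == 4) break;
            }
        }
        return result.Length == 0 ? string.Empty : result.ToString().PadRight(4, '0');
    }

    // Encodes every whitespace-separated word and joins the codes with spaces.
    public static string EncodePhrase(string text)
    {
        string[] words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", words.Select(w => Soundex(w)));
    }
}
```

Under these rules, Phonetic.EncodePhrase gives "S320 O161" for both "stack overflow" and "staq overflow", and "S300 O161" for "staw overflow", so the encoded forms separate the sound-alike from the plain typo. The encoded strings could then be passed to any Levenshtein implementation, as the answer suggests.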
Answer by Sinan Ünür
The Perl module Text::Phonetic has implementations of various algorithms.
Answer by Rob
If you're comparing values in a SQL database, you can use the SOUNDEX function. If you search Google for SOUNDEX and C#, you'll find that some people have written a similar function for C# and VB.
Answer by Rob
I have to recommend Soundex too; I have used it in the past to process misspelt city names. Here is a good link for usage: http://whitepapers.zdnet.com/abstract.aspx?docid=352953
Answer by anelson
I wrote a Double Metaphone implementation in C# a while back. You'll find it vastly superior to Soundex and the like.
Levenshtein distance has also been suggested, and it's a great algorithm for a lot of uses, but phonetic matching is not really what it does; it only seems that way sometimes because phonetically similar words are usually spelled similarly too. I did an analysis of various fuzzy matching algorithms which you might also find useful.
Answer by Jonathan Wood
If you want to compare phonetically, check out the Soundex and Metaphone algorithms: http://www.blackbeltcoder.com/Articles/algorithms/phonetic-string-comparison-with-soundex