C#: How can I measure the similarity between two strings?

Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/1034622/

Time: 2020-08-06 06:38:11  Source: igfitidea

How can I measure the similarity between 2 strings?

Tags: c#, string, comparison, phonetics

Asked by Zanoni

Given two strings, text1 and text2:

public SOMEUSABLERETURNTYPE Compare(string text1, string text2)
{
     // DO SOMETHING HERE TO COMPARE
}

Examples:

  1. First String: StackOverflow

    Second String: StaqOverflow

    Return: Similarity is 91%

    The return can be in % or something like that.

  2. First String: The simple text test

    Second String: The complex text test

    Return: The values can be considered equal

Any ideas? What is the best way to do this?

Accepted answer by Jon Skeet

There are various ways of doing this. Have a look at the Wikipedia "String similarity measures" page for links to other pages with algorithms.

I don't think any of those algorithms take sounds into consideration, however - so "staq overflow" would be as similar to "stack overflow" as "staw overflow" is, despite the first being more similar in terms of pronunciation.

I've just found another page which gives rather more options... in particular, the Soundex algorithm (Wikipedia) may be closer to what you're after.

Answer by LiraNuna

Levenshtein distance is probably what you're looking for.
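
As a rough sketch of how Levenshtein distance could be turned into a percentage like the one asked for, the code below computes the distance with the usual two-row dynamic-programming table and then divides by the length of the longer string. The class name StringSimilarity, both method names, and the normalization itself are assumptions made for illustration; none of them come from this thread.

```csharp
using System;

public static class StringSimilarity
{
    // Minimum number of single-character insertions, deletions, and
    // substitutions needed to turn a into b (case-sensitive).
    public static int LevenshteinDistance(string a, string b)
    {
        if (a == null) throw new ArgumentNullException(nameof(a));
        if (b == null) throw new ArgumentNullException(nameof(b));

        // previous[j]: distance between the prefix of a processed so far
        // (one char shorter than the current one) and the first j chars of b.
        int[] previous = new int[b.Length + 1];
        int[] current = new int[b.Length + 1];

        // Turning the empty string into the first j chars of b takes j insertions.
        for (int j = 0; j <= b.Length; j++)
            previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            // Turning the first i chars of a into the empty string takes i deletions.
            current[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int substitutionCost = a[i - 1] == b[j - 1] ? 0 : 1;
                int insertion = current[j - 1] + 1;
                int deletion = previous[j] + 1;
                int substitution = previous[j - 1] + substitutionCost;
                current[j] = Math.Min(Math.Min(insertion, deletion), substitution);
            }

            // The row just filled becomes the "previous" row of the next pass.
            int[] swap = previous;
            previous = current;
            current = swap;
        }

        return previous[b.Length];
    }

    // Hypothetical normalization: 1.0 means identical, 0.0 means every
    // position of the longer string had to be edited.
    public static double Similarity(string a, string b)
    {
        int distance = LevenshteinDistance(a, b); // also validates the arguments
        int longest = Math.Max(a.Length, b.Length);
        return longest == 0 ? 1.0 : 1.0 - (double)distance / longest;
    }
}
```

With this normalization, Similarity("StackOverflow", "StaqOverflow") works out to 11/13, or roughly 85%, because the distance is 2 (one substitution and one deletion). The comparison is case-sensitive and works on UTF-16 char values, so you may want to normalize case first.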

Answer by schnaader

You might look for string "distances", for example the Levenshtein distance.

Answer by John Sheehan

Jeff Atwood wrote about looking for a similar solution for determining the authorship of wiki posts, which may help you narrow your search.

Answer by bdk

To deal with 'sound-alikes', you may want to look into encoding with a phonetic algorithm such as Double Metaphone or Soundex. I don't know whether computing Levenshtein distances on phonetically encoded strings would be beneficial, but it might be a possibility. Alternatively, you could use a heuristic like this: convert each word in the string to its encoded form, remove any words that occur in both strings, and replace them with a single representation before computing the Levenshtein distance.
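
The encoding step could look roughly like the following. This is a minimal sketch of classic American Soundex (first letter plus three digits), assuming plain A-Z input; any other character is simply skipped. The names Phonetic, Soundex, and EncodeWords are hypothetical, and Double Metaphone, which is considerably more involved, is not shown.

```csharp
using System;
using System.Text;

public static class Phonetic
{
    // Soundex digit for each letter A..Z; '0' marks letters that get no digit
    // (the vowels plus H, W, and Y).
    //                            ABCDEFGHIJKLMNOPQRSTUVWXYZ
    private const string Codes = "01230120022455012623010202";

    // Classic American Soundex: first letter followed by three digits.
    public static string Soundex(string word)
    {
        if (word == null) throw new ArgumentNullException(nameof(word));

        var result = new StringBuilder(4);
        char lastCode = '0';

        foreach (char raw in word)
        {
            char c = char.ToUpperInvariant(raw);
            if (c < 'A' || c > 'Z') continue; // assumption: skip anything outside A-Z

            char code = Codes[c - 'A'];
            if (result.Length == 0)
            {
                // The first letter is kept as-is, but its digit still counts
                // when collapsing repeats (e.g. "Pfister" -> P236).
                result.Append(c);
                lastCode = code;
                continue;
            }

            // Emit a digit only if the letter is coded and differs from the last code.
            if (code != '0' && code != lastCode)
                result.Append(code);

            // H and W do not separate equal codes; vowels (and Y) do.
            if (c != 'H' && c != 'W')
                lastCode = code;

            if (result.Length == 4) break;
        }

        // Pad short codes with zeros; an input with no letters yields "".
        return result.Length == 0 ? string.Empty : result.ToString().PadRight(4, '0');
    }

    // Encodes each whitespace-separated word, so whole phrases can be compared
    // either directly or with an edit distance on the encoded form.
    public static string EncodeWords(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));

        string[] words = text.Split(
            new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", Array.ConvertAll<string, string>(words, Soundex));
    }
}
```

Under this sketch, "stack" and "staq" both encode to S320 while "staw" encodes to S300, so EncodeWords gives the same result for "stack overflow" and "staq overflow" but a different one for "staw overflow".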

Answer by Sinan Ünür

The Perl module Text::Phonetic has implementations of various algorithms.

Answer by Rob

If you're comparing values in a SQL database, you can use the SOUNDEX function. If you search Google for SOUNDEX and C#, you'll find that some people have written a similar function for C# and VB.

Answer by Rob

I have to recommend Soundex too; I have used it in the past to process misspelt city names. Here is a good link for usage: http://whitepapers.zdnet.com/abstract.aspx?docid=352953

Answer by anelson

I wrote a Double Metaphone implementation in C# a while back. You'll find it vastly superior to Soundex and the like.

Levenshtein distance has also been suggested, and it's a great algorithm for a lot of uses, but phonetic matching is not really what it does; it only seems that way sometimes because phonetically similar words are also usually spelled similarly. I did an analysis of various fuzzy matching algorithms, which you might also find useful.

Answer by Jonathan Wood

If you want to compare phonetically, check out the Soundex and Metaphone algorithms: http://www.blackbeltcoder.com/Articles/algorithms/phonetic-string-comparison-with-soundex
