C# 如何计算给定2个字符串的距离相似度?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/9453731/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
How to calculate distance similarity measure of given 2 strings?
提问by MonsterMMORPG
I need to calculate the similarity between 2 strings. So what exactly do I mean? Let me explain with an example:
我需要计算 2 个字符串之间的相似度。那么我到底是什么意思呢?让我用一个例子来解释:
- The real word:
hospital - Mistaken word:
haspita
- 真话:
hospital - 错字:
haspita
Now my aim is to determine how many characters I need to modify the mistaken word to obtain the real word. In this example, I need to modify 2 letters. So what would be the percent? I take the length of the real word always. So it becomes 2 / 8 = 25% so these 2 given string DSM is 75%.
现在我的目标是确定我需要修改多少个字符才能得到真词。在这个例子中,我需要修改 2 个字母。那么百分比是多少呢?我总是取真实单词的长度。所以它变成 2 / 8 = 25% 所以这两个给定的字符串 DSM 是 75%。
How can I achieve this with performance being a key consideration?
如何在性能是关键考虑因素的情况下实现这一目标?
采纳答案by dasblinkenlight
What you are looking for is called edit distanceor Levenshtein distance. The wikipedia article explains how it is calculated, and has a nice piece of pseudocode at the bottom to help you code this algorithm in C# very easily.
您正在寻找的称为edit distance或Levenshtein distance。维基百科文章解释了它是如何计算的,并在底部有一段很好的伪代码,可以帮助您非常轻松地在 C# 中编写此算法。
Here's an implementation from the first site linked below:
这是下面链接的第一个站点的实现:
private static int CalcLevenshteinDistance(string a, string b)
{
if (String.IsNullOrEmpty(a) && String.IsNullOrEmpty(b)) {
return 0;
}
if (String.IsNullOrEmpty(a)) {
return b.Length;
}
if (String.IsNullOrEmpty(b)) {
return a.Length;
}
int lengthA = a.Length;
int lengthB = b.Length;
var distances = new int[lengthA + 1, lengthB + 1];
for (int i = 0; i <= lengthA; distances[i, 0] = i++);
for (int j = 0; j <= lengthB; distances[0, j] = j++);
for (int i = 1; i <= lengthA; i++)
for (int j = 1; j <= lengthB; j++)
{
int cost = b[j - 1] == a[i - 1] ? 0 : 1;
distances[i, j] = Math.Min
(
Math.Min(distances[i - 1, j] + 1, distances[i, j - 1] + 1),
distances[i - 1, j - 1] + cost
);
}
return distances[lengthA, lengthB];
}
回答by Anastasiosyal
There is a big number of string similarity distance algorithms that can be used. Some listed here (but not exhaustively listed are):
有大量的字符串相似距离算法可以使用。此处列出了一些(但未详尽列出):
- Levenstein
- Needleman Wunch
- Smith Waterman
- Smith Waterman Gotoh
- Jaro, Jaro Winkler
- Jaccard Similarity
- Euclidean Distance
- Dice Similarity
- Cosine Similarity
- Monge Elkan
A library that contains implementation to all of these is called SimMetricswhich has both java and c# implementations.
包含所有这些实现的库称为SimMetrics,它同时具有 java 和 c# 实现。
回答by Joshua Honig
I just addressed this exact same issue a few weeks ago. Since someone is asking now, I'll share the code. In my exhaustive tests my code is about 10x faster than the C# example on Wikipedia even when no maximum distance is supplied. When a maximum distance is supplied, this performance gain increases to 30x - 100x +. Note a couple key points for performance:
几周前我刚刚解决了这个完全相同的问题。既然有人问现在,我将分享代码。在我详尽的测试中,即使没有提供最大距离,我的代码也比维基百科上的 C# 示例快 10 倍左右。当提供最大距离时,此性能增益增加到 30x - 100x +。请注意性能的几个关键点:
- If you need to compare the same words over and over, first convert the words to arrays of integers. The Damerau-Levenshtein algorithm includes many >, <, == comparisons, and
intscompare much faster thanchars. - It includes a short-circuiting mechanism to quit if the distance exceeds a provided maximum
- Use a rotating set of three arrays rather than a massive matrix as in all the implementations I've see elsewhere
- Make sure your arrays slice accross the shorter word width.
- 如果您需要反复比较相同的单词,请先将单词转换为整数数组。Damerau-Levenshtein 算法包括许多 >、<、== 比较,并且
ints比chars. - 它包括一个短路机制,如果距离超过提供的最大值,则退出
- 使用一组旋转的三个数组,而不是像我在其他地方看到的所有实现一样的大型矩阵
- 确保您的数组切片跨越较短的字宽。
Code (it works the exact same if you replace int[]with Stringin the parameter declarations:
代码(如果在参数声明中替换int[]为String,它的工作方式完全相同:
/// <summary>
/// Computes the Damerau-Levenshtein Distance between two strings, represented as arrays of
/// integers, where each integer represents the code point of a character in the source string.
/// Includes an optional threshhold which can be used to indicate the maximum allowable distance.
/// </summary>
/// <param name="source">An array of the code points of the first string</param>
/// <param name="target">An array of the code points of the second string</param>
/// <param name="threshold">Maximum allowable distance</param>
/// <returns>Int.MaxValue if threshhold exceeded; otherwise the Damerau-Leveshteim distance between the strings</returns>
public static int DamerauLevenshteinDistance(int[] source, int[] target, int threshold) {
int length1 = source.Length;
int length2 = target.Length;
// Return trivial case - difference in string lengths exceeds threshhold
if (Math.Abs(length1 - length2) > threshold) { return int.MaxValue; }
// Ensure arrays [i] / length1 use shorter length
if (length1 > length2) {
Swap(ref target, ref source);
Swap(ref length1, ref length2);
}
int maxi = length1;
int maxj = length2;
int[] dCurrent = new int[maxi + 1];
int[] dMinus1 = new int[maxi + 1];
int[] dMinus2 = new int[maxi + 1];
int[] dSwap;
for (int i = 0; i <= maxi; i++) { dCurrent[i] = i; }
int jm1 = 0, im1 = 0, im2 = -1;
for (int j = 1; j <= maxj; j++) {
// Rotate
dSwap = dMinus2;
dMinus2 = dMinus1;
dMinus1 = dCurrent;
dCurrent = dSwap;
// Initialize
int minDistance = int.MaxValue;
dCurrent[0] = j;
im1 = 0;
im2 = -1;
for (int i = 1; i <= maxi; i++) {
int cost = source[im1] == target[jm1] ? 0 : 1;
int del = dCurrent[im1] + 1;
int ins = dMinus1[i] + 1;
int sub = dMinus1[im1] + cost;
//Fastest execution for min value of 3 integers
int min = (del > ins) ? (ins > sub ? sub : ins) : (del > sub ? sub : del);
if (i > 1 && j > 1 && source[im2] == target[jm1] && source[im1] == target[j - 2])
min = Math.Min(min, dMinus2[im2] + cost);
dCurrent[i] = min;
if (min < minDistance) { minDistance = min; }
im1++;
im2++;
}
jm1++;
if (minDistance > threshold) { return int.MaxValue; }
}
int result = dCurrent[maxi];
return (result > threshold) ? int.MaxValue : result;
}
Where Swapis:
在哪里Swap:
static void Swap<T>(ref T arg1,ref T arg2) {
T temp = arg1;
arg1 = arg2;
arg2 = temp;
}
回答by Gordon Linoff
Here is an alternative approach:
这是一种替代方法:
This is too long for a comment.
评论太长了。
A typical method for finding similarity is Levenshtein distance, and there is no doubt a library with code available.
寻找相似性的典型方法是 Levenshtein 距离,毫无疑问,有代码可用的库。
Unfortunately, this requires comparing to every string. You might be able to write a specialized version of the code to short-circuit the calculation if the distance is greater than some threshold, you would still have to do all the comparisons.
不幸的是,这需要比较每个字符串。如果距离大于某个阈值,您可能可以编写专门的代码版本来缩短计算过程,但您仍然需要进行所有比较。
Another idea is to use some variant of trigrams or n-grams. These are sequences of n characters (or n words or n genomic sequences or n whatever). Keep a mapping of trigrams to strings and choose the ones that have the biggest overlap. A typical choice of n is "3", hence the name.
另一个想法是使用trigrams或n-grams的一些变体。这些是 n 个字符的序列(或 n 个单词或 n 个基因组序列或 n 其他)。保留三元组到字符串的映射,并选择重叠最大的那些。n 的典型选择是“3”,因此得名。
For instance, English would have these trigrams:
例如,英语会有这些三元组:
- Eng
- ngl
- gli
- lis
- ish
- 英
- ngl
- 格利
- 李斯
- 是的
And England would have:
英格兰将有:
- Eng
- ngl
- gla
- lan
- and
- 英
- ngl
- 玻璃
- 兰
- 和
Well, 2 out of 7 (or 4 out of 10) match. If this works for you, and you can index the trigram/string table and get a faster search.
好吧,7 分之 2(或 10 分之 4)匹配。如果这对你有用,你可以索引三元组/字符串表并获得更快的搜索。
You can also combine this with Levenshtein to reduce the set of comparison to those that have some minimum number of n-grams in common.
您还可以将其与 Levenshtein 结合使用,以将比较集减少到那些具有一些最小数量的共同 n-gram 的比较集。
回答by Ivan Kochurkin
Here is my implementation of Damerau Levenshtein Distance, which returns not only similarity coefficient, but also returns error locations in corrected word (this feature can be used in text editors). Also my implementation supports different weights of errors (substitution, deletion, insertion, transposition).
这是我对 Damerau Levenshtein Distance 的实现,它不仅返回相似系数,还返回更正单词中的错误位置(此功能可以在文本编辑器中使用)。此外,我的实现支持不同权重的错误(替换、删除、插入、转置)。
public static List<Mistake> OptimalStringAlignmentDistance(
string word, string correctedWord,
bool transposition = true,
int substitutionCost = 1,
int insertionCost = 1,
int deletionCost = 1,
int transpositionCost = 1)
{
int w_length = word.Length;
int cw_length = correctedWord.Length;
var d = new KeyValuePair<int, CharMistakeType>[w_length + 1, cw_length + 1];
var result = new List<Mistake>(Math.Max(w_length, cw_length));
if (w_length == 0)
{
for (int i = 0; i < cw_length; i++)
result.Add(new Mistake(i, CharMistakeType.Insertion));
return result;
}
for (int i = 0; i <= w_length; i++)
d[i, 0] = new KeyValuePair<int, CharMistakeType>(i, CharMistakeType.None);
for (int j = 0; j <= cw_length; j++)
d[0, j] = new KeyValuePair<int, CharMistakeType>(j, CharMistakeType.None);
for (int i = 1; i <= w_length; i++)
{
for (int j = 1; j <= cw_length; j++)
{
bool equal = correctedWord[j - 1] == word[i - 1];
int delCost = d[i - 1, j].Key + deletionCost;
int insCost = d[i, j - 1].Key + insertionCost;
int subCost = d[i - 1, j - 1].Key;
if (!equal)
subCost += substitutionCost;
int transCost = int.MaxValue;
if (transposition && i > 1 && j > 1 && word[i - 1] == correctedWord[j - 2] && word[i - 2] == correctedWord[j - 1])
{
transCost = d[i - 2, j - 2].Key;
if (!equal)
transCost += transpositionCost;
}
int min = delCost;
CharMistakeType mistakeType = CharMistakeType.Deletion;
if (insCost < min)
{
min = insCost;
mistakeType = CharMistakeType.Insertion;
}
if (subCost < min)
{
min = subCost;
mistakeType = equal ? CharMistakeType.None : CharMistakeType.Substitution;
}
if (transCost < min)
{
min = transCost;
mistakeType = CharMistakeType.Transposition;
}
d[i, j] = new KeyValuePair<int, CharMistakeType>(min, mistakeType);
}
}
int w_ind = w_length;
int cw_ind = cw_length;
while (w_ind >= 0 && cw_ind >= 0)
{
switch (d[w_ind, cw_ind].Value)
{
case CharMistakeType.None:
w_ind--;
cw_ind--;
break;
case CharMistakeType.Substitution:
result.Add(new Mistake(cw_ind - 1, CharMistakeType.Substitution));
w_ind--;
cw_ind--;
break;
case CharMistakeType.Deletion:
result.Add(new Mistake(cw_ind, CharMistakeType.Deletion));
w_ind--;
break;
case CharMistakeType.Insertion:
result.Add(new Mistake(cw_ind - 1, CharMistakeType.Insertion));
cw_ind--;
break;
case CharMistakeType.Transposition:
result.Add(new Mistake(cw_ind - 2, CharMistakeType.Transposition));
w_ind -= 2;
cw_ind -= 2;
break;
}
}
if (d[w_length, cw_length].Key > result.Count)
{
int delMistakesCount = d[w_length, cw_length].Key - result.Count;
for (int i = 0; i < delMistakesCount; i++)
result.Add(new Mistake(0, CharMistakeType.Deletion));
}
result.Reverse();
return result;
}
public struct Mistake
{
public int Position;
public CharMistakeType Type;
public Mistake(int position, CharMistakeType type)
{
Position = position;
Type = type;
}
public override string ToString()
{
return Position + ", " + Type;
}
}
public enum CharMistakeType
{
None,
Substitution,
Insertion,
Deletion,
Transposition
}
This code is a part of my project: Yandex-Linguistics.NET.
这段代码是我项目的一部分:Yandex-Linguistics.NET。
I wrote some testsand it's seems to me that method is working.
我写了一些测试,在我看来这个方法是有效的。
But comments and remarks are welcome.
但欢迎评论和评论。
回答by GHC
Here's a VB.net implementation:
这是一个 VB.net 实现:
Public Shared Function LevenshteinDistance(ByVal v1 As String, ByVal v2 As String) As Integer
Dim cost(v1.Length, v2.Length) As Integer
If v1.Length = 0 Then
Return v2.Length 'if string 1 is empty, the number of edits will be the insertion of all characters in string 2
ElseIf v2.Length = 0 Then
Return v1.Length 'if string 2 is empty, the number of edits will be the insertion of all characters in string 1
Else
'setup the base costs for inserting the correct characters
For v1Count As Integer = 0 To v1.Length
cost(v1Count, 0) = v1Count
Next v1Count
For v2Count As Integer = 0 To v2.Length
cost(0, v2Count) = v2Count
Next v2Count
'now work out the cheapest route to having the correct characters
For v1Count As Integer = 1 To v1.Length
For v2Count As Integer = 1 To v2.Length
'the first min term is the cost of editing the character in place (which will be the cost-to-date or the cost-to-date + 1 (depending on whether a change is required)
'the second min term is the cost of inserting the correct character into string 1 (cost-to-date + 1),
'the third min term is the cost of inserting the correct character into string 2 (cost-to-date + 1) and
cost(v1Count, v2Count) = Math.Min(
cost(v1Count - 1, v2Count - 1) + If(v1.Chars(v1Count - 1) = v2.Chars(v2Count - 1), 0, 1),
Math.Min(
cost(v1Count - 1, v2Count) + 1,
cost(v1Count, v2Count - 1) + 1
)
)
Next v2Count
Next v1Count
'the final result is the cheapest cost to get the two strings to match, which is the bottom right cell in the matrix
'in the event of strings being equal, this will be the result of zipping diagonally down the matrix (which will be square as the strings are the same length)
Return cost(v1.Length, v2.Length)
End If
End Function
回答by joshweir
I have found that Levenshteinand Jaro Winklerare great for small differences betwen strings such as:
我发现Levenshtein和Jaro Winkler非常适合字符串之间的细微差异,例如:
- Spelling mistakes; or
- ?instead of oin a persons name.
- 拼写错误; 或者
- ? 而不是人名中的o。
However when comparing something like article titles where significant chunks of the text would be the same but with "noise" around the edges, Smith-Waterman-Gotohhas been fantastic:
然而,当比较文章标题之类的内容时,其中重要的文本块是相同的,但边缘有“噪音”,Smith-Waterman-Gotoh非常棒:
compare these 2 titles (that are the same but worded differently from different sources):
比较这两个标题(相同但不同来源的措辞不同):
An endonuclease from Escherichia coli that introduces single polynucleotide chain scissions in ultraviolet-irradiated DNA
Endonuclease III: An Endonuclease from Escherichia coli That Introduces Single Polynucleotide Chain Scissions in Ultraviolet-Irradiated DNA
一种来自大肠杆菌的核酸内切酶,可在紫外线照射的 DNA 中引入单多核苷酸链断裂
核酸内切酶 III:一种来自大肠杆菌的核酸内切酶,可在紫外线照射的 DNA 中引入单多核苷酸链断裂
This site that provides algorithm comparisonof the strings shows:
- Levenshtein: 81
- Smith-Waterman Gotoh 94
- Jaro Winkler 78
- 莱文斯坦:81
- 史密斯-沃特曼后藤 94
- 贾罗·温克勒 78
Jaro Winkler and Levenshtein are not as competent as Smith Waterman Gotoh in detecting the similarity. If we compare two titles that are not the same article, but have some matching text:
Jaro Winkler 和 Levenshtein 在检测相似性方面不如 Smith Waterman Gotoh。如果我们比较两个不是同一篇文章但有一些匹配文本的标题:
Fat metabolism in higher plants.The function of acyl thioesterases in the metabolism of acyl-coenzymesA and acyl-acyl carrier proteins
Fat metabolism in higher plants.The determination of acyl-acyl carrier proteinand acyl coenzymeA in a complex lipid mixture
高等植物的脂肪代谢。酰基硫酯酶在酰基辅酶A和酰基载体蛋白s代谢中的作用
高等植物的脂肪代谢。复杂脂质混合物中酰基-酰基载体蛋白和酰基辅酶A的测定
Jaro Winkler gives a false positive, but Smith Waterman Gotoh does not:
Jaro Winkler 给出了一个误报,但 Smith Waterman Gotoh 没有:
- Levenshtein: 54
- Smith-Waterman Gotoh 49
- Jaro Winkler 89
- 莱文斯坦:54
- 史密斯-沃特曼后藤 49
- 贾罗·温克勒 89
As Anastasiosyalpointed out, SimMetricshas the java code for these algorithms. I had success using the SmithWatermanGotoh java code from SimMetrics.
正如Anastasiosyal指出的那样,SimMetrics有这些算法的 Java 代码。我使用SimMetrics 的 SmithWatermanGotoh java 代码取得了成功。

