C# 比较字符串与容差

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/2344320/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-07 01:38:37  来源:igfitidea点击:

Comparing strings with tolerance

c#.netstring-comparisonsimilarity

提问by Oliver Hanappi

I'm looking for a way to compare a string with an array of strings. Doing an exact search is quite easy of course, but I want my program to tolerate spelling mistakes, missing parts of the string and so on.

我正在寻找一种将字符串与字符串数组进行比较的方法。进行精确搜索当然很容易,但我希望我的程序能够容忍拼写错误、字符串的缺失部分等。

Is there some kind of framework which can perform such a search? I'm having something in mind that the search algorithm will return a few results order by the percentage of match or something like this.

是否有某种框架可以执行这样的搜索?我有一些想法,搜索算法将按匹配百分比或类似的方式返回一些结果顺序。

采纳答案by RedFilter

You could use the Levenshtein Distance algorithm.

您可以使用Levenshtein 距离算法

"The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character."- Wikipedia.com

“两个字符串之间的 Levenshtein 距离定义为将一个字符串转换为另一个字符串所需的最少编辑次数,允许的编辑操作是插入、删除或替换单个字符。” - 维基百科

This one is from dotnetperls.com:

这是来自dotnetperls.com

using System;

/// <summary>
/// Contains approximate string matching
/// </summary>
static class LevenshteinDistance
{
    /// <summary>
    /// Compute the distance between two strings.
    /// </summary>
    public static int Compute(string s, string t)
    {
        int n = s.Length;
        int m = t.Length;
        int[,] d = new int[n + 1, m + 1];

        // Step 1
        if (n == 0)
        {
            return m;
        }

        if (m == 0)
        {
            return n;
        }

        // Step 2
        for (int i = 0; i <= n; d[i, 0] = i++)
        {
        }

        for (int j = 0; j <= m; d[0, j] = j++)
        {
        }

        // Step 3
        for (int i = 1; i <= n; i++)
        {
            //Step 4
            for (int j = 1; j <= m; j++)
            {
                // Step 5
                int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

                // Step 6
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        // Step 7
        return d[n, m];
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(LevenshteinDistance.Compute("aunt", "ant"));
        Console.WriteLine(LevenshteinDistance.Compute("Sam", "Samantha"));
        Console.WriteLine(LevenshteinDistance.Compute("flomax", "volmax"));
    }
}

You may in fact prefer to use the Damerau-Levenshtein distance algorithm, which also allows characters to be transposed, which is a common human error in data entry. You'll find a C# implementation of it here.

实际上,您可能更喜欢使用Damerau-Levenshtein 距离算法,该算法还允许转置字符,这是数据输入中常见的人为错误。您将在此处找到它的 C# 实现。

回答by Samuel Neff

Here are two methods that calculate the Levenshtein Distancebetween strings.

这里有两种计算字符串之间Levenshtein 距离的方法。

The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character.

两个字符串之间的编辑距离定义为将一个字符串转换为另一个字符串所需的最小编辑次数,允许的编辑操作是插入、删除或替换单个字符。

Once you have the result, you'll need to define what value you want to use as a threshold for a match or not. Run the function on a bunch of sample data to get a good idea of how it works to help decide on your particular threshold.

获得结果后,您需要定义要用作匹配或不匹配阈值的值。在一堆样本数据上运行该函数,以了解它如何工作以帮助确定您的特定阈值。

    /// <summary>
    /// Calculates the Levenshtein distance between two strings--the number of changes that need to be made for the first string to become the second.
    /// </summary>
    /// <param name="first">The first string, used as a source.</param>
    /// <param name="second">The second string, used as a target.</param>
    /// <returns>The number of changes that need to be made to convert the first string to the second.</returns>
    /// <remarks>
    /// From http://www.merriampark.com/ldcsharp.htm
    /// </remarks>
    public static int LevenshteinDistance(string first, string second)
    {
        if (first == null)
        {
            throw new ArgumentNullException("first");
        }
        if (second == null)
        {
            throw new ArgumentNullException("second");
        }

        int n = first.Length;
        int m = second.Length;
        var d = new int[n + 1, m + 1]; // matrix

        if (n == 0) return m;
        if (m == 0) return n;

        for (int i = 0; i <= n; d[i, 0] = i++)
        {
        }

        for (int j = 0; j <= m; d[0, j] = j++)
        {
        }

        for (int i = 1; i <= n; i++)
        {

            for (int j = 1; j <= m; j++)
            {
                int cost = (second.Substring(j - 1, 1) == first.Substring(i - 1, 1) ? 0 : 1); // cost
                d[i, j] = Math.Min(
                    Math.Min(
                        d[i - 1, j] + 1,
                        d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }

        return d[n, m];
    }

回答by casperOne

There is nothing in the .NET framework that will help you with this out-of-the-box.

.NET 框架中没有任何内容可以帮助您实现开箱即用。

The most common spelling mistakes are those where the letters are a decent phonetic representation of the word, but not the correct spelling of the word.

最常见的拼写错误是那些字母是单词的体面语音表示,但不是单词的正确拼写。

For example, it could be argued that the words swordand sord(yes, that's a word) have the same phonetic roots (they sound the same when you pronounce them).

例如,可能会争辩说单词swordsord(是的,那是一个单词)具有相同的音根(当您发音时,它们的发音相同)。

That being said, there are a number of algorithms that you can use to translate words (even mispelled ones) into phonetic variants.

话虽如此,您可以使用多种算法将单词(甚至拼错的单词)翻译成语音变体。

The first is Soundex. It's fairly simple to implement and there are a fair number of .NET implementations of this algorithm. It's rather simple, but it gives you real values you can compare to each other.

第一个是Soundex。实现起来相当简单,并且有相当数量的这种算法.NET 实现。这相当简单,但它为您提供了可以相互比较的真实值。

Another is Metaphone. While I can't find a native .NET implementation of Metaphone, the link provided has links to a number of other implementations which could be converted. The easiest to convert would probably be the Java implementation of the Metaphone algorithm.

另一个是Metaphone。虽然我找不到 Metaphone 的本机 .NET 实现,但提供的链接包含指向许多其他可以转换的实现的链接。最容易转换的可能是Metaphone 算法Java 实现

It should be noted that the Metaphone algorithm has gone through revisions. There is Double Metaphone(which has a .NET implementation) and Metaphone 3. Metaphone 3 is a commercial application, but has a 98% accuracy rate compared to an 89% accuracy rate for the Double Metaphone algorithm when run against a database of common English words. Depending on your need, you might want to look for (in the case of Double Metaphone) or purchase (in the case of Metaphone 3) the source for the algorithm and convert or access it through the P/Invoke layer (there are C++ implementations abound).

需要注意的是,Metaphone 算法已经过修改。有Double Metaphone(具有.NET 实现)和Metaphone 3。Metaphone 3 是一个商业应用程序,但在针对常见英语单词的数据库运行时,其准确率为 98%,而 Double Metaphone 算法的准确率为 89%。根据您的需要,您可能想要寻找(在双元音的情况下)或购买(在元音 3 的情况下)算法的源代码并通过 P/Invoke 层转换或访问它(有 C++ 实现)盛产)。

Metaphone and Soundex differ in the sense that Soundex produces fixed length numeric keys, whereas Metaphone produces keys of different length, so the results will be different. In the end, both will do the same kind of comparison for you, you just have to find out which suits your needs the best, given your requirements and resources (and intolerance levels for the spelling mistakes, of course).

Metaphone 和 Soundex 的区别在于 Soundex 产生固定长度的数字键,而 Metaphone 产生不同长度的键,因此结果会有所不同。最后,两者都会为您进行相同的比较,您只需要根据您的要求和资源(当然还有拼写错误的不容忍程度)找出最适合您的需求。

回答by Elijah

You can find implementations of soundex and the levenshtein distance algorithms in the open source CommonLibrary.NET project.

您可以在开源CommonLibrary.NET 项目中找到 soundex 和 levenshtein 距离算法的实现。

回答by Jonathan Wood

Your other option is to compare phonetically using Soundex or Metaphone. I just completed an article that presents C# code for both algorithms. You can view it at http://www.blackbeltcoder.com/Articles/algorithms/phonetic-string-comparison-with-soundex.

您的另一个选择是使用 Soundex 或 Metaphone 进行语音比较。我刚刚完成了一篇介绍这两种算法的 C# 代码的文章。您可以在http://www.blackbeltcoder.com/Articles/algorithms/phonetic-string-comparison-with-soundex查看。

回答by Ben Gripka

Here is an implementation of the LevenshteinDistance method that uses far less memory while producing the same results. This is a C# adaptation of the pseudo code found in this wikipedia articleunder the "Iterative with two matrix rows" heading.

这是 LevenshteinDistance 方法的实现,它使用更少的内存,同时产生相同的结果。这是本维基百科文章“具有两个矩阵行的迭代”标题下的伪代码的 C# 改编版。

public static int LevenshteinDistance(string source, string target)
{
    // degenerate cases
    if (source == target) return 0;
    if (source.Length == 0) return target.Length;
    if (target.Length == 0) return source.Length;

    // create two work vectors of integer distances
    int[] v0 = new int[target.Length + 1];
    int[] v1 = new int[target.Length + 1];

    // initialize v0 (the previous row of distances)
    // this row is A[0][i]: edit distance for an empty s
    // the distance is just the number of characters to delete from t
    for (int i = 0; i < v0.Length; i++)
        v0[i] = i;

    for (int i = 0; i < source.Length; i++)
    {
        // calculate v1 (current row distances) from the previous row v0

        // first element of v1 is A[i+1][0]
        //   edit distance is delete (i+1) chars from s to match empty t
        v1[0] = i + 1;

        // use formula to fill in the rest of the row
        for (int j = 0; j < target.Length; j++)
        {
            var cost = (source[i] == target[j]) ? 0 : 1;
            v1[j + 1] = Math.Min(v1[j] + 1, Math.Min(v0[j + 1] + 1, v0[j] + cost));
        }

        // copy v1 (current row) to v0 (previous row) for next iteration
        for (int j = 0; j < v0.Length; j++)
            v0[j] = v1[j];
    }

    return v1[target.Length];
}

Here is a function that will give you the percentage similarity.

这是一个可以为您提供百分比相似度的函数。

/// <summary>
/// Calculate percentage similarity of two strings
/// <param name="source">Source String to Compare with</param>
/// <param name="target">Targeted String to Compare</param>
/// <returns>Return Similarity between two strings from 0 to 1.0</returns>
/// </summary>
public static double CalculateSimilarity(string source, string target)
{
    if ((source == null) || (target == null)) return 0.0;
    if ((source.Length == 0) || (target.Length == 0)) return 0.0;
    if (source == target) return 1.0;

    int stepsToSame = LevenshteinDistance(source, target);
    return (1.0 - ((double)stepsToSame / (double)Math.Max(source.Length, target.Length)));
}