
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/53480/

Time: 2020-08-03 10:07:07  Source: igfitidea

Fuzzy text (sentences/titles) matching in C#

Asked by Lukas Šalkauskas

Hey, I'm using the Levenshtein algorithm to get the distance between a source and a target string.

I also have a method which returns a value from 0 to 1:

/// <summary>
/// Gets the similarity between two strings.
/// All relation scores are in the [0, 1] range,
/// which means that if the score reaches its maximum value (equal to 1)
/// then the two strings are identical.
/// </summary>
/// <param name="s1">The first string.</param>
/// <param name="s2">The second string.</param>
/// <returns>The similarity score in the [0, 1] range.</returns>
public static float CalculateSimilarity(String s1, String s2)
{
    if ((s1 == null) || (s2 == null)) return 0.0f;

    float dis = LevenshteinDistance.Compute(s1, s2);
    float maxLen = s1.Length;
    if (maxLen < s2.Length)
        maxLen = s2.Length;
    if (maxLen == 0.0F)
        return 1.0F;
    else return 1.0F - dis / maxLen;
}
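
The question's LevenshteinDistance.Compute helper is not shown. Assuming it computes the classic edit distance, here is a minimal Python sketch of both it and the normalized similarity above (function names are mine, for illustration only):

```python
def levenshtein(s1: str, s2: str) -> int:
    """Classic dynamic-programming edit distance (two-row variant)."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        current = [i]
        for j, c2 in enumerate(s2, 1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (c1 != c2)))  # substitution
        previous = current
    return previous[-1]


def calculate_similarity(s1, s2):
    """Mirror of the C# CalculateSimilarity: 1.0 means identical strings."""
    if s1 is None or s2 is None:
        return 0.0
    max_len = max(len(s1), len(s2))
    if max_len == 0:
        return 1.0
    return 1.0 - levenshtein(s1, s2) / max_len
```

For example, calculate_similarity("kitten", "sitting") divides the edit distance 3 by the longer length 7.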

But this is not enough for me, because I need a more sophisticated way to match two sentences.

For example, I want to automatically tag some music. I have the original song names, and I have file names polluted with trash words like super, quality, years like 2007, 2008, etc. Some files are just http://trash..thash..song_name_mp3.mp3, while others are normal. I want to create an algorithm that works better than my current one. Maybe anyone can help me?

Here is my current algorithm:

/// <summary>
/// Determines whether the given target string should be ignored.
/// </summary>
/// <param name="targetString">The target string.</param>
/// <returns>true if the string is an ignore word or matches a special case such as a year.</returns>
private bool doIgnore(String targetString)
{
    if (String.IsNullOrEmpty(targetString))
        return false;

    //* special cases, like years (2007, 2008, ...), are matched via Regex.
    if (isMatchInSpecialCases(targetString))
        return true;

    //* plain ignore words are matched by exact comparison.
    for (int i = 0; i < ignoreWordsList.Length; ++i)
    {
        if (targetString == ignoreWordsList[i])
            return true;
    }

    return false;
}

/// <summary>
/// Removes adjacent duplicates. The list is expected to be sorted first,
/// so that equal items sit next to each other.
/// </summary>
/// <param name="list">The sorted list.</param>
private void removeDuplicates(List<String> list)
{
    if ((list != null) && (list.Count > 0))
    {
        for (int i = 0; i < list.Count - 1; ++i)
        {
            if (list[i] == list[i + 1])
            {
                list.RemoveAt(i);
                --i;
            }
        }
    }
}

/// <summary>
/// Does the fuzzy match.
/// </summary>
/// <param name="targetTitle">The target title.</param>
/// <returns></returns>
private TitleMatchResult doFuzzyMatch(String targetTitle)
{
    TitleMatchResult matchResult = null;

    if (!String.IsNullOrEmpty(targetTitle))
    {
        try
        {
            //* change the target title to lower case.
            targetTitle = targetTitle.ToLower();

            //* scores; we will select the highest score at the end.
            Dictionary<Title, float> scores = new Dictionary<Title, float>();

            //* split on special chars: '-', ' ', '.', ',', '?', '/', ':', ';', '%', '(', ')', '#', '\"', '\'', '!', '|', '^', '*', '[', ']', '{', '}', '=', '+', '_'
            List<String> targetKeywords = new List<string>(targetTitle.Split(ignoreCharsList, StringSplitOptions.RemoveEmptyEntries));

            //* remove all trash from the keywords, like super, quality, etc.
            targetKeywords.RemoveAll(delegate(String x) { return doIgnore(x); });
            //* sort the keywords so removeDuplicates can drop adjacent duplicates.
            targetKeywords.Sort();
            removeDuplicates(targetKeywords);

            //* go through all original titles.
            foreach (Title sourceTitle in titles)
            {
                float tempScore = 0f;
                //* split the original title into a keyword list.
                List<String> sourceKeywords = new List<string>(sourceTitle.Name.Split(ignoreCharsList, StringSplitOptions.RemoveEmptyEntries));
                sourceKeywords.Sort();
                removeDuplicates(sourceKeywords);

                //* for each source keyword, find its best-matching target keyword.
                foreach (String keyw1 in sourceKeywords)
                {
                    float max = float.MinValue;
                    foreach (String keyw2 in targetKeywords)
                    {
                        float currentScore = StringMatching.StringMatching.CalculateSimilarity(keyw1.ToLower(), keyw2);
                        if (currentScore > max)
                        {
                            max = currentScore;
                        }
                    }
                    tempScore += max;
                }

                //* calculate the average score.
                float averageScore = tempScore / Math.Max(targetKeywords.Count, sourceKeywords.Count);

                //* keep it if the average score reaches the minimal score and the target title is not on this source title's ignore list.
                if (averageScore >= minimalScore && !sourceTitle.doIgnore(targetTitle))
                {
                    scores.Add(sourceTitle, averageScore);
                }
            }

            //* choose the biggest score.
            float maxi = float.MinValue;
            foreach (KeyValuePair<Title, float> kvp in scores)
            {
                if (kvp.Value > maxi)
                {
                    maxi = kvp.Value;
                    matchResult = new TitleMatchResult(maxi, kvp.Key, MatchTechnique.FuzzyLogic);
                }
            }
        }
        catch { } //* note: swallowing all exceptions hides real errors; at least log here.
    }

    //* return the result.
    return matchResult;
}

This works, but only in some cases; many titles that should match do not. I think I need some kind of formula to play with weights etc., but I can't come up with one...

Ideas? Suggestions? Algorithms?

By the way, I already know about this topic (my colleague already posted it, but we could not come up with a proper solution to this problem): Approximate string matching algorithms

Accepted answer by Keith

Your problem here may be distinguishing between noise words and useful data:


  • Rolling_Stones.Best_of_2003.Wild_Horses.mp3
  • Super.Quality.Wild_Horses.mp3
  • Tori_Amos.Wild_Horses.mp3

You may need to produce a dictionary of noise words to ignore. That seems clunky, but I'm not sure there's an algorithm that can distinguish between band/album names and noise.

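
Keith's noise-word dictionary could be sketched like this in Python (the word list and year pattern are illustrative guesses, not a canonical list):

```python
import re

# Hypothetical noise words -- in practice you would grow this list from your data.
NOISE_WORDS = {"super", "quality", "best", "of", "mp3", "www", "http"}
YEAR_RE = re.compile(r"^(19|20)\d{2}$")


def keywords(filename):
    """Split a file name on common separators and drop noise tokens."""
    tokens = re.split(r"[\s._\-]+", filename.lower())
    return [t for t in tokens
            if t and t not in NOISE_WORDS and not YEAR_RE.match(t)]
```

With such a filter, the three example names above all reduce to keyword lists sharing "wild" and "horses".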

Answer by Greg

It sounds like what you want may be a longest substring match. That is, in your example, two files like


trash..thash..song_name_mp3.mp3 and garbage..spotch..song_name_mp3.mp3


would end up looking the same.


You'd need some heuristics there, of course. One thing you might try is putting the string through a Soundex converter. Soundex is the "codec" used to see if things "sound" the same (as you might tell a telephone operator). It's more or less a rough phonetic and mispronunciation semi-proof transliteration. It is definitely poorer than edit distance, but much, much cheaper. (The official use is for names, and only uses three characters. There's no reason to stop there, though; just use the mapping for every character in the string. See Wikipedia for details.)

So my suggestion would be to soundex your strings, chop each one into a few length tranches (say 5, 10, 20) and then just look at clusters. Within clusters you can use something more expensive like edit distance or max substring.

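
A minimal Python sketch of the classic Soundex code Greg describes (the standard letter-plus-three-digits form; extend the digit count if you want longer codes):

```python
def soundex(word: str) -> str:
    """Classic American Soundex: first letter plus three digits."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    out = [letters[0].upper()]
    prev = codes.get(letters[0], "")
    for c in letters[1:]:
        code = codes.get(c, "")
        if code and code != prev:
            out.append(code)
        if c not in "hw":  # 'h'/'w' do not separate equal codes; vowels do
            prev = code
    return ("".join(out) + "000")[:4]
```

Strings that "sound" alike get equal codes, e.g. soundex("Robert") equals soundex("Rupert"), so the codes make cheap cluster keys before running the expensive comparisons.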

Answer by ima

There's a lot of work done on the somewhat related problem of DNA sequence alignment (search for "local sequence alignment") - the classic algorithms being Needleman-Wunsch (global alignment) and Smith-Waterman (local alignment), with more complex modern ones also easy to find. The idea is - similar to Greg's answer - instead of identifying and comparing keywords, to try to find the longest loosely matching substrings within long strings.

That being said, if the only goal is sorting music, a number of regular expressions covering the likely naming schemes would probably work better than any generic algorithm.
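
ima's regex suggestion might look like this (every pattern below is a hypothetical example of a naming-scheme rule, to be replaced by ones matching your actual files):

```python
import re

# Hypothetical cleanup rules for common junk in music file names.
CLEANUP_PATTERNS = [
    re.compile(r"https?://\S+?\.\."),                  # leading URL junk: http://trash..
    re.compile(r"\b(19|20)\d{2}\b"),                   # years such as 2007, 2008
    re.compile(r"\b(super|quality|rip|mp3)\b", re.I),  # known trash words
    re.compile(r"\.mp3$", re.I),                       # the file extension
]


def normalize(filename):
    """Strip junk patterns, then collapse separators into single spaces."""
    name = filename
    for pattern in CLEANUP_PATTERNS:
        name = pattern.sub(" ", name)
    return re.sub(r"[\s._\-]+", " ", name).strip().lower()
```

After normalization, exact comparison (or a cheap fuzzy comparison) of the surviving words is often enough.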

Answer by Alain

Kind of old, but it might be useful to future visitors. If you're already using the Levenshtein algorithm and you need to do a little better, I describe some very effective heuristics in this solution:

Getting the closest string match


The key is that you come up with 3 or 4 (or more) methods of gauging the similarity between your phrases (Levenshtein distance is just one method) - and then using real examples of strings you want to match as similar, you adjust the weightings and combinations of those heuristics until you get something that maximizes the number of positive matches. Then you use that formula for all future matches and you should see great results.


If a user is involved in the process, it's also best if you provide an interface which allows the user to see additional matches that rank highly in similarity in case they disagree with the first choice.


Here's an excerpt from the linked answer. If you end up wanting to use any of this code as is, I apologize in advance for having to convert VBA into C#.




Simple, speedy, and a very useful metric. Using this, I created two separate metrics for evaluating the similarity of two strings. One I call "valuePhrase" and one I call "valueWords". valuePhrase is just the Levenshtein distance between the two phrases, and valueWords splits the string into individual words, based on delimiters such as spaces, dashes, and anything else you'd like, and compares each word to each other word, summing up the shortest Levenshtein distance connecting any two words. Essentially, it measures whether the information in one 'phrase' is really contained in another, just as a word-wise permutation. I spent a few days as a side project coming up with the most efficient way possible of splitting a string based on delimiters.


valueWords, valuePhrase, and Split function:


Public Function valuePhrase#(ByRef S1$, ByRef S2$)
    valuePhrase = LevenshteinDistance(S1, S2)
End Function

Public Function valueWords#(ByRef S1$, ByRef S2$)
    Dim wordsS1$(), wordsS2$()
    wordsS1 = SplitMultiDelims(S1, " _-")
    wordsS2 = SplitMultiDelims(S2, " _-")
    Dim word1%, word2%, thisD#, wordbest#
    Dim wordsTotal#
    For word1 = LBound(wordsS1) To UBound(wordsS1)
        wordbest = Len(S2)
        For word2 = LBound(wordsS2) To UBound(wordsS2)
            thisD = LevenshteinDistance(wordsS1(word1), wordsS2(word2))
            If thisD < wordbest Then wordbest = thisD
            If thisD = 0 Then GoTo foundbest
        Next word2
foundbest:
        wordsTotal = wordsTotal + wordbest
    Next word1
    valueWords = wordsTotal
End Function

''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
' SplitMultiDelims
' This function splits Text into an array of substrings, each substring
' delimited by any character in DelimChars. Only a single character
' may be a delimiter between two substrings, but DelimChars may
' contain any number of delimiter characters. It returns a single element
' array containing all of text if DelimChars is empty, or a 1 or greater
' element array if the Text is successfully split into substrings.
' If IgnoreConsecutiveDelimiters is true, empty array elements will not occur.
' If Limit greater than 0, the function will only split Text into 'Limit'
' array elements or less. The last element will contain the rest of Text.
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
Function SplitMultiDelims(ByRef Text As String, ByRef DelimChars As String, _
        Optional ByVal IgnoreConsecutiveDelimiters As Boolean = False, _
        Optional ByVal Limit As Long = -1) As String()
    Dim ElemStart As Long, N As Long, M As Long, Elements As Long
    Dim lDelims As Long, lText As Long
    Dim Arr() As String

    lText = Len(Text)
    lDelims = Len(DelimChars)
    If lDelims = 0 Or lText = 0 Or Limit = 1 Then
        ReDim Arr(0 To 0)
        Arr(0) = Text
        SplitMultiDelims = Arr
        Exit Function
    End If
    ReDim Arr(0 To IIf(Limit = -1, lText - 1, Limit))

    Elements = 0: ElemStart = 1
    For N = 1 To lText
        If InStr(DelimChars, Mid(Text, N, 1)) Then
            Arr(Elements) = Mid(Text, ElemStart, N - ElemStart)
            If IgnoreConsecutiveDelimiters Then
                If Len(Arr(Elements)) > 0 Then Elements = Elements + 1
            Else
                Elements = Elements + 1
            End If
            ElemStart = N + 1
            If Elements + 1 = Limit Then Exit For
        End If
    Next N
    'Get the last token terminated by the end of the string into the array
    If ElemStart <= lText Then Arr(Elements) = Mid(Text, ElemStart)
    'Since the end of string counts as the terminating delimiter, if the last character
    'was also a delimiter, we treat the two as consecutive, and so ignore the last element
    If IgnoreConsecutiveDelimiters Then If Len(Arr(Elements)) = 0 Then Elements = Elements - 1

    ReDim Preserve Arr(0 To Elements) 'Chop off unused array elements
    SplitMultiDelims = Arr
End Function
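
For readers who would rather not hand-convert the VBA, here is a rough Python equivalent of valuePhrase and valueWords (a sketch under the assumption that LevenshteinDistance is the classic edit distance; the names are mine):

```python
import re


def levenshtein(s1, s2):
    """Classic dynamic-programming edit distance."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        current = [i]
        for j, c2 in enumerate(s2, 1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (c1 != c2)))
        previous = current
    return previous[-1]


def value_phrase(s1, s2):
    """Whole-phrase edit distance, as in the VBA valuePhrase."""
    return levenshtein(s1, s2)


def value_words(s1, s2, delims=" _-"):
    """Sum, over each word of s1, the distance to its closest word in s2."""
    pattern = "[" + re.escape(delims) + "]+"
    words1 = [w for w in re.split(pattern, s1) if w]
    words2 = [w for w in re.split(pattern, s2) if w]
    total = 0
    for w1 in words1:
        total += min((levenshtein(w1, w2) for w2 in words2), default=len(s2))
    return total
```

Note that value_words("wild horses", "horses wild") is 0 while value_phrase on the same pair is not: that is exactly the word-wise permutation insensitivity the answer describes.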

Measures of Similarity


Using these two metrics, and a third which simply computes the distance between two strings, I have a series of variables over which I can run an optimization algorithm to achieve the greatest number of matches. Fuzzy string matching is, itself, a fuzzy science, and so by creating linearly independent metrics for measuring string similarity, and having a known set of strings we wish to match to each other, we can find the parameters that, for our specific styles of strings, give the best fuzzy match results.

Initially, the goal of the metric was to have a low search value for an exact match, and increasing search values for increasingly permuted measures. In an impractical case, this was fairly easy to define using a set of well-defined permutations, and engineering the final formula such that they had increasing search value results as desired.

[screenshot: the metric scores computed for the test strings]

As you can see, the last two metrics, which are fuzzy string matching metrics, already have a natural tendency to give low scores to strings that are meant to match (down the diagonal). This is very good.


Application

To allow the optimization of fuzzy matching, I weight each metric. As such, every application of fuzzy string matching can weight the parameters differently. The formula that defines the final score is simply a combination of the metrics and their weights:

value = Min(phraseWeight*phraseValue, wordsWeight*wordsValue)*minWeight + 
        Max(phraseWeight*phraseValue, wordsWeight*wordsValue)*maxWeight + lengthWeight*lengthValue
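
That formula can be sketched directly in Python (the default weights below are placeholders; as the answer explains, they must be tuned per dataset):

```python
def combined_value(phrase_value, words_value, length_value,
                   phrase_weight=0.5, words_weight=1.0,
                   min_weight=10.0, max_weight=1.0, length_weight=-0.3):
    """Weighted min/max combination of the phrase and word metrics."""
    a = phrase_weight * phrase_value
    b = words_weight * words_value
    return (min(a, b) * min_weight
            + max(a, b) * max_weight
            + length_weight * length_value)
```

Taking both the min and the max of the two metrics lets the optimizer decide whether the better or the worse of the two scores should dominate.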

Using an optimization algorithm (a neural network is best here because it is a discrete, multi-dimensional problem), the goal is now to maximize the number of matches. I created a function that detects the number of correct matches of each set to each other, as can be seen in this final screenshot. A column or row gets a point if the lowest score is assigned to the string that was meant to be matched, and partial points are given if there is a tie for the lowest score and the correct match is among the tied strings. I then optimized it. You can see that a green cell is the column that best matches the current row, and a blue square around the cell is the row that best matches the current column. The score in the bottom corner is roughly the number of successful matches, and this is what we tell our optimization problem to maximize.

[screenshot: the match matrix, with best matches highlighted and the overall score in the bottom corner]

Answer by Riga

There is a GitHub repo implementing several of these methods.