Notice: this page is a side-by-side Chinese/English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/5351659/

Time: 2020-08-25 21:07:18  Source: igfitidea

Algorithms for string similarities (better than Levenshtein, and similar_text)? Php, Js

php

Asked by Cambiata

Where can I find algorithms that value the spelling of misplaced characters more accurately than PHP's levenshtein() and similar_text() functions?

Example:

similar_text('jonas', 'xxjon', $similar); echo $similar; // returns 60
similar_text('jonas', 'asjon', $similar); echo $similar; // returns 60 <- although more similar!
echo levenshtein('jonas', 'xxjon'); // returns 4
echo levenshtein('jonas', 'asjon'); // returns 4  <- although more similar!

/ Jonas

Answered by Cambiata

Here's a solution that I've come up with. It's based on Tim's suggestion of comparing the order of subsequent characters. Some results:

  • jonas / jonax : 0.8
  • jonas / sjona : 0.68
  • jonas / sjonas : 0.66
  • jonas / asjon : 0.52
  • jonas / xxjon : 0.36

I'm sure it isn't perfect and that it could be optimized, but it nevertheless seems to produce the results I'm after. One weak spot is that when the strings have different lengths, it produces a different result when the arguments are swapped.

static public function string_compare($str_a, $str_b) 
{
    $length = strlen($str_a);
    $length_b = strlen($str_b);

    $i = 0;
    $segmentcount = 0;
    $segmentsinfo = array();
    $segment = '';
    while ($i < $length) 
    {
        $char = substr($str_a, $i, 1);
        if (strpos($str_b, $char) !== FALSE) 
        {               
            $segment = $segment.$char;
            if (strpos($str_b, $segment) !== FALSE) 
            {
                $segmentpos_a = $i - strlen($segment) + 1;
                $segmentpos_b = strpos($str_b, $segment);
                $positiondiff = abs($segmentpos_a - $segmentpos_b);
                $posfactor = ($length - $positiondiff) / $length_b; // <-- ?
                $lengthfactor = strlen($segment)/$length;
                $segmentsinfo[$segmentcount] = array( 'segment' => $segment, 'score' => ($posfactor * $lengthfactor));
            } 
            else 
            {
                $segment = '';
                $i--;
                $segmentcount++;
            } 
        } 
        else 
        {
            $segment = '';
            $segmentcount++;
        }
        $i++;
    }   

    // PHP 5.3 lambda in array_map      
    $totalscore = array_sum(array_map(function($v) { return $v['score'];  }, $segmentsinfo));
    return $totalscore;     
}
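
One possible way to soften the swapped-arguments weak spot mentioned above is to average the score over both argument orders. The sketch below is not part of the original answer: `symmetric_score` and the toy `char_overlap` scorer are hypothetical names introduced only for illustration, and clamping the result to the range 0 to 1 is an added assumption.

```php
<?php
// Hypothetical helper (not from the original answer): makes any similarity
// function order-independent by averaging both directions, then clamps the
// result to the range [0, 1].
function symmetric_score(callable $score, string $a, string $b): float
{
    $avg = ($score($a, $b) + $score($b, $a)) / 2;
    return max(0.0, min(1.0, $avg));
}

// Toy asymmetric scorer used only for this demo: the fraction of
// characters in $a that also occur somewhere in $b.
function char_overlap(string $a, string $b): float
{
    if ($a === '') {
        return 0.0;
    }
    $hits = 0;
    foreach (str_split($a) as $char) {
        if (strpos($b, $char) !== false) {
            $hits++;
        }
    }
    return $hits / strlen($a);
}

echo char_overlap('ab', 'abcd'), "\n";                    // 1
echo char_overlap('abcd', 'ab'), "\n";                    // 0.5
echo symmetric_score('char_overlap', 'ab', 'abcd'), "\n"; // 0.75
echo symmetric_score('char_overlap', 'abcd', 'ab'), "\n"; // 0.75
```

To apply this to string_compare, any PHP callable that refers to that static method could be passed as the first argument. Averaging hides the asymmetry but does not explain it, so it is a workaround rather than a fix.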

Answered by Mark Baker

In addition to levenshtein() and similar_text(), there's also:

soundex(): Returns the four-character soundex key of a word, which should be the same as the key for any similar-sounding word.
metaphone(): Similar to soundex, and possibly more effective for you. It's more accurate than soundex() as it knows the basic rules of English pronunciation. The metaphone generated keys are of variable length.
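
A minimal sketch of how these two built-ins might be used. `phonetic_distance` is a hypothetical helper, not something from the answer; it assumes that comparing metaphone keys with levenshtein() is an acceptable proxy for "sounds alike". Phonetic keys do not by themselves address misplaced segments such as 'asjon' versus 'jonas'.

```php
<?php
// Two spellings that sound alike share the same soundex key.
echo soundex('Robert'), "\n";     // R163
echo soundex('Rupert'), "\n";     // R163

// metaphone() keys have variable length.
echo metaphone('Thompson'), "\n"; // TMSN

// Hypothetical helper: edit distance between the metaphone keys
// rather than between the raw strings.
function phonetic_distance(string $a, string $b): int
{
    return levenshtein(metaphone($a), metaphone($b));
}

echo phonetic_distance('jonas', 'jonas'), "\n"; // 0
```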

Answered by Solo.dmitry

Please, be careful about using string_compare:

ivanov ivan / ivanov ivan : 1 OK!

ivanov ivan2/ ivanov ivan : 1 o_O

ivanov ivan/ ivanov i : 1.1363636363636 OMG!

Answered by Tim

@Tim: I'm actually looking for a way to process/measure similarities in a pedagogical game context. Let's say that a student's task is to select objects from a pool and put those objects in a specific order (sort them alphabetically or whatever). I then need a way to measure the similarity between the student's answer and the correct one.

Algorithms to calculate the degree-of-correctness of the order of characters in a word (i.e. its spelling) could be very different from an algorithm to measure the correct order of words in a list. The way spelling algorithms handle omissions or dittography or transpositions might not apply very well to your use case.

If you know the order of elements in advance, and know the number of elements too, then you could simply loop through the answer and compare value-at-position to correct-value-at-position and arrive at a percentage-correct. Yet that would be a crude measure, and misleading, for if the goal of your game was to test, say, whether the gamer understood alphabetic sorting, and the gamer happened to get the first word wrong, every word could be in the wrong position even if the words were in otherwise correct alphabetic order:

      banana
      blackberry
      blueberry
      cherry
      fig
      grapefruit
      orange
      pear
      persimmon
      raspberry
      apple

So what you could do to improve the accuracy of your measurement in our hypothetical situation is this: loop through the gamer's answer-list looking to see if the answer value is immediately followed by the correct word; every time a word is followed by the correct word, you would give the gamer a point. The gamer who produced the list above would get 9 points out of a possible 10 and that score would indeed accurately reflect the gamer's understanding of the rules of alphabetic sorting.
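
The two measures described above can be sketched roughly as follows. The function names `positional_score` and `adjacency_points` are hypothetical, and the sketch assumes plain zero-indexed arrays in which every word appears only once.

```php
<?php
// Crude measure: fraction of items sitting at exactly the right index.
function positional_score(array $answer, array $correct): float
{
    $total = count($correct);
    if ($total === 0) {
        return 0.0;
    }
    $hits = 0;
    foreach ($correct as $i => $word) {
        if (isset($answer[$i]) && $answer[$i] === $word) {
            $hits++;
        }
    }
    return $hits / $total;
}

// Adjacency measure: one point each time a word in the answer is
// immediately followed by its correct successor.
function adjacency_points(array $answer, array $correct): int
{
    // Map each word to the word that should come right after it.
    $successor = [];
    for ($i = 0; $i < count($correct) - 1; $i++) {
        $successor[$correct[$i]] = $correct[$i + 1];
    }
    $points = 0;
    for ($i = 0; $i < count($answer) - 1; $i++) {
        $word = $answer[$i];
        if (isset($successor[$word]) && $successor[$word] === $answer[$i + 1]) {
            $points++;
        }
    }
    return $points;
}

$correct = ['apple', 'banana', 'blackberry', 'blueberry', 'cherry', 'fig',
            'grapefruit', 'orange', 'pear', 'persimmon', 'raspberry'];
// The gamer's list from the example: correct order, but 'apple' is last.
$answer  = ['banana', 'blackberry', 'blueberry', 'cherry', 'fig',
            'grapefruit', 'orange', 'pear', 'persimmon', 'raspberry', 'apple'];

echo positional_score($answer, $correct), "\n"; // 0
echo adjacency_points($answer, $correct), "\n"; // 9 (out of a possible 10)
```

The positional score is 0 because every word is shifted by one slot, while the adjacency count still credits the nine correctly ordered pairs.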

Answered by joshweir

I've found that Jaro-Winkler is also good for spelling mistakes and small differences between strings. I modified this code to be object-oriented:

class StringCompareJaroWinkler 
{
    public function compare($str1, $str2)
    {
        return $this->JaroWinkler($str1, $str2, 0.1); // 0.1 is the prefix scale
    }

    private function getCommonCharacters( $string1, $string2, $allowedDistance ){

      $str1_len = mb_strlen($string1);
      $str2_len = mb_strlen($string2);
      $temp_string2 = str_split($string2);

      $commonCharacters='';
      for( $i=0; $i < $str1_len; $i++){

        $noMatch = True;
        // compare if char does match inside given allowedDistance
        // and if it does add it to commonCharacters
        for( $j= max( 0, $i-$allowedDistance ); $noMatch && $j < min( $i + $allowedDistance + 1, $str2_len ); $j++){
          if( $temp_string2[$j] == $string1[$i] ){
            $noMatch = False;
            $commonCharacters .= $string1[$i];
            $temp_string2[$j] = '';
          }
        }
      }
      return $commonCharacters;
    }

    private function Jaro( $string1, $string2 ){

      $str1_len = mb_strlen( $string1 );
      $str2_len = mb_strlen( $string2 );

      // theoretical distance
      $distance = (int) floor(min( $str1_len, $str2_len ) / 2.0); 

      // get common characters
      $commons1 = $this->getCommonCharacters( $string1, $string2, $distance );
      $commons2 = $this->getCommonCharacters( $string2, $string1, $distance );

      if( ($commons1_len = mb_strlen( $commons1 )) == 0) return 0;
      if( ($commons2_len = mb_strlen( $commons2 )) == 0) return 0;
      // calculate transpositions
      $transpositions = 0;
      $upperBound = min( $commons1_len, $commons2_len );
      for( $i = 0; $i < $upperBound; $i++){
        if( $commons1[$i] != $commons2[$i] ) $transpositions++;
      }
      $transpositions /= 2.0;
      // return the Jaro distance
      return ($commons1_len/($str1_len) + $commons2_len/($str2_len) + ($commons1_len - $transpositions)/($commons1_len)) / 3.0;

    }

    private function getPrefixLength( $string1, $string2, $MINPREFIXLENGTH = 4 ){

      $n = min( array( $MINPREFIXLENGTH, mb_strlen($string1), mb_strlen($string2) ) );

      for($i = 0; $i < $n; $i++){
        if( $string1[$i] != $string2[$i] ){
          // return index of first occurrence of different characters 
          return $i;
        }
      }
      // first n characters are the same   
      return $n;
    }

    private function JaroWinkler($string1, $string2, $PREFIXSCALE = 0.1 ){

      $JaroDistance = $this->Jaro( $string1, $string2 );
      $prefixLength = $this->getPrefixLength( $string1, $string2 );
      return $JaroDistance + $prefixLength * $PREFIXSCALE * (1.0 - $JaroDistance);
    }
}

$jw = new StringCompareJaroWinkler();
echo $jw->compare("jonas","asjon");
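
For reference, the quantities the class computes can be written out as below. The symbols are introduced here and are not in the original answer: m is the number of common characters, t the halved transposition count, ℓ the shared prefix length (at most 4), and p the prefix scale (0.1). The worked numbers come from tracing the class above by hand for "jonas" and "asjon"; this class searches for matches within floor(min(|s1|, |s2|) / 2) positions, so implementations that define the window differently may give a different value.

```latex
% Jaro similarity and the Jaro-Winkler adjustment, as computed by the class above
\[
  \mathrm{sim}_J = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right),
  \qquad
  \mathrm{sim}_{JW} = \mathrm{sim}_J + \ell \, p \, (1 - \mathrm{sim}_J)
\]

% Hand trace for s_1 = jonas, s_2 = asjon:
% common characters "jon" (m = 3), no transpositions (t = 0), no shared prefix (\ell = 0)
\[
  \mathrm{sim}_J = \frac{1}{3}\left(\frac{3}{5} + \frac{3}{5} + \frac{3 - 0}{3}\right)
                 = \frac{11}{15} \approx 0.733,
  \qquad
  \mathrm{sim}_{JW} = \frac{11}{15} + 0 \cdot 0.1 \cdot \left(1 - \frac{11}{15}\right) \approx 0.733
\]
```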