javascript Similar_text 是如何工作的？

Question

提问by Hugo Delsing

I just found the similar_text function and was playing around with it, but the percentage output always suprises me. See the examples below.

我刚刚找到了similar_text 函数并正在尝试使用它，但百分比输出总是让我感到惊讶。请参阅下面的示例。

I tried to find information on the algorithm used as mentioned on php: similar_text()^Docs:

我试图找到有关php: similar_text()^Docs上提到的算法的信息：

<?php
$p = 0;
similar_text('aaaaaaaaaa', 'aaaaa', $p);
echo $p . "<hr>";
//66.666666666667
//Since 5 out of 10 chars match, I would expect a 50% match

similar_text('aaaaaaaaaaaaaaaaaaaa', 'aaaaa', $p);
echo $p . "<hr>";
//40
//5 out of 20 > not 25% ?

similar_text('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', 'aaaaa', $p);
echo $p . "<hr>"; 
//9.5238095238095 
//5 out of 100 > not 5% ?


//Example from PHP.net
//Why is turning the strings around changing the result?

similar_text('PHP IS GREAT', 'WITH MYSQL', $p);
echo $p . "<hr>"; //27.272727272727

similar_text('WITH MYSQL', 'PHP IS GREAT', $p);
echo $p . "<hr>"; //18.181818181818

?>

Can anybody explain how this actually works?

任何人都可以解释这实际上是如何工作的吗？

Update:

更新：

Thanks to the comments I found that the percentage is actually calculated using the number of similar charactors * 200 / length1 + lenght 2

感谢评论，我发现百分比实际上是使用相似字符数 * 200 / length1 + lenght 2 计算的

Z_DVAL_PP(percent) = sim * 200.0 / (t1_len + t2_len);

So that explains why the percenatges are higher then expected. With a string with 5 out of 95 it turns out 10, so that I can use.

所以这就解释了为什么百分比高于预期。使用 95 个中有 5 个的字符串，结果是 10 个，这样我就可以使用了。

similar_text('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', 'aaaaa', $p);
echo $p . "<hr>"; 
//10
//5 out of 95 = 5 * 200 / (5 + 95) = 10

But I still cant figure out why PHP returns a different result on turning the strings around. The JS code provided by dfsq doesn't do this. Looking at the source code in PHP I can only find a difference in the following line, but i'm not a c programmer. Some insight in what the difference is, would be appreciated.

但我仍然无法弄清楚为什么 PHP 在转换字符串时会返回不同的结果。dfsq 提供的 JS 代码没有做到这一点。查看 PHP 中的源代码，我只能在以下行中找到不同之处，但我不是 ac 程序员。对区别的一些见解，将不胜感激。

In JS:

在JS中：

for (l = 0;(p + l < firstLength) && (q + l < secondLength) && (first.charAt(p + l) === second.charAt(q + l)); l++);

In PHP: (php_similar_str function)

在 PHP 中：（php_similar_str 函数）

for (l = 0; (p + l < end1) && (q + l < end2) && (p[l] == q[l]); l++);

Source:

来源：

/* {{{ proto int similar_text(string str1, string str2 [, float percent])
   Calculates the similarity between two strings */
PHP_FUNCTION(similar_text)
{
  char *t1, *t2;
  zval **percent = NULL;
  int ac = ZEND_NUM_ARGS();
  int sim;
  int t1_len, t2_len;

  if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "ss|Z", &t1, &t1_len, &t2, &t2_len, &percent) == FAILURE) {
    return;
  }

  if (ac > 2) {
    convert_to_double_ex(percent);
  }

  if (t1_len + t2_len == 0) {
    if (ac > 2) {
      Z_DVAL_PP(percent) = 0;
    }

    RETURN_LONG(0);
  }

  sim = php_similar_char(t1, t1_len, t2, t2_len);

  if (ac > 2) {
    Z_DVAL_PP(percent) = sim * 200.0 / (t1_len + t2_len);
  }

  RETURN_LONG(sim);
}
/* }}} */ 


/* {{{ php_similar_str
 */
static void php_similar_str(const char *txt1, int len1, const char *txt2, int len2, int *pos1, int *pos2, int *max)
{
  char *p, *q;
  char *end1 = (char *) txt1 + len1;
  char *end2 = (char *) txt2 + len2;
  int l;

  *max = 0;
  for (p = (char *) txt1; p < end1; p++) {
    for (q = (char *) txt2; q < end2; q++) {
      for (l = 0; (p + l < end1) && (q + l < end2) && (p[l] == q[l]); l++);
      if (l > *max) {
        *max = l;
        *pos1 = p - txt1;
        *pos2 = q - txt2;
      }
    }
  }
}
/* }}} */


/* {{{ php_similar_char
 */
static int php_similar_char(const char *txt1, int len1, const char *txt2, int len2)
{
  int sum;
  int pos1, pos2, max;

  php_similar_str(txt1, len1, txt2, len2, &pos1, &pos2, &max);

  if ((sum = max)) {
    if (pos1 && pos2) {
      sum += php_similar_char(txt1, pos1,
                  txt2, pos2);
    }
    if ((pos1 + max < len1) && (pos2 + max < len2)) {
      sum += php_similar_char(txt1 + pos1 + max, len1 - pos1 - max,
                  txt2 + pos2 + max, len2 - pos2 - max);
    }
  }

  return sum;
}
/* }}} */

Source in Javascript: similar text port to javascript

Javascript 中的源代码：与 javascript 类似的文本端口

Answer 1

采纳答案by eis

It would indeed seem the function uses different logic depending of the parameter order. I think there are two things at play.

该函数似乎确实根据参数顺序使用不同的逻辑。我认为有两件事在起作用。

First, see this example:

首先，看这个例子：

echo similar_text('test','wert'); // 1
echo similar_text('wert','test'); // 2

It seems to be that it is testing "how many times any distinct char on param1 is found in param2", and thus result would be different if you swap the params around. It has been reported as a bug, which has been closed as "working as expected".

似乎它正在测试“在 param2 中找到 param1 上任何不同字符的次数”，因此如果您交换参数，结果会有所不同。它已被报告为一个错误，已被关闭为“按预期工作”。

Now, the above is the samefor both PHP and javascript implementations - paremeter order has an impact, so saying that JS code wouldn't do this is wrong. This is argued in the bug entry as intended behaviour.

现在，以上对于 PHP 和 javascript 实现都是相同的 - 参数顺序有影响，所以说 JS 代码不会这样做是错误的。这在错误条目中被认为是预期行为。

Second - what doesn't seem correct is the MYSQL/PHP word example. With that, javascript version gives 3 irrelevant of the order of params, whereas PHP gives 2 and 3 (and due to that, percentage is equally different). Now, the phrases "PHP IS GREAT" and "WITH MYSQL" should have 5 characters in common, irrelevant of which way you compare: H, I, S and T, one each, plus one for empty space. In order they have 3 characters, 'H', ' ' and 'S', so if you look at the ordering, correct answer should be 3 both ways. I modified the C code to a runnable version, and added some output, so one can see what is happening there (codepad link):

其次 - 似乎不正确的是 MYSQL/PHP 单词示例。这样，javascript 版本给出了 3 个与参数顺序无关的值，而 PHP 给出了 2 和 3（因此，百分比同样不同）。现在，短语“PHP IS GREAT”和“WITH MYSQL”应该有 5 个共同的字符，与您比较哪种方式无关：H、I、S 和 T，各一个，加上一个空格。为了它们有 3 个字符，'H'、' ' 和 'S'，所以如果你看看排序，正确的答案应该是 3 双向。我将 C 代码修改为可运行的版本，并添加了一些输出，以便人们可以看到那里发生了什么（键盘链接）：

#include<stdio.h>

/* {{{ php_similar_str
 */
static void php_similar_str(const char *txt1, int len1, const char *txt2, int len2, int *pos1, int *pos2, int *max)
{
  char *p, *q;
  char *end1 = (char *) txt1 + len1;
  char *end2 = (char *) txt2 + len2;
  int l;

  *max = 0;
  for (p = (char *) txt1; p < end1; p++) {
    for (q = (char *) txt2; q < end2; q++) {
      for (l = 0; (p + l < end1) && (q + l < end2) && (p[l] == q[l]); l++);
      if (l > *max) {
        *max = l;
        *pos1 = p - txt1;
        *pos2 = q - txt2;
      }
    }
  }
}
/* }}} */


/* {{{ php_similar_char
 */
static int php_similar_char(const char *txt1, int len1, const char *txt2, int len2)
{
  int sum;
  int pos1, pos2, max;

  php_similar_str(txt1, len1, txt2, len2, &pos1, &pos2, &max);

  if ((sum = max)) {
    if (pos1 && pos2) {
      printf("txt here %s,%s\n", txt1, txt2);
      sum += php_similar_char(txt1, pos1,
                  txt2, pos2);
    }
    if ((pos1 + max < len1) && (pos2 + max < len2)) {
      printf("txt here %s,%s\n", txt1+ pos1 + max, txt2+ pos2 + max);
      sum += php_similar_char(txt1 + pos1 + max, len1 - pos1 - max,
                  txt2 + pos2 + max, len2 - pos2 - max);
    }
  }

  return sum;
}
/* }}} */
int main(void)
{
    printf("Found %d similar chars\n",
        php_similar_char("PHP IS GREAT", 12, "WITH MYSQL", 10));
    printf("Found %d similar chars\n",
        php_similar_char("WITH MYSQL", 10,"PHP IS GREAT", 12));
    return 0;
}

the result is output:

结果是输出：

txt here PHP IS GREAT,WITH MYSQL
txt here P IS GREAT, MYSQL
txt here IS GREAT,MYSQL
txt here IS GREAT,MYSQL
txt here  GREAT,QL
Found 3 similar chars
txt here WITH MYSQL,PHP IS GREAT
txt here TH MYSQL,S GREAT
Found 2 similar chars

So one can see that on the first comparison, the function found 'H', ' ' and 'S', but not 'T', and got the result of 3. The second comparison found 'I' and 'T' but not 'H', ' ' or 'S', and thus got the result of 2.

所以可以看到，在第一次比较中，函数找到了'H'、' '和'S'，但没有找到'T'，结果为3。第二次比较找到了'I'和'T'但没有'H'、' ' 或 'S'，从而得到 2 的结果。

The reason for these results can be seen from the output: algorithm takes the first letter in the first string that second string contains, counts that, and throws away the chars before that from the second string. That is why it misses the characters in-between, and that's the thing causing the difference when you change the character order.

产生这些结果的原因可以从输出中看出：算法取第二个字符串包含的第一个字符串中的第一个字母，对其进行计数，并从第二个字符串中丢弃该字符之前的字符。这就是为什么它错过了中间的字符，这就是当您更改字符顺序时导致差异的原因。

What happens there might be intentional or it might not. However, that's not how javascript version works. If you print out the same things in the javascript version, you get this:

那里发生的事情可能是有意的，也可能不是。但是，这不是 javascript 版本的工作方式。如果你在 javascript 版本中打印出相同的东西，你会得到这个：

txt here: PHP, WIT
txt here: P IS GREAT,  MYSQL
txt here: IS GREAT, MYSQL
txt here: IS, MY
txt here:  GREAT, QL
Found 3 similar chars
txt here: WITH, PHP 
txt here: W, P
txt here: TH MYSQL, S GREAT
Found 3 similar chars

showing that javascript version does it in a different way. What the javascript version does is that it finds 'H', ' ' and 'S' being in the same order in the first comparison, and the same 'H', ' ' and 'S' also on the second one - so in this case the order of params doesn't matter.

显示 javascript 版本以不同的方式执行此操作。javascript 版本的作用是它发现 'H'、' ' 和 'S' 在第一次比较中的顺序相同，并且在第二次比较中也有相同的 'H'、' ' 和 'S' - 所以在在这种情况下，参数的顺序无关紧要。

As the javascript is meant to duplicate the code of PHP function, it needs to behave identically, so I submitted bug report based on analysis of @Khez and the fix, which has been merged now.

由于javascript是为了复制PHP函数的代码，它需要表现相同，所以我根据@Khez的分析和修复提交了错误报告，现在已经合并。

Answer 2

回答by Khez

This was actually a very interesting question, thank you for giving me a puzzle that turned out to be very rewarding.

这实际上是一个非常有趣的问题，谢谢你给了我一个结果非常有益的谜题。

Let me start out by explaining how similar_textactually works.

让我首先解释similar_text 的实际工作原理。

Similar Text: The Algorithm

相似文字：算法

It's a recursion based divide and conquer algorithm. It works by first finding the longest common string between the two inputsand breaking the problem into subsets around that string.

这是一个基于递归的分治算法。它的工作原理是首先找到两个输入之间最长的公共字符串，然后将问题分解为围绕该字符串的子集。

The examples you have used in your question, actually all perform only one iteration of the algorithm. The only ones not using one iteration and the ones giving different results are from the php.net comments.

您在问题中使用的示例实际上都只执行算法的一次迭代。唯一不使用一次迭代和给出不同结果的来自php.net 评论。

Here is a simple example to understand the main issue behind simple_text and hopefully give some insight into how it works.

这是一个简单的例子来理解 simple_text 背后的主要问题，并希望对它的工作原理有一些了解。

Similar Text: The Flaw

相似文字：缺陷

eeeefaaaaafddddd
ddddgaaaaagbeeee

Iteration 1:
Max    = 5
String = aaaaa
Left : eeeef and ddddg
Right: fddddd and geeeee

I hope the flaw is already apparent. It will only check directly to the left and to the right of the longest matched stringin both input strings. This example

我希望缺陷已经很明显了。它只会直接检查两个输入字符串中最长匹配字符串的左侧和右侧。这个例子

$s1='eeeefaaaaafddddd';
$s2='ddddgaaaaagbeeee';

echo similar_text($s1, $s2).'|'.similar_text($s2, $s1);
// outputs 5|5, this is due to Iteration 2 of the algorithm
// it will fail to find a matching string in both left and right subsets

To be honest, I'm uncertain how this case should be treated. It can be seen that only 2 characters are different in the string. But both eeeeand ddddare on opposite ends of the two strings, uncertain what NLPenthusiasts or other literaryexperts have to say about this specific situation.

老实说，我不确定应该如何处理这个案子。可以看出，字符串中只有2个字符不同。但是eeee和dddd都在两个字符串的两端，不确定NLP爱好者或其他文学专家对这种特定情况有什么看法。

Similar Text: Inconsistent results on argument swapping

类似的文字：参数交换的结果不一致

The different results you were experiencing based on input order was due to the way the alogirthm actually behaves (as mentioned above). I'll give a final explination on what's going on.

您根据输入顺序遇到的不同结果是由于算法的实际行为方式（如上所述）。我将对正在发生的事情做最后的解释。

echo similar_text('test','wert'); // 1
echo similar_text('wert','test'); // 2

On the first case, there's only one Iteration:

在第一种情况下，只有一个迭代：

test
wert

Iteration 1:
Max    = 1
String = t
Left :  and wer
Right: est and

We only have one iteration because empty/null strings return 0 on recursion. So this ends the algorithm and we have our result: 1

我们只有一次迭代，因为空/空字符串在递归时返回 0。所以这结束了算法，我们得到了我们的结果：1

On the second case, however, we are faced with multiple Iterations:

然而，在第二种情况下，我们面临多次迭代：

wert
test

Iteration 1:
Max    = 1
String = e
Left : w and t
Right: rt and st

We already have a common string of length 1. The algorithm on the left subset will end in 0 matches, but on the right:

我们已经有一个长度为 1 的公共字符串。左边子集的算法将以 0 个匹配结束，但在右边：

rt
st

Iteration 1:
Max    = 1
String = t
Left : r and s
Right:  and

This will lead to our new and final result: 2

这将导致我们的新结果和最终结果：2

I thank you for this very informative question and the opportunity to dabble in C++ again.

我感谢您提出的这个非常有用的问题以及再次涉足 C++ 的机会。

Similar Text: JavaScript Edition

相似文字：JavaScript 版

The short answer is: The javascript code is not implementing the correct algorithm

简短的回答是：javascript 代码没有实现正确的算法

sum += this.similar_text(first.substr(0, pos2), second.substr(0, pos2));

Obviously it should be first.substr(0,pos1)

显然应该是 first.substr(0,pos1)

Note:The JavaScript code has been fixed by eisin a previous commit. Thanks @eis

注意：JavaScript 代码已在之前的提交中由eis修复。谢谢@eis

Demystified!

揭秘！

Answer 3

回答by spinsch

first String = aaaaaaaaaa = 10 letters
second String = aaaaa = 5 letters

first five letters are similar
a+a
a+a
a+a
a+a
a+a
a
a
a
a
a


( <similar_letters> * 200 ) / (<letter_count_first_string> + <letter_count_second_string>)

( 5 * 200 ) / (10 + 5);
= 66.6666666667

Answer 4

回答by Kamesh S

Description int similar_text ( string $first , string $second [, float &$percent ] )

说明 int similar_text ( string $first , string $second [, float &$percent ] )

This calculates the similarity between two strings as described in Oliver [1993]. Note that this implementation does not use a stack as in Oliver's pseudo code, but recursive calls which may or may not speed up the whole process. Note also that the complexity of this algorithm is O(N**3) where N is the length of the longest string. Parameters

这会计算两个字符串之间的相似度，如 Oliver [1993] 中所述。请注意，此实现不像 Oliver 的伪代码那样使用堆栈，而是使用递归调用，这可能会或可能不会加速整个过程。另请注意，该算法的复杂度为 O(N**3)，其中 N 是最长字符串的长度。参数

first

第一的

The first string.

second

第二

The second string.

percent

百分

By passing a reference as third argument, similar_text() will calculate the similarity in percent for you.

javascript Similar_text 是如何工作的？

提问by Hugo Delsing

采纳答案by eis

回答by Khez

Similar Text: The Algorithm

相似文字：算法

Similar Text: The Flaw

相似文字：缺陷

Similar Text: Inconsistent results on argument swapping

类似的文字：参数交换的结果不一致

Similar Text: JavaScript Edition

相似文字：JavaScript 版

回答by spinsch

回答by Kamesh S

相关推荐

最近更新

标签

javascript Similar_text 是如何工作的？

提问by Hugo Delsing

采纳答案by eis

回答by Khez

Similar Text: The Algorithm

相似文字：算法

Similar Text: The Flaw

相似文字：缺陷

Similar Text: Inconsistent results on argument swapping

类似的文字：参数交换的结果不一致

Similar Text: JavaScript Edition

相似文字：JavaScript 版

回答by spinsch

回答by Kamesh S

相关推荐

带有 Google JavaScript 客户端库的 Google Contacts API

javascript .charCodeAt() 的 Java 等效项

javascript remove() 函数中的 jQuery 回调

HTML5/JavaScript 音频播放列表

相关推荐

最近更新

标签