MySQL: How to calculate the similarity between two strings in MySQL

Question

Asked by Lina

If I have two strings in MySQL:

SET @a = 'Welcome to Stack Overflow';
SET @b = 'Hello to stack overflow';

Is there a way to get the similarity percentage between those two strings using MySQL? Here, for example, 3 words are shared, so the similarity should be something like:
count(words shared by @a and @b) / (count(@a) + count(@b) - count(intersection))
which gives 3 / (4 + 4 - 3) = 0.6.
Any ideas are highly appreciated!

Answer 1

Answered by Alaa

You can use this function (adapted from http://www.artfulsoftware.com/infotree/queries.php#552):

DELIMITER $$

-- Levenshtein edit distance between s1 and s2.
-- Each row of the distance matrix is stored as a byte string, one byte per cell,
-- so inputs are limited to 255 characters.
CREATE FUNCTION `levenshtein`(s1 VARCHAR(255), s2 VARCHAR(255)) RETURNS INT
    DETERMINISTIC
BEGIN
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
    DECLARE s1_char CHAR;
    -- cv1 is the previous row of the distance matrix, cv0 the row being built
    DECLARE cv0, cv1 VARBINARY(256);
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
    IF s1 = s2 THEN
      RETURN 0;
    ELSEIF s1_len = 0 THEN
      RETURN s2_len;
    ELSEIF s2_len = 0 THEN
      RETURN s1_len;
    ELSE
      -- first row: distance from the empty string to each prefix of s2
      WHILE j <= s2_len DO
        SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
      END WHILE;
      WHILE i <= s1_len DO
        SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
        WHILE j <= s2_len DO
          -- left neighbour + 1
          SET c = c + 1;
          IF s1_char = SUBSTRING(s2, j, 1) THEN
            SET cost = 0;
          ELSE
            SET cost = 1;
          END IF;
          -- diagonal neighbour + substitution cost
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
          IF c > c_temp THEN SET c = c_temp; END IF;
          -- upper neighbour + 1
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j + 1, 1)), 16, 10) + 1;
          IF c > c_temp THEN SET c = c_temp; END IF;
          SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
        END WHILE;
        SET cv1 = cv0, i = i + 1;
      END WHILE;
    END IF;
    RETURN c;
END$$

DELIMITER ;

To get the result as a percentage (XX%), use this function:

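DELIMITER $$

-- Similarity as a percentage: 100 means identical strings.
CREATE FUNCTION `levenshtein_ratio`(s1 VARCHAR(255), s2 VARCHAR(255)) RETURNS INT
    DETERMINISTIC
BEGIN
    DECLARE s1_len, s2_len, max_len INT;
    -- CHAR_LENGTH counts characters (LENGTH counts bytes), matching levenshtein()
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2);
    IF s1_len > s2_len THEN
      SET max_len = s1_len;
    ELSE
      SET max_len = s2_len;
    END IF;
    -- two empty strings are identical; this also avoids dividing by zero
    IF max_len = 0 THEN
      RETURN 100;
    END IF;
    RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100);
END$$

DELIMITER ;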
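
As a quick check, and assuming both functions above have been created in the current schema, a call might look like the sketch below. The input words are illustrative and not from the original answer. Note that this measures character-level edits, so it will not reproduce the word-based 0.6 from the question.

```sql
-- 'kitten' -> 'sitting' needs 3 single-character edits
SELECT levenshtein('kitten', 'sitting')       AS distance,  -- 3
       levenshtein_ratio('kitten', 'sitting') AS ratio;     -- ROUND((1 - 3/7) * 100) = 57

-- The strings from the question can be compared the same way
SET @a = 'Welcome to Stack Overflow';
SET @b = 'Hello to stack overflow';
SELECT levenshtein_ratio(@a, @b) AS ratio;
```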

Answer 2

Answered by Neville Kuyt

I don't think there's a nice, single-step query way to do this - the natural language stuff is designed mostly for "google-like" search, which sounds different to what you're trying to do.

Depending on what you're actually trying to do - I assume you've left out a lot of detail - I would:

Create a table into which you split each string into words, all in lower case, stripping out spaces and punctuation. In your example, you'd end up with:

string_id               word
1                       welcome
1                       to
1                       stack
1                       overflow
2                       hello
2                       to
2                       stack
2                       overflow
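
One possible way to do the splitting inside MySQL itself is sketched below. It assumes MySQL 8.0 (for JSON_TABLE), words separated by single spaces, and no double quotes or backslashes in the input; it does not strip punctuation. The 64-character word width is an arbitrary choice for illustration.

```sql
SET @a = 'Welcome to Stack Overflow';

-- Turn 'welcome to stack overflow' into the JSON array
-- ["welcome","to","stack","overflow"], then expand it to one row per word.
SELECT 1 AS string_id, jt.word
FROM JSON_TABLE(
         CONCAT('["', REPLACE(LOWER(TRIM(@a)), ' ', '","'), '"]'),
         '$[*]' COLUMNS (word VARCHAR(64) PATH '$')
     ) AS jt;
```

On older versions, the split is usually easier to do in application code before inserting the rows.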

You can then run queries against that table - e.g.

-- DISTINCT keeps a word that is repeated within a string from being counted twice
SELECT COUNT(DISTINCT word)
FROM stringWords
WHERE string_id = 2
  AND word IN
      (SELECT word
       FROM stringWords
       WHERE string_id = 1);

This gives you the size of the intersection.

You can then create a function or similar to calculate similarity according to your formula.
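
A minimal sketch of that calculation follows, using the formula from the question: intersection / (count(@a) + count(@b) - intersection). The table definition, column sizes, and primary key are assumptions made for illustration; the primary key keeps each word unique per string so the counts behave like set sizes.

```sql
-- Hypothetical table matching the string_id / word layout
CREATE TABLE stringWords (
    string_id INT         NOT NULL,
    word      VARCHAR(64) NOT NULL,
    PRIMARY KEY (string_id, word)
);

INSERT INTO stringWords (string_id, word) VALUES
    (1, 'welcome'), (1, 'to'), (1, 'stack'), (1, 'overflow'),
    (2, 'hello'),   (2, 'to'), (2, 'stack'), (2, 'overflow');

-- a.n and b.n are the word counts of each string, i.n is the number of shared words
SELECT i.n / (a.n + b.n - i.n) AS similarity
FROM (SELECT COUNT(*) AS n FROM stringWords WHERE string_id = 1) AS a
CROSS JOIN (SELECT COUNT(*) AS n FROM stringWords WHERE string_id = 2) AS b
CROSS JOIN (SELECT COUNT(*) AS n
            FROM stringWords AS w1
            JOIN stringWords AS w2 ON w2.word = w1.word
            WHERE w1.string_id = 1 AND w2.string_id = 2) AS i;
```

With these rows the expression evaluates to 3 / (4 + 4 - 3) = 0.6, matching the figure in the question.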

Not very clean, but it should perform pretty snappily, it's mostly relational, and it should be largely language independent. To deal with possible typos, you could calculate the soundex - this would allow you to compare "stack" with "stak" and see how similar they really are, though this doesn't work reliably for languages other than English.
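
A small sketch of the soundex idea, using MySQL's built-in SOUNDEX() function and SOUNDS LIKE operator on the "stack" / "stak" pair. The two spellings are expected to map to the same code; how well this works on real data is something to verify against your own word list.

```sql
SELECT SOUNDEX('stack')                   AS code_a,
       SOUNDEX('stak')                    AS code_b,
       SOUNDEX('stack') = SOUNDEX('stak') AS same_code,   -- 1 when the codes match
       'stack' SOUNDS LIKE 'stak'         AS sounds_like; -- shorthand for the same comparison
```

In the word table approach, the same comparison could replace the exact match on word when counting the intersection.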

Answer 3

Answered by SubniC

You can try the SOUNDEX algorithm, take a look here :)

SOUNDEX MySQL

EDIT 1:

Maybe this link about natural language processing with MySQL could be useful

Natural Language Full-Text Searches

How to find similar results and sort by similarity?

HTH!

Answer 4

Answered by DhruvPathak

This might help if you do not want to write your own algorithms:

http://dev.mysql.com/doc/refman/5.0/en/fulltext-natural-language.html
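
For reference, a natural language full-text query might look like the sketch below. The table and column names are hypothetical, and a FULLTEXT index on an InnoDB table assumes MySQL 5.6 or later. The score is a relevance weight rather than a percentage, so it ranks rows by similarity but does not give the 0.6 style ratio from the question.

```sql
-- Hypothetical table with a FULLTEXT index on the text column
CREATE TABLE phrases (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    body TEXT NOT NULL,
    FULLTEXT KEY ft_body (body)
) ENGINE = InnoDB;

INSERT INTO phrases (body) VALUES
    ('Welcome to Stack Overflow'),
    ('An unrelated sentence about databases');

-- Rows with no matching words get a score of 0
SELECT id, body,
       MATCH(body) AGAINST('Hello to stack overflow' IN NATURAL LANGUAGE MODE) AS score
FROM phrases
ORDER BY score DESC;
```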
