Disclaimer: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverFlow original: http://stackoverflow.com/questions/634995/

Time: 2020-08-31 12:54:02  Source: igfitidea

Implementation of Levenshtein distance for mysql/fuzzy search?

Tags: mysql, database, algorithm, search, levenshtein-distance

Asked by Andrew Clark

I would like to be able to search a table like the following for "smith" and get everything that is within 1 variance (that is, within an edit distance of 1).

Data:

O'Brien
Smithe
Dolan
Smuth
Wong
Smoth
Gunther
Smiht

I have looked into using Levenshtein distance. Does anyone know how to implement this with it?

Answered by Nick Johnson

In order to search efficiently using Levenshtein distance, you need an efficient, specialised index, such as a BK-tree. Unfortunately, no database system I know of, including MySQL, implements BK-tree indexes. This is further complicated if you're looking for full-text search instead of just a single term per row. Offhand, I can't think of any way to do full-text indexing in a manner that allows for searching based on Levenshtein distance.

Answered by Hongzheng

There is a MySQL UDF implementation of the Levenshtein distance function:

https://github.com/jmcejuela/Levenshtein-MySQL-UDF

It is implemented in C and has better performance than the "MySQL Levenshtein distance query" mentioned by schnaader.

Answered by Ponk

An implementation of the Damerau-Levenshtein distance can be found here: Damerau-Levenshtein algorithm: Levenshtein with transpositions. The improvement over pure Levenshtein distance is that the swapping of adjacent characters is considered. I found it in the comments of schnaader's link, thanks!

Answered by stopthe

The lv_leq_1 function given for Levenshtein <= 1 in AbcAeffchen's answer is not right: it gives incorrect results for, e.g., "bed" and "bid".

I modified the "MySQL Levenshtein distance query" given above, in the first answer, to accept a "limit" that will speed it up a little. Basically, if you only care about Levenshtein <= 1, set the limit to "2" and the function will return the exact Levenshtein distance if it is 0 or 1, or a value of 2 or more if the exact Levenshtein distance is 2 or greater. In that second case the returned value is only a lower bound on the distance, so compare it against the limit rather than testing for exactly 2.

This mod makes it 15% to 50% faster. The longer your search word, the bigger the advantage, because the algorithm can bail out earlier. For instance, on a search against 200,000 words to find all matches within distance 1 of the word "giggle", the original takes 3 min 47 sec on my laptop, versus 1 min 39 sec for the "limit" version. Of course, these are both too slow for any real-time use.

Code:

DELIMITER $$
CREATE FUNCTION levenshtein_limit_n( s1 VARCHAR(255), s2 VARCHAR(255), n INT)
  RETURNS INT
  DETERMINISTIC
  BEGIN
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost, c_min INT;
    DECLARE s1_char CHAR;
    -- max strlen=255
    -- cv1 is the previous matrix row and cv0 the row being built, one byte per cell
    DECLARE cv0, cv1 VARBINARY(256);
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0, c_min = 0;
    IF s1 = s2 THEN
      RETURN 0;
    ELSEIF s1_len = 0 THEN
      RETURN s2_len;
    ELSEIF s2_len = 0 THEN
      RETURN s1_len;
    ELSE
      -- first row of the matrix: 0, 1, 2, ..., s2_len
      WHILE j <= s2_len DO
        SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
      END WHILE;
      WHILE i <= s1_len AND c_min < n DO -- if actual levenshtein dist >= limit, don't bother computing it
        SET s1_char = SUBSTRING(s1, i, 1), c = i, c_min = i, cv0 = UNHEX(HEX(i)), j = 1;
        WHILE j <= s2_len DO
          SET c = c + 1; -- cost via the cell to the left
          IF s1_char = SUBSTRING(s2, j, 1) THEN
            SET cost = 0;
          ELSE
            SET cost = 1;
          END IF;
          -- cost via the diagonal cell (match or substitution)
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
          IF c > c_temp THEN SET c = c_temp; END IF;
          -- cost via the cell above
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
          IF c > c_temp THEN
            SET c = c_temp;
          END IF;
          SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
          -- track the smallest value in the current row
          IF c < c_min THEN
            SET c_min = c;
          END IF;
        END WHILE;
        SET cv1 = cv0, i = i + 1;
      END WHILE;
    END IF;
    IF i <= s1_len THEN -- we didn't finish, limit exceeded
      SET c = c_min; -- actual distance is >= c_min (i.e., the smallest value in the last computed row of the matrix)
    END IF;
    RETURN c;
  END$$
DELIMITER ;
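
As a rough usage sketch, the function can be called from a WHERE clause. The table name demo_names and its column name are made up for illustration (they are not from the question); the rows are the sample data from the question, and the sketch assumes levenshtein_limit_n has already been created as shown above.

```sql
-- Hypothetical table (the name is an assumption) filled with the question's sample data.
CREATE TEMPORARY TABLE demo_names (name VARCHAR(255));
INSERT INTO demo_names (name) VALUES
  ('O''Brien'), ('Smithe'), ('Dolan'), ('Smuth'),
  ('Wong'), ('Smoth'), ('Gunther'), ('Smiht');

-- Limit 2: the function returns 0 or 1 exactly, and some value >= 2 otherwise,
-- so the filter compares with <= 1 instead of testing for a specific value.
SELECT name
FROM demo_names
WHERE levenshtein_limit_n(LOWER(name), 'smith', 2) <= 1;
```

With this data the query should keep Smithe, Smuth and Smoth. Smiht is dropped because swapping two adjacent letters counts as two edits under plain Levenshtein distance.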

Answered by Alaa

You can use this function:

DELIMITER $$
CREATE FUNCTION `levenshtein`( s1 text, s2 text) RETURNS int(11)
    DETERMINISTIC
BEGIN
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
    DECLARE s1_char CHAR;
    -- cv1 is the previous matrix row and cv0 the row being built.
    -- Each cell is stored as one byte, so inputs longer than 255 characters are not handled correctly.
    DECLARE cv0, cv1 text;
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
    IF s1 = s2 THEN
      RETURN 0;
    ELSEIF s1_len = 0 THEN
      RETURN s2_len;
    ELSEIF s2_len = 0 THEN
      RETURN s1_len;
    ELSE
      -- first row of the matrix: 0, 1, 2, ..., s2_len
      WHILE j <= s2_len DO
        SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
      END WHILE;
      WHILE i <= s1_len DO
        SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
        WHILE j <= s2_len DO
          SET c = c + 1; -- cost via the cell to the left
          IF s1_char = SUBSTRING(s2, j, 1) THEN
            SET cost = 0;
          ELSE
            SET cost = 1;
          END IF;
          -- cost via the diagonal cell (match or substitution)
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
          IF c > c_temp THEN SET c = c_temp; END IF;
          -- cost via the cell above
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
          IF c > c_temp THEN
            SET c = c_temp;
          END IF;
          SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
        END WHILE;
        SET cv1 = cv0, i = i + 1;
      END WHILE;
    END IF;
    RETURN c;
  END$$
DELIMITER ;

And to get the result as a percentage (XX%), use this function:

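DELIMITER $$
CREATE FUNCTION `levenshtein_ratio`( s1 text, s2 text ) RETURNS int(11)
    DETERMINISTIC
BEGIN
    DECLARE s1_len, s2_len, max_len INT;
    -- CHAR_LENGTH (characters) rather than LENGTH (bytes), to match what levenshtein() counts
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2);
    IF s1_len > s2_len THEN
      SET max_len = s1_len;
    ELSE
      SET max_len = s2_len;
    END IF;
    -- 100 means identical, lower values mean more edits relative to the longer string
    RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100);
  END$$
DELIMITER ;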
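
A quick way to sanity-check the two functions is to call them on a pair of short strings. This is only a sketch and assumes both levenshtein and levenshtein_ratio from this answer have already been created; the column aliases are arbitrary.

```sql
-- Assumes levenshtein() and levenshtein_ratio() from this answer already exist.
SELECT levenshtein('smith', 'smithe')       AS distance,    -- one inserted character
       levenshtein_ratio('smith', 'smithe') AS similarity;  -- ROUND((1 - 1/6) * 100)
```

This should report a distance of 1 and a similarity of 83, since the longer string has 6 characters.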

Answered by AbcAeffchen

If you only want to know if the levenshtein-distance is at most 1, you can use the following MySQL function.

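DELIMITER $$
CREATE FUNCTION `lv_leq_1` (
`s1` VARCHAR( 255 ) ,
`s2` VARCHAR( 255 )
) RETURNS TINYINT( 1 ) DETERMINISTIC
BEGIN
    DECLARE s1_len, s2_len, i INT;
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), i = 1;
    IF s1 = s2 THEN
        RETURN TRUE;
    ELSEIF ABS(s1_len - s2_len) > 1 THEN
        -- lengths differ by more than one character, so the distance is at least 2
        RETURN FALSE;
    ELSE
        -- walk backwards while the characters match
        -- Caveat: i starts at 1, so the last characters are never compared;
        -- this is why "bed" / "bid" comes out as FALSE (see stopthe's answer).
        WHILE SUBSTRING(s1,s1_len - i,1) = SUBSTRING(s2,s2_len - i,1) DO
            SET i = i + 1;
        END WHILE;
        -- compare the remaining prefixes: same length, or one character longer on either side
        RETURN SUBSTRING(s1,1,s1_len-i) = SUBSTRING(s2,1,s2_len-i)
            OR SUBSTRING(s1,1,s1_len-i) = SUBSTRING(s2,1,s2_len-i+1)
            OR SUBSTRING(s1,1,s1_len-i+1) = SUBSTRING(s2,1,s2_len-i);
    END IF;
END$$
DELIMITER ;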

This is basically a single step in the recursive description of the Levenshtein distance. The function returns 1 if the distance is at most 1; otherwise it returns 0.

Since this function does not completely compute the levenshtein-distance, it is much faster.

You can also modify this function so that it returns true if the Levenshtein distance is at most 2 or 3, by having it call itself recursively. If MySQL does not support recursive calls (stored functions in MySQL cannot be recursive), you can make two slightly modified copies of this function and call them instead. But you should not use the recursive function to calculate the exact Levenshtein distance.
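
For comparison, here is a minimal sketch of the same "at most 1" idea that starts the comparison at the last character and bounds the loop by the string lengths. The name lv_leq_1_suffix is made up for this sketch; it is one possible way to write the check, not the answer author's code. It strips the common suffix and then checks whether what is left differs by a single substitution, insertion or deletion at its end. Comparisons follow the collation of the arguments, so case sensitivity depends on your setup.

```sql
DELIMITER $$
-- Hypothetical name; returns 1 if the Levenshtein distance is at most 1, else 0.
CREATE FUNCTION lv_leq_1_suffix( s1 VARCHAR(255), s2 VARCHAR(255) )
  RETURNS TINYINT(1)
  DETERMINISTIC
BEGIN
    DECLARE s1_len, s2_len, i INT;
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), i = 0;
    IF s1 = s2 THEN
        RETURN TRUE;
    ELSEIF ABS(s1_len - s2_len) > 1 THEN
        RETURN FALSE;
    END IF;
    -- i counts the characters of the common suffix, starting at the last character
    WHILE i < s1_len AND i < s2_len
          AND SUBSTRING(s1, s1_len - i, 1) = SUBSTRING(s2, s2_len - i, 1) DO
        SET i = i + 1;
    END WHILE;
    -- lengths of what is left once the common suffix is removed
    SET s1_len = s1_len - i, s2_len = s2_len - i;
    IF s1_len = s2_len THEN
        -- one substitution: everything before the last remaining character must match
        RETURN LEFT(s1, s1_len - 1) = LEFT(s2, s2_len - 1);
    ELSEIF s1_len < s2_len THEN
        -- s2 has one extra character at the end of its remainder
        RETURN LEFT(s1, s1_len) = LEFT(s2, s1_len);
    ELSE
        -- s1 has one extra character at the end of its remainder
        RETURN LEFT(s1, s2_len) = LEFT(s2, s2_len);
    END IF;
END$$
DELIMITER ;

SELECT lv_leq_1_suffix('bed', 'bid')       AS one_substitution,  -- distance 1
       lv_leq_1_suffix('smith', 'smithe')  AS one_insertion,     -- distance 1
       lv_leq_1_suffix('smith', 'smiht')   AS transposition;     -- distance 2
```

The final SELECT should give 1, 1 and 0 for the three pairs.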

Answered by AbcAeffchen

I am setting up a search based on Levenshtein or Damerau-Levenshtein (probably the latter) for multiple searches over an indexed text, based on a paper by Gonzalo Navarro and Ricardo Baeza-Yates: link text

After building a suffix array (see Wikipedia), if you are interested in a string with at most k mismatches to the search string, break the search string into k+1 pieces; at least one of those must be intact. Find the substrings by binary search over the suffix array, then apply the distance function to the patch around each matched piece.
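
The k+1 pieces idea can be tried in plain SQL without building a suffix array. The sketch below is not the suffix-array method from the paper: it swaps the binary search for LIKE scans that pick out candidates containing at least one intact piece, and then applies a distance function to those candidates only. The table demo_pieces and its column are hypothetical, the data is the sample from the question, and levenshtein_limit_n is assumed to be the function from stopthe's answer.

```sql
-- Hypothetical table (the name is an assumption) with the question's sample data.
CREATE TEMPORARY TABLE demo_pieces (name VARCHAR(255));
INSERT INTO demo_pieces (name) VALUES
  ('O''Brien'), ('Smithe'), ('Dolan'), ('Smuth'),
  ('Wong'), ('Smoth'), ('Gunther'), ('Smiht');

-- k = 1, so 'smith' is split into k + 1 = 2 pieces: 'smi' and 'th'.
-- A single edit can damage only one piece, so every string within
-- Levenshtein distance 1 still contains at least one of them.
SELECT name
FROM demo_pieces
WHERE (LOWER(name) LIKE '%smi%' OR LOWER(name) LIKE '%th%')  -- candidate filter
  AND levenshtein_limit_n(LOWER(name), 'smith', 2) <= 1;     -- exact check
```

On this data the LIKE filter leaves five candidates (Smithe, Smuth, Smoth, Gunther, Smiht) and the distance check keeps Smithe, Smuth and Smoth. A leading-wildcard LIKE still scans the table, so this only cuts down the number of distance computations; the argument also relies on plain Levenshtein edits, since a transposition can straddle two pieces.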

Answered by greg

I had a specialized case of k-distance searching and after installing the Damerau-Levenshtein UDF in MySQL found that the query was taking too long. I came up with the following solution:

  • I have a very restrictive search space (a 9-character string limited to numeric values).

Create a new table (or append columns to your target table) with columns for each character position in your target field, i.e., my VARCHAR(9) ended up as 9 TINYINT columns + 1 Id column that matches my main table (add indexes for each column). I added triggers to ensure that these new columns always get updated when my main table gets updated.

To perform a k-distance query use the following predicate:

(Column1=s[0]) + (Column2=s[1]) + (Column3=s[2]) + (Column4=s[3]) + ... >= m

where s is your search string and m is the required number of matching characters (or m = 9 - d in my case where d is the maximum distance I want returned).
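
The predicate above can be sketched end to end on a tiny table. Everything named here (demo_digits, c1 to c9, the three sample rows) is made up for illustration and is not the author's actual schema; the triggers that keep the side table in sync are left out.

```sql
-- Hypothetical side table (all names are assumptions): one TINYINT per
-- character position of a 9-digit code, plus the id of the main-table row.
CREATE TEMPORARY TABLE demo_digits (
  id INT PRIMARY KEY,
  c1 TINYINT, c2 TINYINT, c3 TINYINT, c4 TINYINT, c5 TINYINT,
  c6 TINYINT, c7 TINYINT, c8 TINYINT, c9 TINYINT,
  KEY (c1), KEY (c2), KEY (c3), KEY (c4), KEY (c5),
  KEY (c6), KEY (c7), KEY (c8), KEY (c9)
);
INSERT INTO demo_digits VALUES
  (1, 1,2,3,4,5,6,7,8,9),   -- '123456789'
  (2, 1,2,3,4,5,6,7,8,0),   -- '123456780', one position differs
  (3, 9,8,7,6,5,4,3,2,1);   -- '987654321', only the middle digit matches

-- s = '123456789' and d = 1, so m = 9 - 1 = 8 matching positions are required.
-- Each comparison evaluates to 1 or 0, so the sum counts matching positions.
SELECT id
FROM demo_digits
WHERE (c1 = 1) + (c2 = 2) + (c3 = 3) + (c4 = 4) + (c5 = 5)
    + (c6 = 6) + (c7 = 7) + (c8 = 8) + (c9 = 9) >= 8;
```

For these rows the query should return ids 1 and 2.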

After testing, I found that a query over 1 million rows that had been taking 4.6 seconds on average now returned matching ids in less than a second. A second query to return the data for the matching rows in my main table similarly took under a second. (Combining these two queries as a subquery or join resulted in significantly longer execution times and I'm not sure why.)

Though this is not Damerau-Levenshtein (it compares position by position, so it does not account for insertions, deletions, or transpositions), it suffices for my purposes.

Though this solution probably doesn't scale well for a larger (length) search space, it worked very well for this restrictive case.

Answered by D. Savina

Based on Chella's answer and Ryan Ginstrom's article, a fuzzy search could be implemented like so:

DELIMITER $$
CREATE FUNCTION fuzzy_substring( s1 VARCHAR(255), s2 VARCHAR(255) )
    RETURNS INT
    DETERMINISTIC
BEGIN
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
    DECLARE s1_char CHAR;
    -- max strlen=255
    -- cv1 is the previous matrix row and cv0 the row being built, one byte per cell
    DECLARE cv0, cv1 VARBINARY(256);
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
    IF s1 = s2 THEN
        RETURN 0;
    ELSEIF s1_len = 0 THEN
        RETURN s2_len;
    ELSEIF s2_len = 0 THEN
        RETURN s1_len;
    ELSE
        -- first row is all zeros, so a match of s1 may start at any position in s2
        WHILE j <= s2_len DO
            SET cv1 = CONCAT(cv1, UNHEX(HEX(0))), j = j + 1;
        END WHILE;
        WHILE i <= s1_len DO
            SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
            WHILE j <= s2_len DO
                SET c = c + 1; -- cost via the cell to the left
                IF s1_char = SUBSTRING(s2, j, 1) THEN
                    SET cost = 0;
                ELSE
                    SET cost = 1;
                END IF;
                -- cost via the diagonal cell (match or substitution)
                SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
                IF c > c_temp THEN SET c = c_temp; END IF;
                -- cost via the cell above
                SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
                IF c > c_temp THEN
                    SET c = c_temp;
                END IF;
                SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
            END WHILE;
            SET cv1 = cv0, i = i + 1;
        END WHILE;
    END IF;
    -- c holds the last cell of the final row; take the minimum over the rest of
    -- that row, so a match of s1 may also end at any position in s2
    SET j = 1;
    WHILE j <= s2_len DO
        SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10);
        IF c > c_temp THEN
            SET c = c_temp;
        END IF;
        SET j = j + 1;
    END WHILE;
    RETURN c;
END$$
DELIMITER ;
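
As a small usage sketch, assuming fuzzy_substring has been created as above: the first argument is the search term and the second is the text to search in, and the result is the smallest edit distance between the term and any substring of the text. The sample strings and the alias are made up for illustration.

```sql
-- Assumes fuzzy_substring() from this answer already exists.
-- 'smyth' is the closest substring to 'smith' (one substituted letter).
SELECT fuzzy_substring('smith', 'john smyth jr') AS best_distance;
```

This should return 1. In a WHERE clause the same call can serve as a filter, for example by keeping rows where the result is at most some threshold.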