string 在大量字符串中查找相似字符串组

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/3329297/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-09 00:48:52  来源:igfitidea点击:

Finding groups of similar strings in a large set of strings

algorithmstringdesign-patterns

提问by latentflip

I have a reasonably large set of strings (say 100) which has a number of subgroups characterised by their similarity. I am trying to find/design an algorithm which would find theses groups reasonably efficiently.

我有一组相当大的字符串(比如 100 个),其中包含许多以相似性为特征的子组。我正在尝试找到/设计一种算法,该算法可以合理有效地找到这些组。

As an example let's say the input list is on the left below, and the output groups are on the right.

例如,假设输入列表位于下方左侧,而输出组位于右侧。

Input                           Output
-----------------               -----------------
Jane Doe                        Mr Philip Roberts
Mr Philip Roberts               Phil Roberts     
Foo McBar                       Philip Roberts   
David Jones                     
Phil Roberts                    Foo McBar        
Davey Jones            =>         
John Smith                      David Jones      
Philip Roberts                  Dave Jones       
Dave Jones                      Davey Jones      
Jonny Smith                     
                                Jane Doe         

                                John Smith       
                                Jonny Smith 

Does anybody know of any ways to solve this reasonably efficiently?

有人知道有什么方法可以合理有效地解决这个问题吗?

The standard method for finding similar strings seems to be the Levenshtein distance, but I can't see how I can make good use of that here without having to compare every string to every other string in the list, and then somehow decide on a difference threshold for deciding if the two strings are in the same group or not.

查找相似字符串的标准方法似乎是 Levenshtein 距离,但我不知道如何在这里充分利用它,而不必将每个字符串与列表中的每个其他字符串进行比较,然后以某种方式决定差异决定两个字符串是否在同一组中的阈值。

An alternative would be an algorithm that hashes strings down to an integer, where similar strings hash to integers which are close together on the number-line. I have no idea what algorithm that would be though, if one even exists

另一种方法是将字符串散列到整数的算法,其中类似的字符串散列到在数行上靠得很近的整数。我不知道那会是什么算法,如果存在的话

Does anybody have any thoughts/pointers?

有没有人有任何想法/指示?



UPDATE: @Will A: Perhaps names weren't as good an example as I first thought. As a starting point I think I can assume that in the data I will be working with, a small change in a string will not make it jump from one group to another.

更新:@Will A:也许名字不像我最初想的那样是一个很好的例子。作为起点,我想我可以假设在我将使用的数据中,字符串中的微小变化不会使其从一组跳到另一组。

采纳答案by Nordic Mainframe

Another popular method is to associate the strings by their Jaccard index. Start with http://en.wikipedia.org/wiki/Jaccard_index.

另一种流行的方法是通过它们的 Jaccard 索引关联字符串。从http://en.wikipedia.org/wiki/Jaccard_index开始。

Here's a article about using the Jaccard-index (and a couple of other methods) to solve a problem like yours:

这是一篇关于使用 Jaccard-index(以及其他几种方法)来解决像您这样的问题的文章:

http://matpalm.com/resemblance/

http://matpalm.com/resemblance/

回答by Roman

The problem you're trying to solve is a typical clusterizationproblem.

您要解决的问题是典型的聚类问题。

Start with simple K-Meansalgorithm and use Levenshtein distance as a function for calculating distance between elements and clusters centers.

从简单的K-Means算法开始,并使用 Levenshtein 距离作为计算元素和簇中心之间距离的函数。

BTW, algorithm for Levenshtein distance calculation is implemented in Apache Commons StringUtils - StringUtils.getLevenshteinDistance

顺便说一句,Levenshtein 距离计算的算法是在 Apache Commons StringUtils 中实现的 - StringUtils.getLevenshteinDistance

The main problem of K-Means is that you should specify the number of clusters (subgroups in your terms). So, you'll have 2 options: improve K-Means with some euristic or use another clusterization algorithm which doesn't require specifying clusters number (but that algorithm can show worse performance and can be very difficult in implemenation if you decide to implement it yourself).

K-Means 的主要问题是您应该指定集群的数量(在您的术语中是子组)。因此,您将有 2 个选择:使用一些 euristic 改进 K-Means 或使用另一种不需要指定簇数的聚类算法(但该算法可能会表现出更差的性能,并且如果您决定实施它,实施起来可能会非常困难)你自己)。

回答by Wrikken

If we're talking about actual pronouncable words, comparing the (start of) their metaphonemight be of assistance:

如果我们谈论的实际pronouncable的话,比较(开始)的变音可能会有所帮助:

MRFLPRBRTS: Mr Philip Roberts
FLRBRTS: Phil Roberts   
FLPRBRTS: Philip Roberts 
FMKBR: Foo McBar      
TFTJNS: David Jones    
TFJNS: Dave Jones     
TFJNS: Davey Jones    
JNT: Jane Doe       
JNSM0: John Smith     
JNSM0: Jonny Smith

回答by Will A

For the example you give, I reckon Levenshtein distance would be unsuitable as "Bonny Smith" would be 'very similar' to "Jonny Smith" and would almost certainly end up being considered in the same class.

对于您给出的示例,我认为 Levenshtein 距离是不合适的,因为“Bonny Smith”与“Jonny Smith”“非常相似”,并且几乎肯定最终会被考虑在同一班级中。

I think you need to approach this (if working with names) from the point-of-view of certain names having synonyms (e.g. "John", "Jon", "Jonny", "Johnny" etc.) and matching based on these.

我认为您需要从某些具有同义词(例如“John”、“Jon”、“Jonny”、“Johnny”等)的名称的角度来解决这个问题(如果使用名称)并基于这些进行匹配.

回答by Sibok666

I have solved a problem like that, first of all I normalized the text, and get out of the string words without value to the entire string, like InC. OF USA ...

我已经解决了这样的问题,首先我对文本进行了规范化,然后将没有值的字符串词移到整个字符串中,例如 InC。美国...

This unvalue words have to be defined by you.

这个不值的词必须由你定义。

After normalize I run a inspection by names using Jaro Winkler distance, and I grouped the results in an object with a list of similar objects.

标准化后,我使用 Jaro Winkler 距离按名称运行检查,并将结果分组到一个对象中,其中包含一系列相似对象。

It was really good.

真的很好。

I ran this in java with 30K people names

我用 30K 人名在 java 中运行了这个

I hope this idea is going to be useful to someone

我希望这个想法对某人有用

回答by mob

There is a solution for this exact problem documented in an open source java library for fuzzy matching https://github.com/intuit/fuzzy-matcher

在用于模糊匹配的开源 java 库中记录了这个确切问题的解决方案 https://github.com/intuit/fuzzy-matcher

The idea used there is to break down the names in words (tokens) and use text matching algorithm to find similarity in words (like Soundex, Jaccard or Lavenshtiein).

那里使用的想法是分解单词(标记)中的名称并使用文本匹配算法来查找单词(如 Soundex、Jaccard 或 Lavenshtiein)中的相似性。

Then use the score found from each word and average the score for each name.

然后使用从每个单词中找到的分数并平均每个名称的分数。

Performance for such matching is fairly critical, since if we keep matching every name with every other this becomes an exponentially growing complexity.

这种匹配的性能相当关键,因为如果我们不断地将每个名称相互匹配,这将成为指数级增长的复杂性。

This library relies on equivalence and transitive property of the match algorithm Where if "David" matches with "Davey" then the reverse match is implied, and you don't have to run those matches

该库依赖于匹配算法的等价性和传递属性,如果“大卫”与“戴维”匹配,则隐含反向匹配,您不必运行这些匹配

This library has a few more tricks to reduce the match complexity, and I was able to run a match against 4000 names in around 2 secs.

这个库还有一些降低匹配复杂性的技巧,我能够在大约 2 秒内对 4000 个名字进行匹配。

回答by user1884677

Here is SQL code for a Levenshtein function:

以下是 Levenshtein 函数的 SQL 代码:

CREATE FUNCTION [Levenshtein](@str_1 nvarchar(4000), @str_2 nvarchar(4000))
RETURNS int
AS

BEGIN
 DECLARE    @str_1_len int
        ,   @str_2_len int
        ,   @str_1_itr int
        ,   @str_2_itr int
        ,   @str_1_char nchar
        ,   @Levenshtein int
        ,   @LD_temp int
        ,   @cv0 varbinary(8000)
        ,   @cv1 varbinary(8000)

SELECT  @str_1_len = LEN(@str_1)
    ,   @str_2_len = LEN(@str_2)
    ,   @cv1 = 0x0000
    ,   @str_2_itr = 1
    ,   @str_1_itr = 1
    ,   @Levenshtein = 0


WHILE @str_2_itr <= @str_2_len

SELECT  @cv1 = @cv1 + CAST(@str_2_itr AS binary(2))
    ,   @str_2_itr = @str_2_itr + 1

WHILE @str_1_itr <= @str_1_len
BEGIN
    SELECT  @str_1_char = SUBSTRING(@str_1, @str_1_itr, 1)
        ,   @Levenshtein = @str_1_itr
        ,   @cv0 = CAST(@str_1_itr AS binary(2))
        ,   @str_2_itr = 1

    WHILE @str_2_itr <= @str_2_len
    BEGIN
        SET @Levenshtein = @Levenshtein + 1
        SET @LD_temp = CAST(SUBSTRING(@cv1, @str_2_itr+@str_2_itr-1, 2) AS int) +
        CASE WHEN @str_1_char = SUBSTRING(@str_2, @str_2_itr, 1) THEN 0 ELSE 1 END
        IF @Levenshtein > @LD_temp SET @Levenshtein = @LD_temp
        SET @LD_temp = CAST(SUBSTRING(@cv1, @str_2_itr+@str_2_itr+1, 2) AS int)+1
        IF @Levenshtein > @LD_temp SET @Levenshtein = @LD_temp
        SELECT @cv0 = @cv0 + CAST(@Levenshtein AS binary(2)), @str_2_itr = @str_2_itr + 1
    END

SELECT @cv1 = @cv0, @str_1_itr = @str_1_itr + 1
END

RETURN @Levenshtein
END