How does clustering (especially String clustering) work?

Disclaimer: the content of this page comes from a popular StackOverflow question and its answers and is provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/8196371/



Tags: string, cluster-analysis, data-mining

Asked by Renato Dinhani

I have heard about using clustering to group similar data. I want to know how it works in the specific case of Strings.


I have a table with more than 100,000 different words.


I want to identify the same word written with some differences (e.g. house, house!!, hooouse, HoUse, @house, "house", etc.).


What is needed to identify the similarity and group each word into a cluster? Which algorithm is best suited for this?


Answered by ffriend

To understand what clustering is, imagine a geographical map. You can see many distinct objects (such as houses). Some of them are close to each other, and others are far apart. Based on this, you can split all the objects into groups (such as cities). Clustering algorithms do exactly this: they allow you to split your data into groups without specifying the group borders in advance.


All clustering algorithms are based on the distance (or likelihood) between 2 objects. On a geographical map it is the ordinary distance between 2 houses; in a multidimensional space it may be the Euclidean distance (in fact, the distance between 2 houses on the map is also a Euclidean distance). For string comparison you have to use something different. Two good choices here are the Hamming and the Levenshtein distance. In your particular case the Levenshtein distance is preferable (the Hamming distance works only with strings of the same length).

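For illustration, here is a minimal pure-Python sketch of the Levenshtein distance (the classic dynamic-programming formulation); the example words are the ones from the question:

    def levenshtein(a: str, b: str) -> int:
        """Minimum number of single-character insertions, deletions and
        substitutions needed to turn string a into string b."""
        # Classic dynamic-programming formulation, O(len(a) * len(b)) time.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,        # delete ca
                                curr[j - 1] + 1,    # insert cb
                                prev[j - 1] + cost  # substitute ca -> cb
                                ))
            prev = curr
        return prev[-1]

    if __name__ == "__main__":
        for word in ['house!!', 'hooouse', 'HoUse', '@house', '"house"']:
            # Lower-casing first removes pure case differences before measuring.
            print(word, levenshtein('house', word.lower()))

Lower-casing (and stripping punctuation) before measuring already collapses several of the variants from the question, so the distance only has to absorb genuine misspellings such as hooouse.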

Now you can use one of the existing clustering algorithms. There are plenty of them, but not all of them fit your needs. For example, pure k-means, already mentioned here, will hardly help you, since it requires the number of groups to be specified up front, and with a large dictionary of strings it may be 100, 200, 500, 10000 - you just don't know the number. So other algorithms may be more appropriate.


One of them is the expectation-maximization (EM) algorithm. Its advantage is that it can find the number of clusters automatically. However, in practice it often gives less precise results than other algorithms, so it is common to use k-means on top of EM: first find the number of clusters and their centers with EM, and then use k-means to adjust the result.

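A rough sketch of the "EM first, then k-means" idea, assuming scikit-learn and a character n-gram vector representation of the words (the representation is an assumption here - strings have to be turned into numeric vectors before either algorithm can run):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.mixture import BayesianGaussianMixture  # mixture model fitted by EM
    from sklearn.cluster import KMeans

    words = ["house", "house!!", "hooouse", "HoUse", "@house",
             "mouse", "mouse!", "cat", "caat"]

    # 1. Represent each word as a character n-gram vector (an assumed choice).
    X = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3)).fit_transform(words).toarray()

    # 2. Fit a mixture by EM with a generous upper bound on the number of components.
    #    Components whose weight stays near zero are effectively unused, which gives
    #    a (heuristic) estimate of the number of clusters.
    gmm = BayesianGaussianMixture(n_components=6, covariance_type="diag",
                                  random_state=0).fit(X)
    n_clusters = max(1, int(np.sum(gmm.weights_ > 0.05)))

    # 3. Refine the grouping with plain k-means using the estimated cluster count.
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    print(n_clusters, labels)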

Another branch of algorithms that may be suitable for your task is hierarchical clustering. The result of cluster analysis in this case is not a set of independent groups, but rather a tree (hierarchy), where several smaller clusters are grouped into a bigger one, and all clusters are finally part of one big cluster. In your case it means that all words are similar to each other up to some degree.

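A small sketch of this approach with SciPy, reusing the levenshtein() helper from the earlier sketch; the cut-off distance of 2 and the single-linkage method are arbitrary choices:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    words = ["house", "house!!", "hooouse", "HoUse", "cat", "caat", "dog"]
    norm = [w.lower() for w in words]  # ignore case differences

    # Pairwise Levenshtein distances, then the condensed form SciPy expects.
    n = len(norm)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = levenshtein(norm[i], norm[j])

    # Build the hierarchy and cut it at distance 2 to obtain flat groups.
    tree = linkage(squareform(dist), method="single")
    labels = fcluster(tree, t=2, criterion="distance")
    for label, word in sorted(zip(labels, words)):
        print(label, word)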

Answered by Amit Kohli

There is a package called stringdist that allows for string comparison using several different methods; a rough illustration of a few of them follows the list below. Copy-pasting from that page:


  • Hamming distance: Number of positions at which the two strings have different symbols. Only defined for strings of equal length.
  • Levenshtein distance: Minimal number of insertions, deletions and replacements needed for transforming string a into string b.
  • (Full) Damerau-Levenshtein distance: Like Levenshtein distance, but transposition of adjacent symbols is allowed.
  • Optimal String Alignment / restricted Damerau-Levenshtein distance: Like (full) Damerau-Levenshtein distance but each substring may only be edited once.
  • Longest Common Substring distance: Minimum number of symbols that have to be removed in both strings until resulting substrings are identical.
  • q-gram distance: Sum of absolute differences between N-gram vectors of both strings.
  • Cosine distance: 1 minus the cosine similarity of both N-gram vectors.
  • Jaccard distance: 1 minus the quotient of shared N-grams and all observed N-grams.
  • Jaro distance: The Jaro distance is a formula of 4 values and effectively a special case of the Jaro-Winkler distance with p = 0.
  • Jaro-Winkler distance: This distance is a formula of 5 parameters determined by the two compared strings (A,B,m,t,l) and p chosen from [0, 0.25].
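
The package itself is an R library, but the N-gram based measures in the list are easy to illustrate. A rough pure-Python sketch of the q-gram, Jaccard and cosine distances, only meant to show the idea:

    from collections import Counter
    import math

    def qgrams(s, q=2):
        """Multiset of the overlapping q-grams (length-q substrings) of s."""
        return Counter(s[i:i + q] for i in range(len(s) - q + 1))

    def qgram_distance(a, b, q=2):
        """Sum of absolute differences between the two q-gram count vectors."""
        qa, qb = qgrams(a, q), qgrams(b, q)
        return sum(abs(qa[g] - qb[g]) for g in set(qa) | set(qb))

    def jaccard_distance(a, b, q=2):
        """1 minus the quotient of shared q-grams and all observed q-grams."""
        sa, sb = set(qgrams(a, q)), set(qgrams(b, q))
        union = sa | sb
        if not union:
            return 0.0
        return 1.0 - len(sa & sb) / len(union)

    def cosine_distance(a, b, q=2):
        """1 minus the cosine similarity of the two q-gram count vectors."""
        qa, qb = qgrams(a, q), qgrams(b, q)
        dot = sum(qa[g] * qb[g] for g in set(qa) & set(qb))
        norm = math.sqrt(sum(v * v for v in qa.values())) * \
               math.sqrt(sum(v * v for v in qb.values()))
        if norm == 0:
            return 1.0
        return 1.0 - dot / norm

    print(qgram_distance("house", "hooouse"),              # 2
          round(jaccard_distance("house", "hooouse"), 2),  # 0.2
          round(cosine_distance("house", "hooouse"), 2))   # ~0.29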

That will give you the distance. You might not need to perform a cluster analysis; perhaps sorting by the string distance itself is sufficient. I have created a script to provide the basic functionality here... feel free to improve it as needed.


Answered by Oded

You can use an algorithm like the Levenshtein distance for the distance calculation and k-means for clustering.

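Plain k-means works on numeric vectors rather than raw strings, so one simple workaround (an assumption of this sketch, not something the answer prescribes) is to describe each word by its Levenshtein distances to all the other words and cluster those rows; it reuses the levenshtein() helper from the first answer's sketch:

    import numpy as np
    from sklearn.cluster import KMeans

    words = ["house", "house!!", "hooouse", "HoUse", '"house"',
             "mouse", "mice", "cat", "caat"]
    # Cheap normalization first: lower-case and strip surrounding punctuation.
    norm = [w.lower().strip('!@"') for w in words]

    # Row i describes word i by its Levenshtein distance to every word - a crude embedding.
    D = np.array([[levenshtein(a, b) for b in norm] for a in norm])

    # The number of clusters still has to be chosen by hand, as the first answer points out.
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(D)
    for label, word in sorted(zip(labels, words)):
        print(label, word)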

the Levenshtein distance is a string metric for measuring the amount of difference between two sequences


Do some testing and find a similarity threshold per word that will decide your groups.

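A minimal sketch of that threshold idea, again reusing the levenshtein() helper from above; the threshold of 2 is arbitrary and is exactly the kind of value you would tune by testing:

    THRESHOLD = 2  # maximum Levenshtein distance for two spellings to be grouped together

    groups = []  # each group is a list of original spellings
    for word in ["house", "house!!", "hooouse", "HoUse", "@house", "mouse", "mice"]:
        w = word.lower()
        for group in groups:
            # The group's first member acts as its representative.
            if levenshtein(w, group[0].lower()) <= THRESHOLD:
                group.append(word)
                break
        else:
            groups.append([word])

    print(groups)
    # Note that "mouse" ends up in the "house" group (distance 1), which is exactly
    # why the threshold has to be tuned on your own data.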