Python: Clustering Based on a Distance Matrix

Notice: this page is adapted from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/16246066/

Time: 2020-08-18 22:06:23  Source: igfitidea

Clustering Based On Distance Matrix

python cluster-computing scikit-learn hierarchical-clustering

Asked by user2115183

My objective is to cluster words based on how similar they are with respect to a corpus of text documents. I have computed the Jaccard similarity between every pair of words, so I have a sparse distance matrix available. Can anyone point me to a clustering algorithm (and possibly a Python library for it) that takes a distance matrix as input? I also do not know the number of clusters beforehand. I only want to cluster these words and find out which words end up grouped together.

Answered by Bastiaan van den Berg

The scipy clustering package (scipy.cluster) could be useful. There are hierarchical clustering functions in scipy.cluster.hierarchy. Note, however, that those require a condensed distance matrix as input (the upper triangle of the distance matrix, flattened into a vector). Hopefully the documentation pages will help you along.
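
As a rough sketch of the scipy route: the words, similarity values, the "average" linkage method and the 0.5 cut-off below are all made up for illustration, and the matrix is assumed to be small enough to hold as a dense array. Jaccard similarity is turned into a distance as 1 - similarity, then squareform produces the condensed form that linkage expects.

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data: pairwise Jaccard similarities between four words.
words = ["cat", "dog", "car", "bus"]
similarity = np.array([
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.7],
    [0.0, 0.1, 0.7, 1.0],
])

# Jaccard distance is 1 - Jaccard similarity.
distance = 1.0 - similarity
np.fill_diagonal(distance, 0.0)  # make sure self-distances are exactly zero

# Convert the square symmetric matrix to the condensed (upper-triangle) vector.
condensed = squareform(distance)

# Build the hierarchy, then cut it at a distance threshold instead of
# fixing the number of clusters in advance (0.5 is an arbitrary choice).
Z = linkage(condensed, method="average")
labels = fcluster(Z, t=0.5, criterion="distance")

# Group the words by cluster label.
clusters = {}
for word, label in zip(words, labels):
    clusters.setdefault(int(label), []).append(word)
print(clusters)
```

Cutting the tree by distance rather than by cluster count fits the case where the number of clusters is not known beforehand, but the threshold itself still has to be chosen by inspecting the data.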

Answered by Andreas Mueller

You can use most algorithms in scikit-learn with a precomputed distance matrix. Unfortunately, many of the algorithms need the number of clusters. DBSCAN is the only one that doesn't need the number of clusters and also accepts arbitrary distance matrices. You could also try MeanShift, but that will interpret the distances as coordinates, which might also work.
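
A minimal sketch of the DBSCAN suggestion, assuming a small dense matrix of Jaccard distances. The words, distance values, eps and min_samples are invented for illustration and would need tuning on real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: pairwise Jaccard distances (1 - similarity) for five words.
words = ["cat", "dog", "car", "bus", "the"]
distance = np.array([
    [0.00, 0.20, 0.90, 1.00, 0.95],
    [0.20, 0.00, 0.80, 0.90, 0.90],
    [0.90, 0.80, 0.00, 0.30, 0.90],
    [1.00, 0.90, 0.30, 0.00, 0.95],
    [0.95, 0.90, 0.90, 0.95, 0.00],
])

# metric="precomputed" tells DBSCAN that the input is already a distance matrix.
# eps is the neighbourhood radius; min_samples counts the point itself.
db = DBSCAN(eps=0.4, min_samples=2, metric="precomputed")
labels = db.fit_predict(distance)

# A label of -1 marks a point that DBSCAN treats as noise.
for word, label in zip(words, labels):
    print(word, label)
```

No cluster count is passed in; the number of clusters falls out of eps and min_samples. Words that are far from everything else, like "the" in this toy matrix, are left unclustered as noise.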

There is also affinity propagation, but I haven't really seen that working well. If you want many clusters, that might be helpful, though.
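
For completeness, a hedged sketch of affinity propagation on a precomputed matrix. One difference from the distance-based options: with affinity="precomputed" scikit-learn expects similarities rather than distances, so the Jaccard similarities can be passed in directly. The words and values are invented, and random_state assumes a reasonably recent scikit-learn release.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Hypothetical data: pairwise Jaccard similarities between four words.
words = ["cat", "dog", "car", "bus"]
similarity = np.array([
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.7],
    [0.0, 0.1, 0.7, 1.0],
])

# affinity="precomputed" means the input is a similarity matrix
# (larger values = more alike), not a distance matrix.
ap = AffinityPropagation(affinity="precomputed", random_state=0)
labels = ap.fit_predict(similarity)

for word, label in zip(words, labels):
    print(word, label)
```

The number of clusters is controlled indirectly through the preference parameter (left at its default here), which is one reason the results can be hard to steer.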

Disclosure: I'm a scikit-learn core developer.

Answered by Jason Hu

I recommend taking a look at agglomerative clustering.
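
The answer does not name a library, so the following is only one possible reading: a sketch using scikit-learn's AgglomerativeClustering with a precomputed distance matrix and a distance threshold in place of a fixed cluster count. The data, the "average" linkage and the 0.5 threshold are assumptions for illustration. The keyword for the distance type was renamed from affinity to metric in newer scikit-learn releases, so the sketch checks which one is available.

```python
import inspect
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical data: pairwise Jaccard distances (1 - similarity) for four words.
words = ["cat", "dog", "car", "bus"]
distance = np.array([
    [0.0, 0.2, 0.9, 1.0],
    [0.2, 0.0, 0.8, 0.9],
    [0.9, 0.8, 0.0, 0.3],
    [1.0, 0.9, 0.3, 0.0],
])

# Newer scikit-learn releases call this keyword `metric`; older ones use `affinity`.
params = inspect.signature(AgglomerativeClustering).parameters
metric_kw = "metric" if "metric" in params else "affinity"

# n_clusters=None plus distance_threshold stops merging once clusters are
# farther apart than the threshold. Ward linkage does not accept precomputed
# distances, so "average" is used here.
model = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=0.5,
    linkage="average",
    **{metric_kw: "precomputed"},
)
labels = model.fit_predict(distance)

for word, label in zip(words, labels):
    print(word, label)
```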
