Python scipy.cluster.hierarchy tutorial

Question

Asked by user2988577

I'm trying to understand how to work with a hierarchical clustering, but the documentation is too ... technical? ... and I can't understand how it works.

Is there a tutorial that can help me get started, explaining some simple tasks step by step?

Let's say I have the following data set:

import numpy as np

a = np.array([[0,   0  ],
              [1,   0  ],
              [0,   1  ],
              [1,   1  ], 
              [0.5, 0  ],
              [0,   0.5],
              [0.5, 0.5],
              [2,   2  ],
              [2,   3  ],
              [3,   2  ],
              [3,   3  ]])

I can easily compute the hierarchical clustering and plot the dendrogram:

from scipy.cluster.hierarchy import linkage, dendrogram
z = linkage(a)
d = dendrogram(z)

Now, how can I recover a specific cluster, say the one containing elements [0,1,2,4,5,6] in the dendrogram?
How can I get back the values of those elements?

Answer 1

Accepted answer by embert

There are three steps in hierarchical agglomerative clustering (HAC):

Quantify the data (metric argument)
Cluster the data (method argument)
Choose the number of clusters

Calling

z = linkage(a)

performs the first two steps. Since you did not specify any parameters, it uses the default values:

metric = 'euclidean'
method = 'single'
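
As a quick check of those defaults, the sketch below (an illustration, not part of the original answer) runs linkage on the data set from the question with and without the explicit arguments and compares the results. The names z_default and z_explicit are only illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Data set from the question.
a = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0], [0, 0.5],
              [0.5, 0.5], [2, 2], [2, 3], [3, 2], [3, 3]])

z_default = linkage(a)
z_explicit = linkage(a, method='single', metric='euclidean')

# Both calls should describe the same hierarchy.
print(np.allclose(z_default, z_explicit))

# Each row of the linkage matrix records one merge: the indices of the two
# clusters joined, the distance between them, and the size of the new cluster.
# With n = 11 observations there are n - 1 = 10 merges.
print(z_default.shape)
print(z_default[:3])
```

Indices below 11 in the first two columns refer to original observations; larger indices refer to clusters formed in earlier rows.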

So z = linkage(a) gives you a single-linkage hierarchical agglomerative clustering of a. This clustering is a hierarchy of solutions rather than one partition, and it tells you something about the structure of your data. What you might do now is:

Check which metric is appropriate, e.g. cityblock or chebyshev will quantify your data differently (cityblock, euclidean and chebyshev correspond to the L1, L2 and L_inf norms)
Check the different properties and behaviours of the methods (e.g. single, complete and average)
Check how to determine the number of clusters, e.g. by reading the wiki about it
Compute indices on the solutions (clusterings) you found, such as the silhouette coefficient, which gives feedback on how well a point/observation fits the cluster it was assigned to. Different indices use different criteria to assess a clustering.
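
The checks listed above can be sketched roughly as follows, using the data set from the question. This is an illustration under assumptions, not part of the original answer: silhouette_values is a hypothetical hand-written helper (it assumes at least two clusters and scores singleton clusters as 0), and the choice of two clusters in the last step is arbitrary.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform

# Data set from the question.
a = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0], [0, 0.5],
              [0.5, 0.5], [2, 2], [2, 3], [3, 2], [3, 3]])

# 1) Metrics: distance between observations 0 and 3, i.e. (0, 0) and (1, 1).
metric_dist = {}
for metric in ('cityblock', 'euclidean', 'chebyshev'):
    metric_dist[metric] = squareform(pdist(a, metric=metric))[0, 3]
    print(metric, metric_dist[metric])

# 2) Methods: height of the final merge for each linkage method.
final_height = {}
for method in ('single', 'complete', 'average'):
    z = linkage(a, method=method, metric='euclidean')
    final_height[method] = z[-1, 2]
    print(method, final_height[method])


def silhouette_values(points, labels):
    """Per-point silhouette values (hypothetical helper, needs >= 2 clusters)."""
    dist = squareform(pdist(points))
    values = np.zeros(len(points))
    for i, own in enumerate(labels):
        same = labels == own
        if same.sum() == 1:
            continue  # convention used here: a singleton cluster scores 0
        # Mean distance to the other members of the point's own cluster
        # (dist[i, i] is 0, so it adds nothing to the sum).
        a_i = dist[i, same].sum() / (same.sum() - 1)
        # Smallest mean distance to the members of any other cluster.
        b_i = min(dist[i, labels == other].mean()
                  for other in set(labels) if other != own)
        values[i] = (b_i - a_i) / max(a_i, b_i)
    return values


# 3) Index: mean silhouette of the two-cluster single-linkage solution.
labels = fcluster(linkage(a, method='single'), 2, criterion='maxclust')
mean_sil = silhouette_values(a, labels).mean()
print(mean_sil)
```

Values of the silhouette close to 1 suggest that points sit well inside their clusters; comparing the mean across different numbers of clusters is one way to choose between candidate solutions.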

Here is something to start with

import numpy as np
import scipy.cluster.hierarchy as hac
import matplotlib.pyplot as plt


a = np.array([[0.1,   2.5],
              [1.5,   .4 ],
              [0.3,   1  ],
              [1  ,   .8 ],
              [0.5,   0  ],
              [0  ,   0.5],
              [0.5,   0.5],
              [2.7,   2  ],
              [2.2,   3.1],
              [3  ,   2  ],
              [3.2,   1.3]])

fig, axes23 = plt.subplots(2, 3)

for method, axes in zip(['single', 'complete'], axes23):
    z = hac.linkage(a, method=method)

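    # Scree plot: merge distances from the last merge (x = 1) back to the first.
    axes[0].plot(range(1, len(z)+1), z[::-1, 2])
    # Second difference of that curve; its maximum marks a candidate knee.
    knee = np.diff(z[::-1, 2], 2)
    axes[0].plot(range(2, len(z)), knee)

    # The strongest and second-strongest knees give two candidate cluster counts.
    num_clust1 = knee.argmax() + 2
    knee[knee.argmax()] = 0
    num_clust2 = knee.argmax() + 2

    axes[0].text(num_clust1, z[::-1, 2][num_clust1-1], 'possible\n<- knee point')

    # Cut the hierarchy into at most num_clust flat clusters.
    part1 = hac.fcluster(z, num_clust1, 'maxclust')
    part2 = hac.fcluster(z, num_clust2, 'maxclust')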

    clr = ['#2200CC' ,'#D9007E' ,'#FF6600' ,'#FFCC00' ,'#ACE600' ,'#0099CC' ,
    '#8900CC' ,'#FF0000' ,'#FF9900' ,'#FFFF00' ,'#00CC01' ,'#0055CC']

    for part, ax in zip([part1, part2], axes[1:]):
        for cluster in set(part):
            ax.scatter(a[part == cluster, 0], a[part == cluster, 1], 
                       color=clr[cluster])

    m = '\n(method: {})'.format(method)
    plt.setp(axes[0], title='Screeplot{}'.format(m), xlabel='partition',
             ylabel='{}\ncluster distance'.format(m))
    plt.setp(axes[1], title='{} Clusters'.format(num_clust1))
    plt.setp(axes[2], title='{} Clusters'.format(num_clust2))

plt.tight_layout()
plt.show()

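This gives a 2x3 figure (image not included): one row per method, each with a scree plot marking the candidate knee point, followed by scatter plots of the two candidate clusterings.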
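
Coming back to the question of recovering the cluster with elements [0,1,2,4,5,6]: one possible approach, sketched here as an illustration rather than as part of the original answer, is to cut the tree with fcluster and select the observations that share a label with element 0. The threshold 0.6 is an assumption read off the merge heights for this particular data set (those six points merge at distance 0.5, and the next merge happens at about 0.71); other data would need a different cut.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Data set from the question.
a = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0], [0, 0.5],
              [0.5, 0.5], [2, 2], [2, 3], [3, 2], [3, 3]])

z = linkage(a)  # single linkage, Euclidean metric

# Cut the tree at height 0.6 (assumed threshold, specific to this data).
labels = fcluster(z, t=0.6, criterion='distance')

# Indices of all observations in the same flat cluster as observation 0.
members = np.flatnonzero(labels == labels[0])
print(members)     # [0 1 2 4 5 6]

# The corresponding rows of the original data.
print(a[members])
```

Indexing the original array with the member indices returns the coordinates of the points in that cluster.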
