Python scipy.cluster.hierarchy tutorial

Notice: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/21638130/

Date: 2020-08-18 23:18:29  Source: igfitidea

Tutorial for scipy.cluster.hierarchy

Tags: python, scipy, hierarchical-clustering

Asked by user2988577

I'm trying to understand how to work with a hierarchical clustering, but the documentation is too ... technical? ... and I can't understand how it works.

Is there a tutorial that can help me get started, explaining some simple tasks step by step?

Let's say I have the following data set:

a = np.array([[0,   0  ],
              [1,   0  ],
              [0,   1  ],
              [1,   1  ], 
              [0.5, 0  ],
              [0,   0.5],
              [0.5, 0.5],
              [2,   2  ],
              [2,   3  ],
              [3,   2  ],
              [3,   3  ]])

I can easily do the hierarchical clustering and plot the dendrogram:

from scipy.cluster.hierarchy import linkage, dendrogram
z = linkage(a)
d = dendrogram(z)

  • Now, how can I recover a specific cluster? Say, the one with elements [0, 1, 2, 4, 5, 6] in the dendrogram?
  • How can I get back the values of those elements?

Accepted answer by embert

There are three steps in hierarchical agglomerative clustering (HAC):

  1. Quantify Data (metric argument)
  2. Cluster Data (method argument)
  3. Choose the number of clusters
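The three steps can be spelled out one call at a time. This is a minimal sketch, not the answer's own code: the `points` array is made-up data, and `pdist` and `fcluster` are used here only to make steps 1 and 3 explicit.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up data for illustration: two well-separated groups of three points
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])

# Step 1: quantify the data as condensed pairwise distances (metric argument)
dists = pdist(points, metric='euclidean')

# Step 2: cluster the data by repeatedly merging the closest groups (method argument)
z = linkage(dists, method='single')

# Step 3: choose the number of clusters and cut the hierarchy accordingly
labels = fcluster(z, 2, criterion='maxclust')
print(labels)
```

Passing the raw observations to `linkage` together with a `metric` performs step 1 internally, which is what the shorter calls in this answer rely on.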

Doing

z = linkage(a)

will accomplish the first two steps. Since you did not specify any parameters, it uses the default values:

  1. metric = 'euclidean'
  2. method = 'single'
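A quick way to convince yourself of these defaults is to compare the bare call with one that names both arguments. The sketch below uses the data set from the question; the variable names `z_default` and `z_explicit` are just for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Data set from the question
a = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0], [0, 0.5],
              [0.5, 0.5], [2, 2], [2, 3], [3, 2], [3, 3]])

z_default = linkage(a)
z_explicit = linkage(a, method='single', metric='euclidean')

# Both calls should produce the same linkage matrix
print(np.allclose(z_default, z_explicit))

# One row per merge: (number of observations - 1) rows, 4 columns
print(z_default.shape)
```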

So z = linkage(a) will give you a single-linkage hierarchical agglomerative clustering of a. This clustering is a hierarchy of solutions, and from this hierarchy you get some information about the structure of your data. What you might do now is:

  • Check which metric is appropriate, e.g. cityblock or chebyshev will quantify your data differently (cityblock, euclidean and chebyshev correspond to the L1, L2 and L_inf norms)
  • Check the different properties / behaviours of the methods (e.g. single, complete and average)
  • Check how to determine the number of clusters, e.g. by reading the wiki about it
  • Compute indices on the solutions found (clusterings), such as the silhouette coefficient (this coefficient gives feedback on how well a point/observation fits the cluster it is assigned to). Different indices use different criteria to assess a clustering.
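To see the first two points in numbers, the sketch below measures one made-up pair of points with each metric, then clusters the question's data with three methods and records the height of the final merge. The names `pair`, `dist_by_metric` and `final_height` are illustrative only.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

# Made-up pair of points that differ by 3 along x and 4 along y
pair = np.array([[0.0, 0.0], [3.0, 4.0]])
dist_by_metric = {}
for metric in ['cityblock', 'euclidean', 'chebyshev']:
    dist_by_metric[metric] = pdist(pair, metric=metric)[0]
    print(metric, dist_by_metric[metric])
# L1 gives 3 + 4 = 7, L2 gives sqrt(9 + 16) = 5, L_inf gives max(3, 4) = 4

# Data set from the question, clustered with three linkage methods
a = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0], [0, 0.5],
              [0.5, 0.5], [2, 2], [2, 3], [3, 2], [3, 3]])
final_height = {}
for method in ['single', 'complete', 'average']:
    z = linkage(a, method=method, metric='euclidean')
    # Column 2 of the linkage matrix holds the distance at which each merge happens
    final_height[method] = z[-1, 2]
    print(method, 'final merge height:', final_height[method])
```

Single linkage merges on the closest pair between two groups and complete linkage on the farthest pair, so the same data yields quite different merge heights.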

Here is something to start with:

import numpy as np
import scipy.cluster.hierarchy as hac
import matplotlib.pyplot as plt


a = np.array([[0.1,   2.5],
              [1.5,   .4 ],
              [0.3,   1  ],
              [1  ,   .8 ],
              [0.5,   0  ],
              [0  ,   0.5],
              [0.5,   0.5],
              [2.7,   2  ],
              [2.2,   3.1],
              [3  ,   2  ],
              [3.2,   1.3]])

fig, axes23 = plt.subplots(2, 3)

for method, axes in zip(['single', 'complete'], axes23):
    z = hac.linkage(a, method=method)

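    # Scree plot: merge distances in reverse order, so the last merge comes first
    axes[0].plot(range(1, len(z)+1), z[::-1, 2])
    # Second differences of the merge distances; a large value hints at a knee
    knee = np.diff(z[::-1, 2], 2)
    axes[0].plot(range(2, len(z)), knee)

    # Use the two strongest knee candidates as numbers of clusters
    num_clust1 = knee.argmax() + 2
    knee[knee.argmax()] = 0
    num_clust2 = knee.argmax() + 2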

    axes[0].text(num_clust1, z[::-1, 2][num_clust1-1], 'possible\n<- knee point')

    part1 = hac.fcluster(z, num_clust1, 'maxclust')
    part2 = hac.fcluster(z, num_clust2, 'maxclust')

    clr = ['#2200CC' ,'#D9007E' ,'#FF6600' ,'#FFCC00' ,'#ACE600' ,'#0099CC' ,
    '#8900CC' ,'#FF0000' ,'#FF9900' ,'#FFFF00' ,'#00CC01' ,'#0055CC']

    for part, ax in zip([part1, part2], axes[1:]):
        for cluster in set(part):
            ax.scatter(a[part == cluster, 0], a[part == cluster, 1], 
                       color=clr[cluster])

    m = '\n(method: {})'.format(method)
    plt.setp(axes[0], title='Screeplot{}'.format(m), xlabel='partition',
             ylabel='{}\ncluster distance'.format(m))
    plt.setp(axes[1], title='{} Clusters'.format(num_clust1))
    plt.setp(axes[2], title='{} Clusters'.format(num_clust2))

plt.tight_layout()
plt.show()

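This gives a 2x3 grid of plots: for each method (single and complete), a scree plot with the possible knee point marked, followed by scatter plots of the two candidate clusterings.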
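Coming back to the original question, the same `fcluster` call can be used to recover one specific cluster and its values. The sketch below is one possible approach, not part of the original answer: it cuts the tree by distance instead of by cluster count, and the threshold 0.6 is an assumption chosen to lie between the merge height of the wanted group (0.5) and the next merge (about 0.71).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Data set from the question
a = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0], [0, 0.5],
              [0.5, 0.5], [2, 2], [2, 3], [3, 2], [3, 3]])

z = linkage(a)  # single linkage, Euclidean distance

# Assumed threshold: cut the dendrogram at height 0.6
labels = fcluster(z, t=0.6, criterion='distance')

# Indices of all points that share a flat cluster with point 0
members = np.flatnonzero(labels == labels[0])
print(members)     # indices of the cluster members: [0 1 2 4 5 6]

# Index the original array to get the values of those elements back
print(a[members])
```

Reading the threshold off the dendrogram's y-axis is the manual part; `criterion='maxclust'`, as used in the answer's code, avoids that when you already know how many clusters you want.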