
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/29127013/

Date: 2020-08-19 04:07:02  Source: igfitidea

Plot dendrogram using sklearn.AgglomerativeClustering

python · plot · cluster-analysis · dendrogram

Asked by Shukhrat Khannanov

I'm trying to build a dendrogram using the children_ attribute provided by AgglomerativeClustering, but so far I'm out of luck. I can't use scipy.cluster, since the agglomerative clustering provided in scipy lacks some options that are important to me (such as the option to specify the number of clusters). I would be really grateful for any advice.


    from sklearn.cluster import AgglomerativeClustering
    clstr = AgglomerativeClustering(n_clusters=2)
    clstr.children_  # note: children_ is only available after calling clstr.fit(X)
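For reference, a minimal sketch (on made-up data) of what children_ contains once the model has been fit:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [1.1, 1.1]])
clstr = AgglomerativeClustering(n_clusters=2).fit(X)

# children_ has shape (n_samples - 1, 2): row i records the two nodes
# merged at step i. Values < n_samples are leaves; values >= n_samples
# refer to the node created at merge step (value - n_samples).
print(clstr.children_.shape)  # (3, 2)
```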

Answered by lucianopaz

I came across the exact same problem some time ago. The way I managed to plot the dendrogram was with the software package ete3. This package can flexibly plot trees with various options. The only difficulty was converting sklearn's children_ output to the Newick tree format that ete3 can read and understand. Furthermore, I had to compute the dendrite's span manually, because that information is not provided with children_. Here is a snippet of the code I used. It computes the Newick tree and then shows the ete3 Tree datastructure. For more details on how to plot, take a look here.


import numpy as np
from sklearn.cluster import AgglomerativeClustering
import ete3

def build_Newick_tree(children,n_leaves,X,leaf_labels,spanner):
    """
    build_Newick_tree(children,n_leaves,X,leaf_labels,spanner)

    Get a string representation (Newick tree) from the sklearn
    AgglomerativeClustering.fit output.

    Input:
        children: AgglomerativeClustering.children_
        n_leaves: AgglomerativeClustering.n_leaves_
        X: parameters supplied to AgglomerativeClustering.fit
        leaf_labels: The label of each parameter array in X
        spanner: Callable that computes the dendrite's span

    Output:
        ntree: A str with the Newick tree representation

    """
    return go_down_tree(children,n_leaves,X,leaf_labels,len(children)+n_leaves-1,spanner)[0]+';'

def go_down_tree(children,n_leaves,X,leaf_labels,nodename,spanner):
    """
    go_down_tree(children,n_leaves,X,leaf_labels,nodename,spanner)

    Iterative function that traverses the subtree that descends from
    nodename and returns the Newick representation of the subtree.

    Input:
        children: AgglomerativeClustering.children_
        n_leaves: AgglomerativeClustering.n_leaves_
        X: parameters supplied to AgglomerativeClustering.fit
        leaf_labels: The label of each parameter array in X
        nodename: An int that is the intermediate node name whose
            children are located in children[nodename-n_leaves].
        spanner: Callable that computes the dendrite's span

    Output:
        ntree: A str with the Newick tree representation

    """
    nodeindex = nodename-n_leaves
    if nodename<n_leaves:
        return leaf_labels[nodeindex],np.array([X[nodeindex]])
    else:
        node_children = children[nodeindex]
        branch0,branch0samples = go_down_tree(children,n_leaves,X,leaf_labels,node_children[0],spanner)
        branch1,branch1samples = go_down_tree(children,n_leaves,X,leaf_labels,node_children[1],spanner)
        node = np.vstack((branch0samples,branch1samples))
        branch0span = spanner(branch0samples)
        branch1span = spanner(branch1samples)
        nodespan = spanner(node)
        branch0distance = nodespan-branch0span
        branch1distance = nodespan-branch1span
        nodename = '({branch0}:{branch0distance},{branch1}:{branch1distance})'.format(branch0=branch0,branch0distance=branch0distance,branch1=branch1,branch1distance=branch1distance)
        return nodename,node

def get_cluster_spanner(aggClusterer):
    """
    spanner = get_cluster_spanner(aggClusterer)

    Input:
        aggClusterer: sklearn.cluster.AgglomerativeClustering instance

    Get a callable that computes a given cluster's span. To compute
    a cluster's span, call spanner(cluster)

    The cluster must be a 2D numpy array, where the axis=0 holds
    separate cluster members and the axis=1 holds the different
    variables.

    """
    if aggClusterer.linkage=='ward':
        if aggClusterer.affinity=='euclidean':
            spanner = lambda x:np.sum((x-aggClusterer.pooling_func(x,axis=0))**2)
    elif aggClusterer.linkage=='complete':
        if aggClusterer.affinity=='euclidean':
            spanner = lambda x:np.max(np.sum((x[:,None,:]-x[None,:,:])**2,axis=2))
        elif aggClusterer.affinity=='l1' or aggClusterer.affinity=='manhattan':
            spanner = lambda x:np.max(np.sum(np.abs(x[:,None,:]-x[None,:,:]),axis=2))
        elif aggClusterer.affinity=='l2':
            spanner = lambda x:np.max(np.sqrt(np.sum((x[:,None,:]-x[None,:,:])**2,axis=2)))
        elif aggClusterer.affinity=='cosine':
            spanner = lambda x:np.max(np.sum((x[:,None,:]*x[None,:,:]))/(np.sqrt(np.sum(x[:,None,:]*x[:,None,:],axis=2,keepdims=True))*np.sqrt(np.sum(x[None,:,:]*x[None,:,:],axis=2,keepdims=True))))
        else:
            raise AttributeError('Unknown affinity attribute value {0}.'.format(aggClusterer.affinity))
    elif aggClusterer.linkage=='average':
        if aggClusterer.affinity=='euclidean':
            spanner = lambda x:np.mean(np.sum((x[:,None,:]-x[None,:,:])**2,axis=2))
        elif aggClusterer.affinity=='l1' or aggClusterer.affinity=='manhattan':
            spanner = lambda x:np.mean(np.sum(np.abs(x[:,None,:]-x[None,:,:]),axis=2))
        elif aggClusterer.affinity=='l2':
            spanner = lambda x:np.mean(np.sqrt(np.sum((x[:,None,:]-x[None,:,:])**2,axis=2)))
        elif aggClusterer.affinity=='cosine':
            spanner = lambda x:np.mean(np.sum((x[:,None,:]*x[None,:,:]))/(np.sqrt(np.sum(x[:,None,:]*x[:,None,:],axis=2,keepdims=True))*np.sqrt(np.sum(x[None,:,:]*x[None,:,:],axis=2,keepdims=True))))
        else:
            raise AttributeError('Unknown affinity attribute value {0}.'.format(aggClusterer.affinity))
    else:
        raise AttributeError('Unknown linkage attribute value {0}.'.format(aggClusterer.linkage))
    return spanner

clusterer = AgglomerativeClustering(n_clusters=2,compute_full_tree=True) # You can set compute_full_tree to 'auto', but I left it this way to get the entire tree plotted
clusterer.fit(X) # X for whatever you want to fit
spanner = get_cluster_spanner(clusterer)
newick_tree = build_Newick_tree(clusterer.children_,clusterer.n_leaves_,X,leaf_labels,spanner) # leaf_labels is a list of labels for each entry in X
tree = ete3.Tree(newick_tree)
tree.show()

Answered by sebastianspiegel

Use the scipy implementation of agglomerative clustering instead. Here is an example.


import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

data = [[0., 0.], [0.1, -0.1], [1., 1.], [1.1, 1.1]]

Z = linkage(data)

dendrogram(Z)
plt.show()

You can find documentation for linkage here and documentation for dendrogram here.
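For reference, the linkage matrix Z that dendrogram consumes has one row per merge; a quick sketch of its layout (ward linkage is an arbitrary choice here, purely for illustration):

```python
from scipy.cluster.hierarchy import linkage

data = [[0., 0.], [0.1, -0.1], [1., 1.], [1.1, 1.1]]
Z = linkage(data, method='ward')

# Each of the n-1 rows is [node_a, node_b, distance, sample_count],
# where node indices >= n refer to clusters formed in earlier rows.
print(Z.shape)  # (3, 4)
```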


Answered by jagthebeetle

For those willing to step out of Python and use the robust D3 library, it's not super difficult to use the d3.cluster() (or, I guess, d3.tree()) APIs to achieve a nice, customizable result.


See the jsfiddle for a demo.


The children_ array luckily functions easily as a JS array, and the only intermediary step is to use d3.stratify() to turn it into a hierarchical representation. Specifically, we need each node to have an id and a parentId:


var N = 272;  // Your n_samples/corpus size.
var root = d3.stratify()
  .id((d,i) => i + N)
  .parentId((d, i) => {
    var parIndex = data.findIndex(e => e.includes(i + N));
    if (parIndex < 0) {
      return; // The root should have an undefined parentId.
    }
    return parIndex + N;
  })(data); // Your children_

You end up with at least O(n^2) behaviour here due to the findIndex line, but it probably doesn't matter until your n_samples becomes huge, in which case you could precompute a more efficient index.


Beyond that, it's pretty much plug-and-chug use of d3.cluster(). See mbostock's canonical block or my JSFiddle.


N.B. For my use case, it sufficed merely to show non-leaf nodes; it's a bit trickier to visualize the samples/leaves, since these might not all be in the children_ array explicitly.


Answered by David Diaz

Here is a simple function for taking a hierarchical clustering model from sklearn and plotting it using the scipy dendrogram function. It seems that graphing functions are often not directly supported in sklearn. You can find an interesting discussion related to the pull request for this plot_dendrogram code snippet here.

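For convenience, here is roughly that helper, adapted from the linked snippet. Note the assumption: the model must have been fit with distance_threshold=0 and n_clusters=None (or with compute_distances=True) so that the distances_ attribute is populated.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram

def plot_dendrogram(model, **kwargs):
    # Build a scipy-style linkage matrix from a fitted
    # AgglomerativeClustering model and hand it to scipy's dendrogram.
    n_samples = len(model.labels_)
    counts = np.zeros(model.children_.shape[0])
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    # Each row: [node_a, node_b, distance, number of samples under the node]
    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)
    return dendrogram(linkage_matrix, **kwargs)
```

Usage: fit with, e.g., AgglomerativeClustering(distance_threshold=0, n_clusters=None).fit(X), then call plot_dendrogram(model).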

I'd clarify that the use case you describe (defining the number of clusters) is available in scipy: after you've performed the hierarchical clustering using scipy's linkage, you can cut the hierarchy to whatever number of clusters you want using fcluster, with the number of clusters specified in the t argument and criterion='maxclust'.

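A minimal sketch of that cut, on made-up data:

```python
from scipy.cluster.hierarchy import linkage, fcluster

data = [[0., 0.], [0.1, -0.1], [1., 1.], [1.1, 1.1]]
Z = linkage(data)

# Cut the hierarchy into exactly 2 flat clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)  # the two close pairs land in separate clusters
```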