Note: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me), citing the original address: http://stackoverflow.com/questions/2455761/

Time: 2020-11-04 00:41:27  Source: igfitidea

Reordering matrix elements to reflect column and row clustering in naive python

python statistics numpy cluster-analysis scipy

Asked by Boris Gorelik

I'm looking for a way to perform clustering separately on matrix rows and then on its columns, reorder the data in the matrix to reflect the clustering, and put it all together. The clustering problem is easily solvable, and so is the dendrogram creation (for example in this blog or in "Programming collective intelligence"). However, how to reorder the data remains unclear to me.

Eventually, I'm looking for a way of creating graphs similar to the one below using naive Python (with any "standard" library such as numpy, matplotlib, etc., but without using R or other external tools).

[Figure: dendrogram]
(source: warwick.ac.uk)

Clarifications

I was asked what I meant by reordering. When you cluster data in a matrix first by its rows, then by its columns, each matrix cell can be identified by its position in the two dendrograms. If you reorder the rows and the columns of the original matrix so that elements that are close to each other in the dendrograms become close to each other in the matrix, and then generate a heatmap, the clustering of the data may become evident to the viewer (as in the figure above).
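
To make the reordering step concrete, here is a minimal sketch. It assumes the leaf orders of the two dendrograms are already known; the `row_order` and `col_order` values and the toy matrix are made up purely for illustration.

```python
import numpy as np

# Toy 4x3 matrix; the values are illustrative only.
m = np.arange(12).reshape(4, 3)

# Hypothetical leaf orders read off the row and column dendrograms.
row_order = [2, 0, 3, 1]
col_order = [1, 2, 0]

# np.ix_ builds an open mesh so rows and columns are permuted together.
reordered = m[np.ix_(row_order, col_order)]
print(reordered)
```

Row `i` of `reordered` is row `row_order[i]` of `m`, and likewise for columns, so neighbours in the dendrograms end up as neighbours in the matrix.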

Answered by Steve Tjoa

See my recent answer, copied in part below, to this related question.

import numpy as np
import pylab
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import squareform

# Generate features and distance matrix.
x = np.random.rand(40)
# D[i, j] = |x[i] - x[j]|, computed for all pairs by broadcasting.
D = np.abs(x[:, None] - x[None, :])

# Compute and plot dendrogram.
fig = pylab.figure()
axdendro = fig.add_axes([0.09,0.1,0.2,0.8])
# linkage expects a condensed distance vector, not a square matrix.
Y = sch.linkage(squareform(D), method='centroid')
Z = sch.dendrogram(Y, orientation='right')
axdendro.set_xticks([])
axdendro.set_yticks([])

# Plot distance matrix.
axmatrix = fig.add_axes([0.3,0.1,0.6,0.8])
index = Z['leaves']
D = D[index,:]
D = D[:,index]
im = axmatrix.matshow(D, aspect='auto', origin='lower')
axmatrix.set_xticks([])
axmatrix.set_yticks([])

# Plot colorbar.
axcolor = fig.add_axes([0.91,0.1,0.02,0.8])
pylab.colorbar(im, cax=axcolor)

# Display and save figure.
fig.show()
fig.savefig('dendrogram.png')

[Figure: dendrogram and distance matrix]
(source: stevetjoa.com)
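
The code above reorders a square distance matrix with a single dendrogram. For a rectangular data matrix whose rows and columns are clustered separately, as in the question, a similar approach might look like the sketch below. The random data, the `'average'` linkage method, and the variable names are assumptions chosen for illustration, not part of the original answer.

```python
import numpy as np
import scipy.cluster.hierarchy as sch

# Illustrative data: 6 observations (rows) by 4 variables (columns).
rng = np.random.default_rng(0)
data = rng.random((6, 4))

# Cluster the rows, then cluster the columns by transposing the data.
# 'average' linkage is an arbitrary choice for this sketch.
row_linkage = sch.linkage(data, method='average')
col_linkage = sch.linkage(data.T, method='average')

# leaves_list returns the leaf order of each hierarchical tree.
row_order = sch.leaves_list(row_linkage)
col_order = sch.leaves_list(col_linkage)

# Reorder rows and columns together; this matrix can be passed to matshow.
clustered = data[np.ix_(row_order, col_order)]
```

Each linkage can also be drawn with `sch.dendrogram` on its own axes, one beside and one above the heatmap, in the same way the single dendrogram is placed above.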

Answered by Paul

I'm not sure I completely understand, but it appears you are trying to re-index each axis of the array based on sorts of the dendrogram indices. I guess that assumes there is some comparative logic in each branch delineation. If this is the case, then would this work(?):

>>> import numpy as np
>>> x_idxs = [(0,1,0,0),(0,1,1,1),(0,1,1),(0,0,1),(1,1,1,1),(0,0,0,0)]
>>> y_idxs = [(1,1),(0,1),(1,0),(0,0)]
>>> a = np.random.random((len(x_idxs),len(y_idxs)))
>>> x_idxs2, xi = zip(*sorted(zip(x_idxs,range(len(x_idxs)))))
>>> y_idxs2, yi = zip(*sorted(zip(y_idxs,range(len(y_idxs)))))
>>> a2 = a[xi,:][:,yi]

x_idxs and y_idxs are the dendrogram indices. a is the unsorted matrix. xi and yi are your new row/column array indices. a2 is the sorted matrix, while x_idxs2 and y_idxs2 are the new, sorted dendrogram indices. This assumes that when the dendrogram was created, a 0 branch column/row is always comparatively larger/smaller than a 1 branch.

If your y_idxs and x_idxs are not lists but are numpy arrays, then you could use np.argsort in a similar manner.
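
A rough sketch of the np.argsort variant follows. It assumes each row and column has a single scalar sort key held in a 1-D array; `x_keys`, `y_keys`, and the toy matrix are hypothetical values, not taken from a real dendrogram.

```python
import numpy as np

# Hypothetical sort keys, one per row and one per column of the matrix.
x_keys = np.array([2, 0, 3, 1])
y_keys = np.array([1, 0, 2])
a = np.arange(12).reshape(4, 3)

# argsort returns the indices that would sort each key array.
xi = np.argsort(x_keys)
yi = np.argsort(y_keys)

# Permute rows first, then columns.
a2 = a[xi, :][:, yi]
print(a2)
```

If the keys are branch paths stored as rows of a 2-D array rather than scalars, a plain argsort is not enough and a multi-key sort would be needed instead.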

Answered by themantalope

I know this is very late to the game, but I made a plotting object based on the code from the post on this page. It's registered on pip, so to install it you just have to run

pip install pydendroheatmap

Check out the project's GitHub page here: https://github.com/themantalope/pydendroheatmap
