如何在 Python 中对图形进行聚类?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/653496/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-11-03 20:33:26  来源:igfitidea点击:

How can I cluster a graph in Python?

pythonsortingmatrixcluster-analysisgraph-theory

提问by Pietro Speroni

Let G be a graph. So G is a set of nodes and set of links. I need to find a fast way to partition the graph. The graph I am now working has only 120*160 nodes, but I might soon be working on an equivalent problem, in another context (not medicine, but website development), with millions of nodes.

设 G 为图。所以 G 是一组节点和一组链接。我需要找到一种快速的方法来划分图形。我现在正在处理的图表只有 120*160 个节点,但我可能很快就会在另一个上下文(不是医学,而是网站开发)中处理具有数百万个节点的等效问题。

So, what I did was to store all the links into a graph matrix:

所以,我所做的是将所有链接存储到一个图形矩阵中:

M=numpy.mat(numpy.zeros((len(data.keys()),len(data.keys()))))

Now M holds a 1 in position s,t, if node s is connected to node t. I make sure M is symmetrical M[s,t]=M[t,s] and each node links to itself M[s,s]=1.

现在,如果节点 s 连接到节点 t,则 M 在位置 s,t 中保持 1。我确保 M 是对称的 M[s,t]=M[t,s] 并且每个节点都链接到自己 M[s,s]=1。

If I remember well if I multiply M with M, the results is a matrix that represents the graph that connects vertexes that are reached on through two steps.

如果我记得清楚,如果我将 M 与 M 相乘,结果是一个矩阵,表示连接通过两个步骤到达的顶点的图。

So I keep on multplying M with itself, until the number of zeros in the matrix do not decrease any longer. Now I have the list of the connected components. And now I need to cluster this matrix.

所以我继续将 M 与自身相乘,直到矩阵中的零数不再减少。现在我有了连接组件的列表。现在我需要对这个矩阵进行聚类。

Up to now I am pretty satisfied with the algorithm. I think it is easy, elegant, and reasonably fast. I am having trouble with this part.

到目前为止,我对算法非常满意。我认为它很容易,优雅,而且相当快。我在这部分遇到了问题。

Essentially I need to split this graph into its connected components.

本质上,我需要将此图拆分为其连接的组件。

I can go through all the nodes, and see what are they connected to.

我可以浏览所有节点,看看它们连接到什么。

But what about sorting the matrix reordering the lines. But I don't know if it is possible to do it.

但是如何对矩阵进行排序以重新排序行。但是不知道能不能实现。

What follows is the code so far:

以下是到目前为止的代码:

def findzeros(M):
    nZeros=0
    for t in M.flat:
        if not t:
            nZeros+=1
    return nZeros

M=numpy.mat(numpy.zeros((len(data.keys()),len(data.keys()))))    
for s in data.keys():
    MatrixCells[s,s]=1
    for t in data.keys():
        if t<s:
            if (scipy.corrcoef(data[t],data[s])[0,1])>threashold:
                M[s,t]=1
                M[t,s]=1

nZeros=findzeros(M)
M2=M*M
nZeros2=findzeros(M2)

while (nZeros-nZeros2):
    nZeros=nZeros2
    M=M2
    M2=M*M
    nZeros2=findzeros(M2)


Edit:

编辑:

It has been suggested that I use SVD decomposition. Here is a simple example of the problem on a 5x5 graph. We shall use this since with the 19200x19200 square matrix is not that easy to see the clusters.

有人建议我使用 SVD 分解。这是 5x5 图形上问题的一个简单示例。我们将使用它,因为使用 19200x19200 方阵不容易看到集群。

import numpy
import scipy

M=numpy.mat(numpy.zeros((5,5)))

M[1,3]=1
M[3,1]=1
M[1,1]=1
M[2,2]=1
M[3,3]=1
M[4,4]=1
M[0,0]=1

print M

u,s,vh = numpy.linalg.linalg.svd(M)
print u
print s
print vh

Essentially there are 4 clusters here: (0),(1,3),(2),(4) But I still don't see how the svn can help in this context.

基本上这里有 4 个集群:(0),(1,3),(2),(4) 但我仍然不明白 svn 在这种情况下如何提供帮助。

采纳答案by kquinn

Why not use a real graph library, like Python-Graph? It has a function to determine connected components(though no example is provided). I'd imagine a dedicated library is going to be faster than whatever ad-hoc graph code you've cooked up.

为什么不使用真正的图形库,比如Python-Graph?它具有确定连接组件功能(尽管没有提供示例)。我想一个专用的库将比您编写的任何临时图形代码更快。

EDIT: NetworkXseems like it might be a better choice than python-graph; its documentation (here for the connected components function)certainly is.

编辑:NetworkX似乎比 python-graph 更好;它的文档(此处用于连接组件功能)当然是。

回答by vartec

In SciPy you can use sparse matrices. Also note, that there are more efficient ways of multiplying matrix by itself. Anyway, what you're trying to do can by done by SVD decomposition.

在 SciPy 中,您可以使用稀疏矩阵。另请注意,有更有效的方法可以将矩阵自身相乘。无论如何,您尝试做的事情可以通过 SVD 分解来完成。

Introduction with useful links.

带有有用链接的介绍

回答by drevicko

There's also graph_tooland networkitthat have efficient routines for connected components, and both store the network efficiently. If you're going to work with millions of nodes, networkx will likely not be sufficient (it's pure python afaik). Both those tools are written in C++ so can handle analysis of large graphs with reasonable run times.

还有graph_toolnetworkit具有用于连接组件的高效例程,并且都有效地存储网络。如果您要使用数百万个节点,networkx 可能还不够(它是纯 python afaik)。这两个工具都是用 C++ 编写的,因此可以在合理的运行时间内处理大型图形的分析。

As Phil points out, your method will have horribly long compute times for large graphs (we're talking days, weeks, months...), and your representation for a graph of a million nodes will need something like a million gigabytes of memory!

正如 Phil 指出的那样,您的方法对于大型图(我们说的是几天、几周、几个月……)的计算时间将非常长,并且您对一百万个节点的表示将需要大约一百万 GB 的内存!

回答by lynxoid

Finding an optimal graph partition is an NP-hard problem, so whatever the algorithm, it is going to be an approximation or a heuristic. Not surprisingly, different clustering algorithms produce (wildly) different results.

寻找最佳图分区是一个 NP 难题,因此无论算法如何,它都将是近似值或启发式算法。毫不奇怪,不同的聚类算法会产生(非常)不同的结果。

Python implementation of Newman's modularity algorithm: modularity

纽曼模块化算法的Python实现: 模块化

Also: MCL, MCODE, CFinder, NeMo, clusterONE

还有:MCLMCODECFinderNeMoclusterONE

回答by RexE

As others have pointed out, no need to reinvent the wheel. A lot of thought has been put into optimal clustering techniques. Hereis one well-known clustering program.

正如其他人指出的那样,无需重新发明轮子。许多思想已经投入到最佳聚类技术中。是一个著名的聚类程序。

回答by RexE

Here's some naive implementation, which finds the connected components using depth first search, i wrote some time ago. Although it's very simple, it scales well to ten thousands of vertices and edges...

这是一些天真的实现,它使用深度优先搜索找到连接的组件,我前段时间写过。虽然它很简单,但它可以很好地扩展到数以万计的顶点和边......


import sys
from operator import gt, lt

class Graph(object):
    def __init__(self):
        self.nodes = set()
        self.edges = {}
        self.cluster_lookup = {}
        self.no_link = {}

    def add_edge(self, n1, n2, w):
        self.nodes.add(n1)
        self.nodes.add(n2)
        self.edges.setdefault(n1, {}).update({n2: w})
        self.edges.setdefault(n2, {}).update({n1: w})

    def connected_components(self, threshold=0.9, op=lt):
        nodes = set(self.nodes)
        components, visited = [], set()
        while len(nodes) > 0:
            connected, visited = self.dfs(nodes.pop(), visited, threshold, op)
            connected = set(connected)
            for node in connected:
                if node in nodes:
                    nodes.remove(node)

            subgraph = Graph()
            subgraph.nodes = connected
            subgraph.no_link = self.no_link
            for s in subgraph.nodes:
                for k, v in self.edges.get(s, {}).iteritems():
                    if k in subgraph.nodes:
                        subgraph.edges.setdefault(s, {}).update({k: v})
                if s in self.cluster_lookup:
                    subgraph.cluster_lookup[s] = self.cluster_lookup[s]

            components.append(subgraph)
        return components

    def dfs(self, v, visited, threshold, op=lt, first=None):
        aux = [v]
        visited.add(v)
        if first is None:
            first = v
        for i in (n for n, w in self.edges.get(v, {}).iteritems()
                  if op(w, threshold) and n not in visited):
            x, y = self.dfs(i, visited, threshold, op, first)
            aux.extend(x)
            visited = visited.union(y)
        return aux, visited

def main(args):
    graph = Graph()
    # first component
    graph.add_edge(0, 1, 1.0)
    graph.add_edge(1, 2, 1.0)
    graph.add_edge(2, 0, 1.0)

    # second component
    graph.add_edge(3, 4, 1.0)
    graph.add_edge(4, 5, 1.0)
    graph.add_edge(5, 3, 1.0)

    first, second = graph.connected_components(op=gt)
    print first.nodes
    print second.nodes

if __name__ == '__main__':
    main(sys.argv)

回答by RexE

The SVD algorithm is not applicable here, but otherwise Phil H is correct.

SVD 算法在这里不适用,否则 Phil H 是正确的。

回答by Phil H

Looks like there is a library PyMetis, which will partition your graph for you, given a list of links. It should be fairly easy to extract the list of links from your graph by passing it your original list of linked nodes (not the matrix-multiply-derived one).

看起来有一个库PyMetis,它会根据链接列表为您划分图形。通过将链接节点的原始列表(不是矩阵乘法导出的)传递给图表,从图中提取链接列表应该相当容易。

Repeatedly performing M' = MM will not be efficient for large orders of M. A full matrix-multiply for matrices of order N will cost N multiplications and N-1 additions per element, of which there are N2, that is O(N3) operations. If you are scaling that to "millions of nodes", that would be O(1018) operations per matrix-matrix multiplication, of which you want to do several.

重复执行 M' = MM 对于 M 的大阶将不会有效。 N 阶矩阵的完整矩阵乘法将花费 N 次乘法和 N-1 次加法,其中每个元素有 N 2,即 O(N 3) 操作。如果您将其扩展到“数百万个节点”,那么每个矩阵-矩阵乘法将是 O(10 18) 次操作,您想要执行其中的几个操作。

In short, you don't want to do it this way. The SVD suggestionfrom Vartec would be the only appropriate choice there. Your best option is just to use PyMetis, and not try to reinvent graph-partitioning.

简而言之,你不想这样做。Vatec的SVD 建议将是那里唯一合适的选择。您最好的选择是使用 PyMetis,而不是尝试重新发明图形分区。