Python: iterating over a scipy.sparse vector (or matrix)

Question

Asked by RandomGuy

I'm wondering what the best way is to iterate nonzero entries of sparse matrices with scipy.sparse. For example, if I do the following:

from scipy.sparse import lil_matrix

x = lil_matrix( (20,1) )
x[13,0] = 1
x[15,0] = 2

c = 0
for i in x:
  print c, i
  c = c+1

the output is

0 
1 
2 
3 
4 
5 
6 
7 
8 
9 
10 
11 
12 
13   (0, 0) 1.0
14 
15   (0, 0) 2.0
16 
17 
18 
19

so it appears the iterator is touching every element, not just the nonzero entries. I've had a look at the API

http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.lil_matrix.html

and searched around a bit, but I can't seem to find a solution that works.

Answer 1

Accepted answer by unutbu

Edit: bbtrb's method (using coo_matrix) is much faster than my original suggestion, which used nonzero. Sven Marnach's suggestion to use itertools.izip also improves the speed. The current fastest is using_tocoo_izip:

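# Note: this benchmark targets Python 2 (itertools.izip, xrange).
# In Python 3, the built-in zip is already lazy and xrange is spelled range.
import scipy.sparse
import random
import itertools

def using_nonzero(x):
    # Look up each value by index: one sparse lookup per entry.
    rows, cols = x.nonzero()
    for row, col in zip(rows, cols):
        ((row, col), x[row, col])

def using_coo(x):
    # Build a COO copy through the constructor.
    cx = scipy.sparse.coo_matrix(x)
    for i, j, v in zip(cx.row, cx.col, cx.data):
        (i, j, v)

def using_tocoo(x):
    # Convert with the matrix's own tocoo() method.
    cx = x.tocoo()
    for i, j, v in zip(cx.row, cx.col, cx.data):
        (i, j, v)

def using_tocoo_izip(x):
    # Same as using_tocoo, but izip avoids building the full list of tuples (Python 2).
    cx = x.tocoo()
    for i, j, v in itertools.izip(cx.row, cx.col, cx.data):
        (i, j, v)

# Test matrix: N x N, filled at up to N random positions.
N = 200
x = scipy.sparse.lil_matrix((N, N))
for _ in xrange(N):
    x[random.randint(0, N-1), random.randint(0, N-1)] = random.randint(1, 100)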

yields these timeit results:

% python -mtimeit -s'import test' 'test.using_tocoo_izip(test.x)'
1000 loops, best of 3: 670 usec per loop
% python -mtimeit -s'import test' 'test.using_tocoo(test.x)'
1000 loops, best of 3: 706 usec per loop
% python -mtimeit -s'import test' 'test.using_coo(test.x)'
1000 loops, best of 3: 802 usec per loop
% python -mtimeit -s'import test' 'test.using_nonzero(test.x)'
100 loops, best of 3: 5.25 msec per loop
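
For readers on Python 3, the fastest variant above can be sketched as follows. This is a minimal sketch, not the answerer's code: the helper name iter_nonzero is hypothetical, and it assumes the built-in zip (which is lazy in Python 3) stands in for itertools.izip. It is applied to the 20x1 matrix from the question.

```python
from scipy.sparse import lil_matrix

def iter_nonzero(x):
    """Yield (row, col, value) for each stored entry of a scipy.sparse matrix.

    Hypothetical helper name, used here for illustration only.
    """
    cx = x.tocoo()  # COO keeps parallel row, col and data arrays
    # In Python 3 the built-in zip is lazy, so itertools.izip is not needed.
    for i, j, v in zip(cx.row, cx.col, cx.data):
        yield int(i), int(j), v

# The matrix from the question: two stored entries in a 20x1 column.
x = lil_matrix((20, 1))
x[13, 0] = 1
x[15, 0] = 2

entries = list(iter_nonzero(x))
print(entries)
```

Only the two stored entries are visited, rather than all 20 rows. Note that "stored" is not always the same as "nonzero": a matrix may hold explicitly stored zeros, and those would be yielded as well.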

Answer 2

Answered by Kabie

Try filter(lambda x: x, x) instead of x.

Answer 3

Answered by bbtrb

The fastest way should be by converting to a coo_matrix:

import scipy.sparse

cx = scipy.sparse.coo_matrix(x)  # x is the sparse matrix from the question

for i, j, v in zip(cx.row, cx.col, cx.data):
    print("(%d, %d), %s" % (i, j, v))

Answer 4

Answered by Davide C

I had the same problem. If your only concern is speed, the fastest way (more than one order of magnitude faster) is to convert the sparse matrix to a dense one (x.todense()) and iterate over the nonzero elements of the dense matrix. Of course, this approach requires a lot more memory.
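
The dense approach could look roughly like this. It is a sketch under assumptions: it uses toarray() (which returns a plain ndarray) in place of the todense() call named above, and it picks np.nonzero to locate the entries, which the answer does not specify. The variable names are illustrative.

```python
import numpy as np
from scipy.sparse import lil_matrix

# The matrix from the question.
x = lil_matrix((20, 1))
x[13, 0] = 1
x[15, 0] = 2

# Assumption: toarray() instead of todense(), to get a plain ndarray.
dense = x.toarray()  # allocates rows * cols values, regardless of sparsity

# np.nonzero returns one index array per dimension for the nonzero positions.
rows, cols = np.nonzero(dense)
dense_entries = [(int(r), int(c), float(dense[r, c])) for r, c in zip(rows, cols)]
print(dense_entries)
```

The memory cost grows with the full shape of the matrix, so this only makes sense when the dense array comfortably fits in memory.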

Answer 5

Answered by Herbert

tocoo() materializes the entire matrix in a different structure, which is not the preferred modus operandi in Python 3. You can also consider this iterator for CSR matrices, which is especially useful for large matrices.

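from itertools import chain, repeat

def iter_csr(matrix):
    # Number of stored entries in each row: differences of consecutive indptr values.
    row_counts = matrix.indptr[1:] - matrix.indptr[:-1]
    # Repeat each row index once per stored entry in that row, lazily.
    rows = chain.from_iterable(repeat(i, r) for (i, r) in enumerate(row_counts))
    for (row, col, val) in zip(rows, matrix.indices, matrix.data):
        yield (row, col, val)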

I have to admit that I'm using a lot of python-constructs which possibly should be replaced by numpy-constructs (especially enumerate).
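
One possible NumPy-based variant of the same idea is sketched below. This is an assumption about what "numpy-constructs" could mean here, not the answerer's code: the name iter_csr_numpy is hypothetical, and np.diff and np.repeat replace the enumerate/chain/repeat combination.

```python
import numpy as np
from scipy.sparse import csr_matrix

def iter_csr_numpy(matrix):
    """Yield (row, col, value) for each stored entry of a CSR matrix.

    Hypothetical variant for illustration; expects a csr_matrix.
    """
    # Stored entries per row: differences of consecutive indptr values.
    row_counts = np.diff(matrix.indptr)
    # Expand to one row index per stored entry (allocates an array of length nnz).
    rows = np.repeat(np.arange(matrix.shape[0]), row_counts)
    for row, col, val in zip(rows, matrix.indices, matrix.data):
        yield int(row), int(col), val

m = csr_matrix(np.array([[0, 3, 0],
                         [0, 0, 0],
                         [4, 0, 5]]))
csr_entries = list(iter_csr_numpy(m))
print(csr_entries)
```

The trade-off is that np.repeat builds a full row-index array of length nnz up front. That is less than a complete tocoo() copy, since indices and data are reused as they are, but it gives up part of the memory advantage of the lazy version.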

NB:

In [43]: t=time.time(); sum(1 for x in rather_dense_sparse_matrix.data); print(time.time()-t)
52.48686504364014
In [44]: t=time.time(); sum(1 for x in enumerate(rather_dense_sparse_matrix.data)); print(time.time()-t)
70.19013023376465
In [45]: rather_dense_sparse_matrix
<99829x99829 sparse matrix of type '<class 'numpy.float16'>'
with 757622819 stored elements in Compressed Sparse Row format>

So yes, enumerate is somewhat slow(ish)

For the iterator:

In [47]: it = iter_csr(rather_dense_sparse_matrix)
In [48]: t=time.time(); sum(1 for x in it); print(time.time()-t)
113.something something

So you have to decide whether this overhead is acceptable; in my case, tocoo() caused memory overflows.

IMHO: such an iterator should be part of the csr_matrix interface, similar to items() in a dict() :)

Answer 6

Answered by zeroth

To loop over a variety of sparse matrices from the scipy.sparse module, I would use this small wrapper function (note that for Python 2 you are encouraged to use xrange and izip for better performance on large matrices):

from scipy.sparse import isspmatrix_coo, isspmatrix_csc, isspmatrix_csr, isspmatrix_lil

def iter_spmatrix(matrix):
    """Iterate over the stored elements of a ``scipy.sparse.*_matrix``.

    This always yields tuples of the form:
    (row, column, matrix-element)

    Currently this can iterate `coo`, `csc`, `lil` and `csr`; others may easily be added.

    Parameters
    ----------
    matrix : ``scipy.sparse.spmatrix``
      the sparse matrix whose non-zero elements are iterated
    """
    if isspmatrix_coo(matrix):
        for r, c, m in zip(matrix.row, matrix.col, matrix.data):
            yield r, c, m

    elif isspmatrix_csc(matrix):
        for c in range(matrix.shape[1]):
            for ind in range(matrix.indptr[c], matrix.indptr[c+1]):
                yield matrix.indices[ind], c, matrix.data[ind]

    elif isspmatrix_csr(matrix):
        for r in range(matrix.shape[0]):
            for ind in range(matrix.indptr[r], matrix.indptr[r+1]):
                yield r, matrix.indices[ind], matrix.data[ind]

    elif isspmatrix_lil(matrix):
        for r in range(matrix.shape[0]):
            for c, d in zip(matrix.rows[r], matrix.data[r]):
                yield r, c, d

    else:
        raise NotImplementedError("The iterator for this sparse matrix has not been implemented")
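
As a rough check that format-specific iteration visits the same entries in every storage format, the sketch below uses a compact variant of the wrapper. It is an illustration under assumptions, not the function above: the name iter_entries is hypothetical, it dispatches on the matrix's format attribute instead of the isspmatrix_* helpers, and it falls back to tocoo() for any format without a dedicated branch.

```python
import numpy as np
from scipy.sparse import csr_matrix

def iter_entries(matrix):
    """Yield (row, col, value) for stored entries; hypothetical compact variant."""
    fmt = matrix.format  # short format name such as "csr", "csc", "coo", "lil"
    if fmt == "csr":
        # indptr[r]:indptr[r+1] delimits the stored entries of row r.
        for r in range(matrix.shape[0]):
            for ind in range(matrix.indptr[r], matrix.indptr[r + 1]):
                yield r, int(matrix.indices[ind]), matrix.data[ind]
    elif fmt == "csc":
        # Same layout as CSR, but indptr delimits columns and indices holds rows.
        for c in range(matrix.shape[1]):
            for ind in range(matrix.indptr[c], matrix.indptr[c + 1]):
                yield int(matrix.indices[ind]), c, matrix.data[ind]
    else:
        # Assumption: converting to COO is acceptable for every other format.
        cx = matrix.tocoo()
        for r, c, v in zip(cx.row, cx.col, cx.data):
            yield int(r), int(c), v

base = csr_matrix(np.array([[0, 3, 0],
                            [0, 0, 0],
                            [4, 0, 5]]))

# Collect the entries per format; sorting removes differences in visiting order.
results = {fmt: sorted(iter_entries(base.asformat(fmt)))
           for fmt in ("coo", "csr", "csc", "lil", "dok")}
for fmt, found in results.items():
    print(fmt, found)
```

The visiting order differs between formats (CSC walks column by column, for example), so callers that need a particular order should sort the result or convert to the matching format first.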
