Building and updating a sparse matrix in Python using scipy

Question

Asked by syllogismos

I'm trying to build and update a sparse matrix as I read data from a file. The matrix has size 100000 x 40000.

What is the most efficient way to update multiple entries of the sparse matrix? Specifically, I need to increment each entry by 1.

Let's say I have row indices [2, 236, 246, 389, 1691]

and column indices [117, 3, 34, 2757, 74, 1635, 52]

so all the following entries must be incremented by one:

(2,117) (2,3) (2,34) (2,2757) ...

(236,117) (236,3) (236, 34) (236,2757) ...

and so on.

I'm already using lil_matrix, because scipy gave me a warning recommending it when I tried to update a single entry.

The lil_matrix format does not support updating multiple entries at once, though: matrix[1:3,0] += [2,3] gives me a NotImplementedError.

I can do this naively by incrementing every entry individually. I was wondering whether there is a better way to do this, or a better sparse matrix implementation that I could use.

My computer is also an average i5 machine with 4GB RAM, so I have to be careful not to blow it up :)

Answer 1

Accepted answer by Jaime

One possible way of doing this is to create a second matrix with 1s at your new coordinates and add it to the existing one:

>>> import numpy as np
>>> import scipy.sparse as sps
>>> rows, cols = 1000, 2000
>>> sps_acc = sps.coo_matrix((rows, cols)) # empty matrix
>>> for j in range(100): # add 100 sets of 100 1's
...     r = np.random.randint(rows, size=100)
...     c = np.random.randint(cols, size=100)
...     d = np.ones((100,))
...     sps_acc = sps_acc + sps.coo_matrix((d, (r, c)), shape=(rows, cols))
... 
>>> sps_acc
<1000x2000 sparse matrix of type '<type 'numpy.float64'>'
    with 9985 stored elements in Compressed Sparse Row format>
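
A variation on the same idea, offered only as a sketch: the loop above builds a new matrix on every addition, so an alternative is to collect all coordinates first and build the matrix once. This relies on duplicate (row, col) pairs in a COO matrix being summed when it is converted to CSR. The names n_rows, n_cols, row_chunks, col_chunks and acc are made up for this illustration, and the random data merely stands in for whatever would be read from the file.

```python
import numpy as np
import scipy.sparse as sps

n_rows, n_cols = 1000, 2000
rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

row_chunks, col_chunks = [], []
for _ in range(100):  # 100 batches of 100 coordinates each
    row_chunks.append(rng.integers(n_rows, size=100))
    col_chunks.append(rng.integers(n_cols, size=100))

r = np.concatenate(row_chunks)
c = np.concatenate(col_chunks)
d = np.ones(r.size)

# Build the matrix once; duplicate (row, col) pairs are summed
# when the COO matrix is converted to CSR.
acc = sps.coo_matrix((d, (r, c)), shape=(n_rows, n_cols)).tocsr()
```

This keeps every coordinate in memory until the matrix is built, so for a very large input it may be necessary to build and add a partial matrix every so often instead of only once at the end.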

Answer 2

Answer by Ray

import scipy.sparse

rows = [2, 236, 246, 389, 1691]
cols = [117, 3, 34, 2757, 74, 1635, 52]
prod = [(x, y) for x in rows for y in cols] # combinations
r = [x for (x, y) in prod] # x_coordinate
c = [y for (x, y) in prod] # y_coordinate
data = [1] * len(r)
m = scipy.sparse.coo_matrix((data, (r, c)), shape=(100000, 40000))

I think this works well and doesn't need explicit loops. I am following the documentation directly. The resulting matrix is:

<100000x40000 sparse matrix of type '<type 'numpy.int32'>'
    with 35 stored elements in COOrdinate format>
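
The same coordinate lists can probably also be produced with numpy instead of list comprehensions. The following is a minimal sketch under that assumption, using np.repeat and np.tile to pair every row index with every column index; it is not part of the original answer.

```python
import numpy as np
import scipy.sparse as sps

rows = np.array([2, 236, 246, 389, 1691])
cols = np.array([117, 3, 34, 2757, 74, 1635, 52])

# Pair every row index with every column index.
r = np.repeat(rows, cols.size)  # each row index repeated once per column
c = np.tile(cols, rows.size)    # the column list repeated once per row
data = np.ones(r.size, dtype=np.int32)

m = sps.coo_matrix((data, (r, c)), shape=(100000, 40000))
```

To increment an existing matrix, m could then be added to it in the same way as in the accepted answer.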

Answer 3

Answer by Warren Weckesser

This answer expands on the comment by @behzad.nouri. To increment the values at the "outer product" of your lists of row and column indices, create them as numpy arrays configured for broadcasting. In this case, that means putting the row indices into a column. For example,

In [57]: import numpy as np

In [58]: from scipy.sparse import lil_matrix

In [59]: a = lil_matrix((4,4), dtype=int)

In [60]: a.toarray()
Out[60]: 
array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]])

In [61]: rows = np.array([1,3]).reshape(-1, 1)

In [62]: rows
Out[62]: 
array([[1],
       [3]])

In [63]: cols = np.array([0, 2, 3])

In [64]: a[rows, cols] += np.ones((rows.size, cols.size))

In [65]: a.toarray()
Out[65]: 
array([[0, 0, 0, 0],
       [1, 0, 1, 1],
       [0, 0, 0, 0],
       [1, 0, 1, 1]])

In [66]: rows = np.array([0, 1]).reshape(-1,1)

In [67]: cols = np.array([1, 2])

In [68]: a[rows, cols] += np.ones((rows.size, cols.size))

In [69]: a.toarray()
Out[69]: 
array([[0, 1, 1, 0],
       [1, 1, 2, 1],
       [0, 0, 0, 0],
       [1, 0, 1, 1]])
