Using a sparse matrix versus a numpy array

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/36969886/


python · numpy · matrix · scipy · scikit-learn

Asked by patrick

I am creating some numpy arrays with word counts in Python: rows are documents, columns are counts for word X. If I have a lot of zero counts, people suggest using sparse matrices when processing these further, e.g. in a classifier. When feeding a numpy array versus a sparse matrix into the Scikit logistic regression classifier, it did not seem to make much of a difference, however. So I was wondering about three things:

  • Wikipedia says

    a sparse matrix is a matrix in which most of the elements are zero

    Is that an appropriate way to determine when to use a sparse matrix format - as soon as more than 50% of the values are zero? Or does it make sense to use one just in case?

  • How much does a sparse matrix help performance in a task like mine, especially compared to a numpy array or a standard list?
  • So far, I collect my data into a numpy array, then convert it into the csr_matrix in Scipy. Is that the right way to do it? I could not figure out how to build a sparse matrix from the ground up, and that might be impossible.
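For reference, the conversion described in the last bullet is a one-liner; a minimal sketch (the word-count values here are made up for illustration):

```python
import numpy as np
from scipy import sparse

# Hypothetical document-term counts: rows are documents, columns are words
counts = np.array([[0, 2, 0, 0],
                   [1, 0, 0, 3],
                   [0, 0, 0, 0]])

X = sparse.csr_matrix(counts)   # dense array -> CSR sparse matrix
print(X.nnz)                    # number of explicitly stored nonzeros
```

As the accepted answer notes below, going through the dense array first means this route cannot save memory during construction - only in later processing.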

Any help is much appreciated!

Accepted answer by hpaulj

The scipy sparse matrix package, and similar ones in MATLAB, was based on ideas developed for linear algebra problems, such as solving large sparse linear equations (e.g. finite difference and finite element implementations). So things like matrix product (the dot product for numpy arrays) and equation solvers are well developed.

My rough experience is that a sparse csr matrix product has to have about 1% sparsity to be faster than the equivalent dense dot operation - in other words, one nonzero value for every 99 zeros. (But see the tests below.)

People also try to use sparse matrices to save memory. But keep in mind that such a matrix has to store 3 arrays of values (at least in the coo format). So the sparsity has to be less than 1/3 to start saving memory. Obviously you aren't going to save memory if you first build the dense array and create the sparse one from that.

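A rough sketch illustrating the "three arrays" memory point (the 1000x1000 size and 1% density are arbitrary choices for the demonstration):

```python
import numpy as np
from scipy import sparse

A = np.zeros((1000, 1000))
A[::10, ::10] = 1.0                # mark 1% of the entries nonzero

S = sparse.coo_matrix(A)
dense_bytes = A.nbytes             # full 1000*1000 float64 buffer
sparse_bytes = S.data.nbytes + S.row.nbytes + S.col.nbytes
print(sparse_bytes < dense_bytes)  # far smaller at this density
```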
The scipy package implements many sparse formats. The coo format is the easiest to understand and build. Build one according to the documentation and look at its .data, .row, and .col attributes (three 1d arrays).

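For instance, a minimal coo construction from the three 1d arrays (the values are made up for illustration):

```python
import numpy as np
from scipy import sparse

data = np.array([4.0, 5.0, 7.0])   # the nonzero values
row = np.array([0, 1, 2])          # their row indices
col = np.array([1, 2, 0])          # their column indices

M = sparse.coo_matrix((data, (row, col)), shape=(3, 3))
# M.data, M.row and M.col hold exactly the three arrays above
```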
csr and csc are typically built from the coo format; they compress the data a bit, making them a bit harder to understand. But they have most of the math functionality.

It is also possible to index the csr format, though in general this is slower than the equivalent dense matrix/array case. Other operations, like changing values (especially from 0 to nonzero), concatenation, and incremental growth, are also slower.

lil (lists of lists) is also easy to understand, and best for incremental building. dok is actually a dictionary subclass.

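A sketch of the incremental-building pattern with lil (sizes and values are arbitrary):

```python
from scipy import sparse

L = sparse.lil_matrix((3, 4))   # start empty, then grow cell by cell
L[0, 1] = 2
L[2, 3] = 5

X = L.tocsr()                   # convert once, after building is done
```

Setting elements one by one on a csr matrix directly would trigger the "changing values from 0 to nonzero is slow" issue mentioned above; lil avoids that.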
A key point is that a sparse matrix is limited to 2d, and in many ways behaves like the np.matrix class (though it isn't a subclass).

A search for other questions using scikit-learn and sparse might be the best way of finding the pros/cons of using these matrices. I've answered a number of questions, but I know the 'sparse' side better than the 'learn' side. I think they are useful, but I get the sense that the fit isn't always the best. Any customization is on the learn side. So far the sparse package has not been optimized for this application.


I just tried some matrix product tests, using the sparse.random method to create a sparse matrix with a specified sparsity. Sparse matrix multiplication performed better than I expected.

In [251]: M=sparse.random(1000,1000,.5)

In [252]: timeit M1=M*M
1 loops, best of 3: 2.78 s per loop

In [253]: timeit Ma=M.toarray(); M2=Ma.dot(Ma)
1 loops, best of 3: 4.28 s per loop

It is a size issue; for smaller matrices the dense dot is faster:

In [255]: M=sparse.random(100,100,.5)

In [256]: timeit M1=M*M
100 loops, best of 3: 3.24 ms per loop

In [257]: timeit Ma=M.toarray(); M2=Ma.dot(Ma)
1000 loops, best of 3: 1.44 ms per loop

But compare indexing:

In [268]: timeit M.tocsr()[500,500]
10 loops, best of 3: 86.4 ms per loop

In [269]: timeit Ma[500,500]
1000000 loops, best of 3: 318 ns per loop

In [270]: timeit Ma=M.toarray();Ma[500,500]
10 loops, best of 3: 23.6 ms per loop

Answer by lejlot

a sparse matrix is a matrix in which most of the elements are zero

Is that an appropriate way to determine when to use a sparse matrix format - as soon as > 50 % of the values are zero? Or does it make sense to use just in case?

There is no general rule. It solely depends on your exact usage later on. You have to compute the complexity of the model with and without the sparse matrix, and then you can find the "sweet spot". This will depend on both the number of samples and the dimension. In general, it often boils down to matrix multiplications of the form

X' W

where X is the data matrix N x d, and W is some weight matrix d x K. Consequently, "dense" multiplication takes NdK time, while the sparse one, assuming that your average per-row sparsity is p, takes NpdK. Thus if your sparsity is 50% you can expect nearly a 2x faster operation. The harder part is to estimate the overhead of sparse access as opposed to the heavily optimized dense one.

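A back-of-the-envelope check of the NdK versus NpdK estimate (the sizes here are hypothetical):

```python
N, d, K = 10_000, 1_000, 10   # samples, features, output classes
p = 0.5                       # fraction of nonzero entries per row

dense_ops = N * d * K                  # multiplications for dense X'W
sparse_ops = N * p * d * K             # only nonzero entries contribute
print(dense_ops / sparse_ops)          # speedup factor before overhead
```

This only counts multiplications; as the answer notes, the real speedup is smaller because sparse access carries per-element overhead.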
How much does a sparse matrix help performance in a task like mine, especially compared to a numpy array or a standard list?

For the particular case of LR, this can even be a few times faster than the dense format, but in order to observe the difference you need lots of data (>1000 samples) of high dimension (>100).

So far, I collect my data into a numpy array, then convert into the csr_matrix in Scipy. Is that the right way to do it? I could not figure out how to build a sparse matrix from the ground up, and that might be impossible.

No, it is not a good approach. You can build it "from scratch" by, for example, first building a dictionary and then converting it; there are plenty of ways to construct a sparse matrix without going through a dense one in the first place.

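One way to do this for word counts is to accumulate coordinate lists and build a coo matrix directly, never materializing a dense array. A sketch, where the documents and vocabulary mapping are hypothetical:

```python
from scipy import sparse

docs = [{"cat": 2, "dog": 1}, {"dog": 3}]   # per-document word counts
vocab = {"cat": 0, "dog": 1}                # word -> column index

rows, cols, vals = [], [], []
for i, counts in enumerate(docs):
    for word, c in counts.items():
        rows.append(i)          # document index
        cols.append(vocab[word])
        vals.append(c)

X = sparse.coo_matrix((vals, (rows, cols)),
                      shape=(len(docs), len(vocab))).tocsr()
```

Memory usage stays proportional to the number of nonzero counts, not to documents x vocabulary.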
Answer by komuher

@hpaulj Your timeit is misleading: you are getting slow results because the conversion of the sparse matrix to a numpy array (toarray(), which is slowish) is included inside the timing. With that in mind:

M=sparse.random(1000,1000,.5)
Ma=M.toarray()

%timeit -n 25 M1=M*M
352 ms ± 1.18 ms per loop (mean ± std. dev. of 7 runs, 25 loops each)

%timeit -n 25 M2=Ma.dot(Ma)
13.5 ms ± 2.17 ms per loop (mean ± std. dev. of 7 runs, 25 loops each)

To get close to numpy speed we need to have

M=sparse.random(1000,1000,.03)

%timeit -n 25 M1=M*M
10.7 ms ± 119 μs per loop (mean ± std. dev. of 7 runs, 25 loops each)

%timeit -n 25 M2=Ma.dot(Ma)
11.4 ms ± 564 μs per loop (mean ± std. dev. of 7 runs, 25 loops each)