C++: Why is boost's matrix multiplication slower than mine?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me), citing the original source: http://stackoverflow.com/questions/11110604/

Date: 2020-08-27 14:52:02  Source: igfitidea

Why is boosts matrix multiplication slower than mine?

Tags: c++, performance, boost, ublas, boost-ublas

Asked by Martin Thoma

I have implemented one matrix multiplication with boost::numeric::ublas::matrix (see my full, working boost code):

Result result = read ();

boost::numeric::ublas::matrix<int> C;
C = boost::numeric::ublas::prod(result.A, result.B);

and another one with the standard algorithm (see full standard code):


vector< vector<int> > ijkalgorithm(vector< vector<int> > A, 
                                    vector< vector<int> > B) {
    int n = A.size();

    // initialise C with 0s
    vector<int> tmp(n, 0);
    vector< vector<int> > C(n, tmp);

    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            for (int j = 0; j < n; j++) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
    return C;
}

This is how I test the speed:


time boostImplementation.out > boostResult.txt
diff boostResult.txt correctResult.txt

time simpleImplementation.out > simpleResult.txt
diff simpleResult.txt correctResult.txt

Both programs read a hard-coded text file which contains two 2000 x 2000 matrices. Both programs were compiled with these flags:

g++ -std=c++98 -Wall -O3 -g $(PROBLEM).cpp -o $(PROBLEM).out -pedantic

I got 15 seconds for my implementation and over 4 minutes for the boost implementation!

Edit: After compiling it with

g++ -std=c++98 -Wall -pedantic -O3 -D NDEBUG -DBOOST_UBLAS_NDEBUG library-boost.cpp -o library-boost.out

I got 28.19 seconds for the ikj-algorithm and 60.99 seconds for Boost. So Boost is still considerably slower.

Why is boost so much slower than my implementation?


Answer by vitaut

The slower performance of the uBLAS version can be partly explained by its debugging features, as was pointed out by TJD.

Here's the time taken by the uBLAS version with debugging on:


real    0m19.966s
user    0m19.809s
sys     0m0.112s

Here's the time taken by the uBLAS version with debugging off (-DNDEBUG -DBOOST_UBLAS_NDEBUG compiler flags added):

real    0m7.061s
user    0m6.936s
sys     0m0.096s

So with debugging off, the uBLAS version is almost 3 times faster.

The remaining performance difference can be explained by quoting the following section of the uBLAS FAQ, "Why is uBLAS so much slower than (atlas-)BLAS":

An important design goal of ublas is to be as general as possible.


This generality almost always comes with a cost. In particular the prod function template can handle different types of matrices, such as sparse or triangular ones. Fortunately uBLAS provides alternatives optimized for dense matrix multiplication, in particular axpy_prod and block_prod. Here are the results of comparing different methods (times in seconds):

ijkalgorithm   prod   axpy_prod  block_prod
   1.335       7.061    1.330       1.278

As you can see, both axpy_prod and block_prod are somewhat faster than your implementation. Measuring just the computation time without I/O, removing unnecessary copying, and careful choice of the block size for block_prod (I used 64) can make the difference more pronounced.

See also the uBLAS FAQ and Effective uBlas and general code optimization.

Answer by panda-34

I believe your compiler doesn't optimize enough. uBLAS code makes heavy use of templates, and templates require heavy use of optimizations. I ran your code through the MS VC 7.1 compiler in release mode for 1000x1000 matrices; it gives me

10.064s for uBLAS

7.851s for vector

The difference is still there, but by no means overwhelming. uBLAS's core concept is lazy evaluation, so prod(A, B) evaluates results only when needed. For example prod(A, B)(10,100) will execute in no time, since only that one element will actually be calculated. As such there's actually no dedicated algorithm for whole-matrix multiplication which could be optimized (see below). But you could help the library a little; declaring

matrix<int, column_major> B;

will reduce the running time to 4.426s, which beats your function with one hand tied. This declaration makes access to memory more sequential when multiplying matrices, optimizing cache usage.

P.S. Having read the uBLAS documentation to the end ;), you should have found out that there are actually two dedicated functions to multiply whole matrices at once: axpy_prod and opb_prod. So

opb_prod(A, B, C, true);

even on an unoptimized row_major B matrix, executes in 8.091s and is on par with your vector algorithm.

P.P.S. There are even more optimizations:

C = block_prod<matrix<int>, 1024>(A, B);

executes in 4.4s, no matter whether B is column_major or row_major. Consider the description: "The function block_prod is designed for large dense matrices." Choose specific tools for specific tasks!

Answer by Michael Lehn

I created a little website, Matrix-Matrix Product Experiments with uBLAS. It's about integrating a new implementation for the matrix-matrix product into uBLAS. If you already have the boost library, it consists of only 4 additional files, so it is pretty much self-contained.

I would be interested if others could run the simple benchmarks on different machines.
