C++: Why is Boost's matrix multiplication slower than mine?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/11110604/
Asked by Martin Thoma
I have implemented one matrix multiplication with boost::numeric::ublas::matrix (see my full, working Boost code):
Result result = read ();
boost::numeric::ublas::matrix<int> C;
C = boost::numeric::ublas::prod(result.A, result.B);
and another one with the standard algorithm (see full standard code):
vector< vector<int> > ijkalgorithm(vector< vector<int> > A,
                                   vector< vector<int> > B) {
    int n = A.size();

    // initialise C with 0s
    vector<int> tmp(n, 0);
    vector< vector<int> > C(n, tmp);

    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            for (int j = 0; j < n; j++) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
    return C;
}
This is how I test the speed:
time boostImplementation.out > boostResult.txt
diff boostResult.txt correctResult.txt
time simpleImplementation.out > simpleResult.txt
diff simpleResult.txt correctResult.txt
Both programs read a hard-coded text file which contains two 2000 x 2000 matrices. Both programs were compiled with these flags:
g++ -std=c++98 -Wall -O3 -g $(PROBLEM).cpp -o $(PROBLEM).out -pedantic
I got 15 seconds for my implementation and over 4 minutes for the Boost implementation!
Edit: After compiling it with
g++ -std=c++98 -Wall -pedantic -O3 -D NDEBUG -DBOOST_UBLAS_NDEBUG library-boost.cpp -o library-boost.out
I got 28.19 seconds for the ikj-algorithm and 60.99 seconds for Boost. So Boost is still considerably slower.
Why is Boost so much slower than my implementation?
Answered by vitaut
The slower performance of the uBLAS version can be partly explained by the latter's debugging features, as was pointed out by TJD.
Here's the time taken by the uBLAS version with debugging on:
real 0m19.966s
user 0m19.809s
sys 0m0.112s
Here's the time taken by the uBLAS version with debugging off (-DNDEBUG -DBOOST_UBLAS_NDEBUG compiler flags added):
real 0m7.061s
user 0m6.936s
sys 0m0.096s
So with debugging off, the uBLAS version is almost 3 times faster.
The remaining performance difference can be explained by quoting the following section of the uBLAS FAQ, "Why is uBLAS so much slower than (atlas-)BLAS":
An important design goal of ublas is to be as general as possible.
This generality almost always comes with a cost. In particular, the prod function template can handle different types of matrices, such as sparse or triangular ones. Fortunately uBLAS provides alternatives optimized for dense matrix multiplication, in particular axpy_prod and block_prod. Here are the results of comparing different methods:
ijkalgorithm    prod     axpy_prod    block_prod
1.335           7.061    1.330        1.278
As you can see, both axpy_prod and block_prod are somewhat faster than your implementation. Measuring just the computation time without I/O, removing unnecessary copying, and a careful choice of the block size for block_prod (I used 64) can make the difference more profound.
See also the uBLAS FAQ and Effective uBlas and general code optimization.
Answered by panda-34
I believe your compiler doesn't optimize enough. uBLAS code makes heavy use of templates, and templates require heavy use of optimizations. I ran your code through the MS VC 7.1 compiler in release mode for 1000x1000 matrices; it gives me
10.064 s for uBLAS
7.851 s for vector
The difference is still there, but by no means overwhelming. uBLAS's core concept is lazy evaluation, so prod(A, B) evaluates results only when needed; e.g. prod(A, B)(10,100) will execute in no time, since only that one element will actually be calculated. As such, there's actually no dedicated algorithm for whole-matrix multiplication which could be optimized (see below). But you could help the library a little: declaring
matrix<int, column_major> B;
will reduce running time to 4.426 s, which beats your function with one hand tied. This declaration makes access to memory more sequential when multiplying matrices, optimizing cache usage.
P.S. Having read the uBLAS documentation to the end ;), you should have found out that there are actually dedicated functions to multiply whole matrices at once: two functions, axpy_prod and opb_prod. So
opb_prod(A, B, C, true);
even on an unoptimized row_major B matrix executes in 8.091 s and is on par with your vector algorithm.
P.P.S. There are even more optimizations:
C = block_prod<matrix<int>, 1024>(A, B);
executes in 4.4 s, no matter whether B is column_major or row_major.
Consider the description: "The function block_prod is designed for large dense matrices." Choose specific tools for specific tasks!
Answered by Michael Lehn
I created a little website, Matrix-Matrix Product Experiments with uBLAS. It's about integrating a new implementation of the matrix-matrix product into uBLAS. If you already have the Boost library, it only consists of 4 additional files, so it is pretty much self-contained.
I would be interested to see whether others could run the simple benchmarks on different machines.