
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/10366054/

Date: 2020-08-27 13:57:32 · Source: igfitidea

How can the C++ Eigen library perform better than specialized vendor libraries?

Tags: c++, performance, eigen

Asked by Anycorn

I was looking over the performance benchmarks: http://eigen.tuxfamily.org/index.php?title=Benchmark


I could not help but notice that Eigen appears to consistently outperform all the specialized vendor libraries. The question is: how is that possible? One would assume that MKL/GotoBLAS use processor-specific tuned code, while Eigen is rather generic.


Notice this chart: http://download.tuxfamily.org/eigen/btl-results-110323/aat.pdf, essentially a dgemm benchmark (an A*Aᵀ product). For N = 1000, Eigen reaches roughly 17 GFlops while MKL reaches only about 12 GFlops.


Answered by chrisaycock

Eigen has lazy evaluation. From "How does Eigen compare to BLAS/LAPACK?":


For operations involving complex expressions, Eigen is inherently faster than any BLAS implementation because it can handle and optimize a whole operation globally -- while BLAS forces the programmer to split complex operations into small steps that match the BLAS fixed-function API, which incurs inefficiency due to introduction of temporaries. See for instance the benchmark result of a Y = aX + bY operation which involves two calls to BLAS level1 routines while Eigen automatically generates a single vectorized loop.


The second chart in the benchmarks is Y = a*X + b*Y, which Eigen was specifically designed to handle. It should be no wonder that a library wins a benchmark it was created for. You'll notice that the more generic benchmarks, like matrix-matrix multiplication, don't show any advantage for Eigen.


Answered by InsideLoop

Benchmarks are designed to be misinterpreted.


Let's look at the matrix * matrix product. The benchmark available on this page from the Eigen website tells you that Eigen (with its own BLAS) gives timings similar to the MKL for large matrices (n = 1000). I've compared Eigen 3.2.6 with MKL 11.3 on my computer (a Core i7 laptop), and for such matrices the MKL is 3 times faster than Eigen using one thread, and 10 times faster using 4 threads. That is a completely different conclusion, and there are two reasons for it: Eigen 3.2.6 (its internal BLAS) does not use AVX, and it does not seem to make good use of multithreading. The benchmark hides both weaknesses by using a CPU without AVX support and by running single-threaded.


Usually, those C++ libraries (Eigen, Armadillo, Blaze) bring two things:


  • Nice operator overloading: you can use + and * with vectors and matrices. To get good performance, they have to use a tricky technique known as expression templates, which avoids temporaries when they would hurt performance (such as y = alpha*x1 + beta*x2 with vectors y, x1, x2) and introduces them when they are useful (such as A = B*C with matrices A, B, C). They can also reorder operations to reduce computation: if A, B, C are matrices, A*B*C can be computed as (A*B)*C or A*(B*C) depending upon their sizes.
  • Internal BLAS: to compute the product of 2 matrices, they can rely either on their internal BLAS or on an externally provided one (MKL, OpenBLAS, ATLAS). On Intel chips with large matrices, the MKL is almost impossible to beat. For small matrices, one can beat the MKL, as it was not designed for that kind of problem.

Usually, when those libraries provide benchmarks against the MKL, they use old hardware and do not turn on multithreading, so they can be on par with the MKL. They might also compare a fused BLAS level 1 operation such as y = alpha*x1 + beta*x2 against two separate calls to BLAS level 1 functions, which is a silly way to use BLAS anyway.


In a nutshell, those libraries are extremely convenient thanks to their overloading of + and *, which is extremely difficult to do without losing performance, and they usually do a good job of it. But when they give you benchmarks saying that their own BLAS is on par with or beats the MKL, be careful and do your own benchmark. You'll usually get different results ;-).


Answered by Michael Lehn

About the comparison ATLAS vs. Eigen


Have a look at this thread on the Eigen mailing list starting here:


It shows for instance that ATLAS outperforms Eigen on the matrix-matrix product by 46%:


More benchmarks results and details on how the benchmarks were done can be found here:


Edit:


For my lecture "Software Basics for High Performance Computing" I created a little framework called ulmBLAS. It contains the ATLAS benchmark suite, and students could implement their own matrix-matrix product based on the BLIS papers. You can have a look at the final benchmarks, which also measure Eigen:


You can use the ulmBLAS framework to make your own benchmarks.


Also have a look at


Answered by Ilya Yaroshenko

Generic code can be fast because Compile Time Function Evaluation (CTFE) allows choosing an optimal register-blocking strategy (small temporary sub-matrices stored in CPU registers).


Mir GLAS and Intel MKL are faster than Eigen and OpenBLAS. Mir GLAS is more generic compared to Eigen. See also the benchmark and reddit thread.


Answered by Michael Lehn

I sent the same question to the ATLAS mailing list some time ago:


http://sourceforge.net/mailarchive/message.php?msg_id=28711667


Clint (the ATLAS developer) does not trust these benchmarks. He suggested a trustworthy benchmarking procedure. As soon as I have some free time I will do that kind of benchmarking.


If the BLAS functionality of Eigen is actually faster than that of GotoBLAS/GotoBLAS2, ATLAS, or the MKL, then they should provide a standard BLAS interface anyway. This would allow linking LAPACK against such an Eigen-BLAS. In that case, it would also be an interesting option for Matlab and friends.


Answered by sth

It doesn't seem to consistently outperform other libraries, as can be seen in the graphs further down the page you linked. Different libraries are optimized for different use cases, and different libraries are faster for different problems.


This is not surprising, since you usually cannot optimize perfectly for all use cases. Optimizing for one specific operation usually limits the optimization options for other use cases.
