Original question: http://stackoverflow.com/questions/1303182/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow.
How does BLAS get such extreme performance?
Asked by DeusAduro
Out of curiosity I decided to benchmark my own matrix multiplication function against the BLAS implementation... I was, to say the least, surprised at the result:
Custom Implementation, 10 trials of 1000x1000 matrix multiplication:
Took: 15.76542 seconds.
BLAS Implementation, 10 trials of 1000x1000 matrix multiplication:
Took: 1.32432 seconds.
This is using single precision floating point numbers.
My Implementation:
#include <cstring>    // std::memset
#include <stdexcept>  // std::runtime_error

template<class ValT>
void mmult(const ValT* A, int ADim1, int ADim2,
           const ValT* B, int BDim1, int BDim2, ValT* C)
{
    if ( ADim2!=BDim1 )
        throw std::runtime_error("Error sizes off");
    // C is ADim1 x BDim2; everything is stored column-major,
    // so element (r,c) of an m x n matrix lives at index c*m + r.
    memset((void*)C, 0, sizeof(ValT)*ADim1*BDim2);
    int cc2, cc1, cr1;
    for ( cc2=0 ; cc2<BDim2 ; ++cc2 )
        for ( cc1=0 ; cc1<ADim2 ; ++cc1 )
            for ( cr1=0 ; cr1<ADim1 ; ++cr1 )
                C[cc2*ADim1+cr1] += A[cc1*ADim1+cr1]*B[cc2*BDim1+cc1];
}
I have two questions:
- Given that a matrix-matrix multiplication, say nxm * mxn, requires n*n*m multiplications, the case above takes 1000^3 or 1e9 operations per trial. How is it possible on my 2.6GHz processor for BLAS to do 10*1e9 operations in 1.32 seconds? Even if multiplications were a single operation and there was nothing else being done, it should take ~4 seconds.
- Why is my implementation so much slower?
Answered by Michael Lehn
A good starting point is the great book The Science of Programming Matrix Computations by Robert A. van de Geijn and Enrique S. Quintana-Ortí. They provide a free download version.
BLAS is divided into three levels:
Level 1 defines a set of linear algebra functions that operate on vectors only. These functions benefit from vectorization (e.g. from using SSE).
Level 2 functions are matrix-vector operations, e.g. some matrix-vector product. These functions could be implemented in terms of Level 1 functions. However, you can boost the performance of these functions if you can provide a dedicated implementation that makes use of some multiprocessor architecture with shared memory.
Level 3 functions are operations like the matrix-matrix product. Again, you could implement them in terms of Level 2 functions. But Level 3 functions perform O(N^3) operations on O(N^2) data. So if your platform has a cache hierarchy, you can boost performance by providing a dedicated implementation that is cache optimized/cache friendly. This is nicely described in the book. The main boost of Level 3 functions comes from cache optimization. This boost significantly exceeds the further gains from parallelism and other hardware optimizations.
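To make the layering concrete, here is a minimal, deliberately unoptimized C++ sketch of the three levels, with each level built on the one below it. The function names mirror BLAS naming, but the signatures are simplified for illustration:

#include <cstddef>

// Level 1: y += alpha*x                      -- O(n) work on O(n) data
void axpy(std::size_t n, double alpha, const double* x, double* y) {
    for (std::size_t i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}

// Level 2: y += A*x, A m x n column-major    -- one axpy per column of A
void gemv(std::size_t m, std::size_t n, const double* A,
          const double* x, double* y) {
    for (std::size_t j = 0; j < n; ++j)
        axpy(m, x[j], A + j * m, y);          // y += x[j] * A(:,j)
}

// Level 3: C += A*B (A m x k, B k x n, C m x n, all column-major)
// -- one gemv per column of B: O(m*n*k) work on O(n^2) data
void gemm(std::size_t m, std::size_t k, std::size_t n,
          const double* A, const double* B, double* C) {
    for (std::size_t j = 0; j < n; ++j)
        gemv(m, k, A, B + j * k, C + j * m);  // C(:,j) += A * B(:,j)
}

A real Level 3 implementation would not call gemv column by column like this; it would pack blocks of A and B and run a cache-tuned kernel on them, which is exactly where the O(N^3)-work-on-O(N^2)-data win described above comes from.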
By the way, most (or even all) of the high performance BLAS implementations are NOT implemented in Fortran. ATLAS is implemented in C. GotoBLAS/OpenBLAS is implemented in C and its performance-critical parts in assembler. Only the reference implementation of BLAS is implemented in Fortran. However, all these BLAS implementations provide a Fortran interface so that they can be linked against LAPACK (LAPACK gains all its performance from BLAS).
Optimizing compilers play a minor role in this respect (and for GotoBLAS/OpenBLAS, the compiler does not matter at all).
IMHO no BLAS implementation uses algorithms like the Coppersmith–Winograd algorithm or the Strassen algorithm. I am not exactly sure about the reason, but this is my guess:
- Maybe it's not possible to provide a cache-optimized implementation of these algorithms (i.e. you would lose more than you would gain)
- These algorithms are numerically not stable. As BLAS is the computational kernel of LAPACK, this is a no-go.
Edit/Update:
The new and groundbreaking papers on this topic are the BLIS papers. They are exceptionally well written. For my lecture "Software Basics for High Performance Computing" I implemented the matrix-matrix product following their paper. Actually I implemented several variants of the matrix-matrix product. The simplest variant is entirely written in plain C and has less than 450 lines of code. All the other variants merely optimize the following loops:
/* Zero the MR x NR micro-block AB that accumulates the result. */
for (l=0; l<MR*NR; ++l) {
    AB[l] = 0;
}
/* One rank-1 update per step: AB += A(:,l) * B(l,:).  A and B point into
   packed buffers, so each step reads MR contiguous values of A and NR of B. */
for (l=0; l<kc; ++l) {
    for (j=0; j<NR; ++j) {
        for (i=0; i<MR; ++i) {
            AB[i+j*MR] += A[i]*B[j];
        }
    }
    A += MR;
    B += NR;
}
The overall performance of the matrix-matrix product only depends on these loops; about 99.9% of the time is spent here. In the other variants I used intrinsics and assembler code to improve the performance. You can see the tutorial going through all the variants here:
ulmBLAS: Tutorial on GEMM (Matrix-Matrix Product)
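To give a flavour of what such an intrinsics variant can look like, here is a hypothetical AVX version of the loop above for double precision, assuming MR = NR = 4 (a sketch in the spirit of the tutorial, not code taken from ulmBLAS):

#include <immintrin.h>

enum { MR = 4, NR = 4 };

// Micro-kernel: AB (MR x NR, column-major) += sum over l of A(:,l)*B(l,:).
// Each column of AB lives in one YMM register for the whole k-loop.
void ugemm_4x4(long kc, const double* A, const double* B, double* AB)
{
    __m256d ab[NR];
    for (int j = 0; j < NR; ++j)
        ab[j] = _mm256_setzero_pd();

    for (long l = 0; l < kc; ++l) {
        __m256d a = _mm256_loadu_pd(A);              // MR entries of the packed A panel
        for (int j = 0; j < NR; ++j) {
            __m256d b = _mm256_broadcast_sd(&B[j]);  // B(l,j) replicated 4 times
            ab[j] = _mm256_add_pd(ab[j], _mm256_mul_pd(a, b));  // AB(:,j) += A(:,l)*B(l,j)
        }
        A += MR;
        B += NR;
    }
    for (int j = 0; j < NR; ++j)
        _mm256_storeu_pd(&AB[j * MR], ab[j]);
}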
Together with the BLIS papers it becomes fairly easy to understand how libraries like Intel MKL can achieve such performance, and why it does not matter whether you use row-major or column-major storage!
The final benchmarks are here (we called our project ulmBLAS):
Benchmarks for ulmBLAS, BLIS, MKL, openBLAS and Eigen
Another Edit/Update:
I also wrote a tutorial on how BLAS gets used for numerical linear algebra problems like solving a system of linear equations:
High Performance LU Factorization
(This LU factorization is for example used by Matlab for solving a system of linear equations.)
I hope to find time to extend the tutorial to describe and demonstrate how to realise a highly scalable parallel implementation of the LU factorization, as in PLASMA.
Ok, here you go: Coding a Cache Optimized Parallel LU Factorization
P.S.: I also did some experiments on improving the performance of uBLAS. It actually is pretty simple to boost (yeah, play on words :)) the performance of uBLAS:
Here is a similar project with BLAZE:
Answered by Andrew Tomazos
So first of all BLAS is just an interface of about 50 functions. There are many competing implementations of the interface.
Firstly I will mention things that are largely unrelated:
- Fortran vs C: makes no difference
- Advanced matrix algorithms such as Strassen: implementations don't use them as they don't help in practice
Most implementations break each operation into small-dimension matrix or vector operations in a more or less obvious way. For example, a large 1000x1000 matrix multiplication may be broken into a sequence of 50x50 matrix multiplications.
These fixed-size small-dimension operations (called kernels) are hardcoded in CPU-specific assembly code using several CPU features of their target:
- SIMD-style instructions
- Instruction Level Parallelism
- Cache-awareness
Furthermore these kernels can be executed in parallel with respect to each other using multiple threads (CPU cores), in the typical map-reduce design pattern.
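A minimal sketch of that idea: partition C into independent horizontal slices and hand each slice to one thread. A toy triple loop stands in for the tuned assembler kernels a real BLAS would call:

#include <algorithm>
#include <thread>
#include <vector>

// Multiply one horizontal slice of C (rows [r0, r1)), row-major storage.
// Each slice of C is written by exactly one thread, so no locking is needed.
void gemm_slice(int r0, int r1, int n, int k,
                const double* A, const double* B, double* C) {
    for (int i = r0; i < r1; ++i)
        for (int l = 0; l < k; ++l)
            for (int j = 0; j < n; ++j)
                C[i * n + j] += A[i * k + l] * B[l * n + j];
}

void parallel_gemm(int m, int n, int k,
                   const double* A, const double* B, double* C) {
    int p = static_cast<int>(std::max(1u, std::thread::hardware_concurrency()));
    int rows = (m + p - 1) / p;              // rows per thread, rounded up
    std::vector<std::thread> workers;
    for (int r0 = 0; r0 < m; r0 += rows)     // the "map" step
        workers.emplace_back(gemm_slice, r0, std::min(m, r0 + rows), n, k, A, B, C);
    for (auto& t : workers) t.join();        // wait for all slices
}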
Take a look at ATLAS which is the most commonly used open source BLAS implementation. It has many different competing kernels, and during the ATLAS library build process it runs a competition among them (some are even parameterized, so the same kernel can have different settings). It tries different configurations and then selects the best for the particular target system.
(Tip: that is why, if you are using ATLAS, you are better off building and tuning the library by hand for your particular machine than using a prebuilt one.)
Answered by jalf
First, there are more efficient algorithms for matrix multiplication than the one you're using.
Second, your CPU can do much more than one instruction at a time.
Your CPU executes 3-4 instructions per cycle, and if the SIMD units are used, each instruction processes 4 floats or 2 doubles. (Of course this figure isn't accurate either, as the CPU can typically only process one SIMD instruction per cycle.)
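As a rough sanity check on the numbers from the question: 10 trials of 10^9 multiplications in 1.32 seconds is about 7.6*10^9 multiplications per second, i.e. roughly 2.9 per cycle at 2.6GHz. A single core issuing one 4-wide single-precision SSE multiply per cycle already has a peak of 10.4*10^9 multiplies per second, so the BLAS timing is plausible even before counting multiple cores.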
Third, your code is far from optimal:
- You're using raw pointers, which means that the compiler has to assume they may alias. There are compiler-specific keywords or flags you can specify to tell the compiler that they don't alias. Alternatively, you can use types other than raw pointers, which take care of the problem.
- You're thrashing the cache by performing a naive traversal of each row/column of the input matrices. You can use blocking to perform as much work as possible on a smaller block of the matrix, one that fits in the CPU cache, before moving on to the next block. (A sketch combining this and the aliasing fix follows this list.)
- For purely numerical tasks, Fortran is pretty much unbeatable, and C++ takes a lot of coaxing to get up to a similar speed. It can be done, and there are a few libraries demonstrating it (typically using expression templates), but it's not trivial, and it doesn't just happen.
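Here is a sketch combining the first two fixes on the question's column-major layout: restrict-qualified pointers plus one level of cache blocking (__restrict is a common compiler extension rather than standard C++, and the block size of 64 is a placeholder that would need tuning per machine):

#include <algorithm>
#include <cstring>

const int BS = 64;  // cache block edge; a placeholder, tune per CPU

// C = A*B with A (m x k), B (k x n), C (m x n), all column-major.
void mmult_blocked(const float* __restrict A, int m, int k,
                   const float* __restrict B, int n,
                   float* __restrict C)
{
    std::memset(C, 0, sizeof(float) * m * n);
    for (int jj = 0; jj < n; jj += BS)          // block over columns of C and B
        for (int ll = 0; ll < k; ll += BS)      // block over the inner dimension
            for (int j = jj; j < std::min(jj + BS, n); ++j)
                for (int l = ll; l < std::min(ll + BS, k); ++l) {
                    float b = B[j * k + l];     // B(l,j), loaded once
                    for (int i = 0; i < m; ++i) // contiguous run down column l of A
                        C[j * m + i] += A[l * m + i] * b;
                }
}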
Answered by softveda
I don't know specifically about the BLAS implementation, but there are more efficient algorithms for matrix multiplication with better than O(n^3) complexity. A well-known one is the Strassen algorithm.
Answered by Wolfgang Jansen
Most arguments to the second question (assembler, splitting into blocks, etc., but not the less-than-N^3 algorithms; they are really overdeveloped) play a role. But the low speed of your algorithm is caused essentially by the matrix size and the unfortunate arrangement of the three nested loops. Your matrices are so large that they do not fit at once in cache memory. You can rearrange the loops such that as much as possible is done on a row in cache, dramatically reducing cache refreshes (BTW, splitting into small blocks has an analogous effect, best if the loops over the blocks are arranged similarly). A model implementation for square matrices follows. On my computer its time consumption was about 1:10 compared to the standard implementation (like yours). In other words: never program a matrix multiplication along the "row times column" scheme that we learned in school. After having rearranged the loops, more improvements are obtained by unrolling loops, assembler code, etc.
void vector(int m, double ** a, double ** b, double ** c) {
    int i, j, k;
    for (i=0; i<m; i++) {
        double * ci = c[i];              /* row i of the result */
        for (k=0; k<m; k++) ci[k] = 0.;
        for (j=0; j<m; j++) {
            double aij = a[i][j];        /* one element of a, reused m times */
            double * bj = b[j];          /* row j of b, scanned contiguously */
            for (k=0; k<m; k++) ci[k] += aij*bj[k];
        }
    }
}
One more remark: on my computer this implementation is even better than replacing everything with the BLAS routine cblas_dgemm (try it on your computer!). But calling dgemm_ of the Fortran library directly is much faster (1:4). I think this routine is in fact not Fortran but assembler code (I do not know what is in the library; I don't have the sources). It is totally unclear to me why cblas_dgemm is not as fast, since to my knowledge it is merely a wrapper for dgemm_.
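For reference, calling the Fortran routine directly from C++ looks roughly like this. The prototype below is the conventional one, but the exact symbol name (e.g. the trailing underscore) and possible hidden string-length arguments vary with compiler and ABI, so treat it as a sketch:

// Conventional C prototype for Fortran DGEMM: every argument is passed by pointer.
extern "C" void dgemm_(const char* transa, const char* transb,
                       const int* m, const int* n, const int* k,
                       const double* alpha, const double* A, const int* lda,
                       const double* B, const int* ldb,
                       const double* beta, double* C, const int* ldc);

// C = 1.0*A*B + 0.0*C with A (m x k), B (k x n), C (m x n), all column-major.
void call_dgemm(int m, int n, int k, const double* A, const double* B, double* C)
{
    const double one = 1.0, zero = 0.0;
    dgemm_("N", "N", &m, &n, &k, &one, A, &m, B, &k, &zero, C, &m);
}

Link against the BLAS you want to test (e.g. -lblas, or an optimized implementation) and you can compare the Fortran entry point with the cblas wrapper directly.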
Answered by Justicle
This is a realistic speedup. For an example of what can be done with SIMD assembler over C++ code, see these example iPhone matrix functions; they were over 8x faster than the C version, and aren't even "optimized" assembly: there is no pipelining yet, and there are unnecessary stack operations.
Also, your code is not "restrict correct": how does the compiler know that when it modifies C, it isn't modifying A and B?
Answered by Pari Rajaram
With respect to the original code in the MM multiply, memory references for most operations are the main cause of bad performance. Memory runs 100-1000 times slower than cache.
Most of the speedup comes from employing loop optimization techniques on this triple-loop function in the MM multiply. Two main loop optimization techniques are used: unrolling and blocking. With respect to unrolling, we unroll the two outermost loops and block them for data reuse in cache. Outer loop unrolling helps optimize data access temporally by reducing the number of memory references to the same data at different times during the entire operation. Blocking the loop index at a specific size helps with retaining the data in cache. You can choose to optimize for the L2 cache or the L3 cache; a sketch of the unrolling idea follows.
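As an illustration of the unrolling part, here is a sketch that unrolls the two outermost loops by a factor of 2 (a 2x2 register block, assuming row-major storage and even m and n for brevity), so every value loaded from A or B is used twice:

// C += A*B, row-major; 2x2 unrolling of the two outer loops.
// Four accumulators live in registers; each loaded a/b value is used twice,
// halving the number of memory references per multiply-add.
void gemm_unrolled_2x2(int m, int n, int k,
                       const double* A, const double* B, double* C) {
    for (int i = 0; i < m; i += 2)
        for (int j = 0; j < n; j += 2) {
            double c00 = 0, c01 = 0, c10 = 0, c11 = 0;
            for (int l = 0; l < k; ++l) {
                double a0 = A[i * k + l],  a1 = A[(i + 1) * k + l];
                double b0 = B[l * n + j],  b1 = B[l * n + j + 1];
                c00 += a0 * b0;  c01 += a0 * b1;
                c10 += a1 * b0;  c11 += a1 * b1;
            }
            C[i * n + j]           += c00;  C[i * n + j + 1]       += c01;
            C[(i + 1) * n + j]     += c10;  C[(i + 1) * n + j + 1] += c11;
        }
}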
Answered by Stefano Borini
For many reasons.
First, Fortran compilers are highly optimized, and the language allows them to be. C and C++ are very loose in terms of array handling (e.g. the case of pointers referring to the same memory area). This means that the compiler cannot know in advance what to do, and is forced to create generic code. In Fortran, your cases are more streamlined, and the compiler has better control of what happens, allowing it to optimize more (e.g. using registers).
Another thing is that Fortran stores data column-wise, while C stores data row-wise. I haven't checked your code, but be careful of how you perform the product. In C you must scan row-wise: this way you scan your array along contiguous memory, reducing cache misses. A cache miss is the first source of inefficiency.
Third, it depends on the BLAS implementation you are using. Some implementations might be written in assembler and optimized for the specific processor you are using. The netlib version is written in Fortran 77.
Also, you are doing a lot of operations, most of them repeated and redundant. All those multiplications to obtain the index are detrimental to performance. I don't really know how this is done in BLAS, but there are a lot of tricks to prevent expensive operations.
For example, you could rework your code this way:
template<class ValT>
void mmult(const ValT* A, int ADim1, int ADim2,
           const ValT* B, int BDim1, int BDim2, ValT* C)
{
    if ( ADim2!=BDim1 ) throw std::runtime_error("Error sizes off");
    memset((void*)C, 0, sizeof(ValT)*ADim1*BDim2);
    int cc2, cc1, cr1, a1, a2, a3;
    for ( cc2=0 ; cc2<BDim2 ; ++cc2 ) {
        a1 = cc2*ADim1;          // start of column cc2 of C, hoisted out of the inner loops
        a3 = cc2*BDim1;          // start of column cc2 of B
        for ( cc1=0 ; cc1<ADim2 ; ++cc1 ) {
            a2 = cc1*ADim1;      // start of column cc1 of A
            ValT b = B[a3+cc1];  // B(cc1, cc2), loaded once per column pass
            for ( cr1=0 ; cr1<ADim1 ; ++cr1 ) {
                C[a1+cr1] += A[a2+cr1]*b;
            }
        }
    }
}
Try it, I am sure you will save something.
On your #1 question, the reason is that matrix multiplication scales as O(n^3) if you use a trivial algorithm. There are algorithms that scale much better.