How do I achieve the theoretical maximum of 4 FLOPs per cycle?
Asked by user1059432

Original question: http://stackoverflow.com/questions/8389648/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must cite the original address and attribute it to the original authors (not me): StackOverflow.
How can the theoretical peak performance of 4 floating point operations (double precision) per cycle be achieved on a modern x86-64 Intel CPU?
As far as I understand it, an SSE add takes three cycles and a mul takes five cycles to complete on most modern Intel CPUs (see for example Agner Fog's 'Instruction Tables'). Due to pipelining, one can get a throughput of one add per cycle if the algorithm has at least three independent summations. Since that is true for the packed addpd as well as the scalar addsd version, and SSE registers can hold two doubles, the throughput can be as much as two flops per cycle.
Furthermore, it seems (although I've not seen any proper documentation on this) that adds and muls can be executed in parallel, giving a theoretical max throughput of four flops per cycle.
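As a back-of-the-envelope sketch (mine, not part of the original question, using the 2.66 GHz Core i5-750 that appears below), those figures translate into the following peaks:

#include <cstdio>

// Rough peaks implied by the reasoning above: one add and one mul issued per
// cycle, scalar vs. packed. The 2.66 GHz clock is the Core i5-750 used below.
int main() {
    const double ghz = 2.66;
    std::printf("scalar (addsd + mulsd): ~%.1f GFLOPS\n", ghz * (1 + 1)); // ~5.3
    std::printf("packed (addpd + mulpd): ~%.1f GFLOPS\n", ghz * (2 + 2)); // ~10.6
}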
However, I've not been able to replicate that performance with a simple C/C++ programme. My best attempt resulted in about 2.7 flops/cycle. If anyone can contribute a simple C/C++ or assembler programme which demonstrates peak performance that'd be greatly appreciated.
My attempt:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <sys/time.h>
double stoptime(void) {
    struct timeval t;
    gettimeofday(&t,NULL);
    return (double) t.tv_sec + t.tv_usec/1000000.0;
}

double addmul(double add, double mul, int ops){
    // Need to initialise differently otherwise compiler might optimise away
    double sum1=0.1, sum2=-0.1, sum3=0.2, sum4=-0.2, sum5=0.0;
    double mul1=1.0, mul2= 1.1, mul3=1.2, mul4= 1.3, mul5=1.4;
    int loops=ops/10;   // We have 10 floating point operations inside the loop
    double expected = 5.0*add*loops + (sum1+sum2+sum3+sum4+sum5)
                      + pow(mul,loops)*(mul1+mul2+mul3+mul4+mul5);

    for (int i=0; i<loops; i++) {
        mul1*=mul; mul2*=mul; mul3*=mul; mul4*=mul; mul5*=mul;
        sum1+=add; sum2+=add; sum3+=add; sum4+=add; sum5+=add;
    }
    return sum1+sum2+sum3+sum4+sum5+mul1+mul2+mul3+mul4+mul5 - expected;
}

int main(int argc, char** argv) {
    if (argc != 2) {
        printf("usage: %s <num>\n", argv[0]);
        printf("number of operations: <num> millions\n");
        exit(EXIT_FAILURE);
    }
    int n = atoi(argv[1]) * 1000000;
    if (n<=0)
        n=1000;

    double x = M_PI;
    double y = 1.0 + 1e-8;
    double t = stoptime();
    x = addmul(x, y, n);
    t = stoptime() - t;
    printf("addmul:\t %.3f s, %.3f Gflops, res=%f\n", t, (double)n/t/1e9, x);
    return EXIT_SUCCESS;
}
Compiled with
g++ -O2 -march=native addmul.cpp ; ./a.out 1000
produces the following output on an Intel Core i5-750, 2.66 GHz.
addmul: 0.270 s, 3.707 Gflops, res=1.326463
That is, just about 1.4 flops per cycle. Looking at the assembler code with
g++ -S -O2 -march=native -masm=intel addmul.cpp
the main loop seems kind of optimal to me:
.L4:
inc eax
mulsd xmm8, xmm3
mulsd xmm7, xmm3
mulsd xmm6, xmm3
mulsd xmm5, xmm3
mulsd xmm1, xmm3
addsd xmm13, xmm2
addsd xmm12, xmm2
addsd xmm11, xmm2
addsd xmm10, xmm2
addsd xmm9, xmm2
cmp eax, ebx
jne .L4
Changing the scalar versions to the packed versions (addpd and mulpd) would double the flop count without changing the execution time, and so I'd get just short of 2.8 flops per cycle. Is there a simple example which achieves four flops per cycle?
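For concreteness, here is a minimal sketch (my own, not from the original post) of that packed conversion using SSE2 intrinsics. Names and starting values are illustrative only, and the expected-value bookkeeping of the original addmul() is omitted; as noted above, on this CPU it would still top out just short of 2.8 flops per cycle.

#include <emmintrin.h>

// Packed variant of the scalar loop: each __m128d holds two doubles, so every
// addpd/mulpd counts as two flops.
double addmul_packed(double add, double mul, int ops) {
    __m128d sum1 = _mm_set_pd(0.1, -0.1), sum2 = _mm_set_pd(0.2, -0.2),
            sum3 = _mm_set_pd(0.3, -0.3), sum4 = _mm_set_pd(0.4, -0.4),
            sum5 = _mm_set_pd(0.5, -0.5);
    __m128d mul1 = _mm_set_pd(1.00, 1.01), mul2 = _mm_set_pd(1.02, 1.03),
            mul3 = _mm_set_pd(1.04, 1.05), mul4 = _mm_set_pd(1.06, 1.07),
            mul5 = _mm_set_pd(1.08, 1.09);
    __m128d vadd = _mm_set1_pd(add), vmul = _mm_set1_pd(mul);
    int loops = ops / 20;   // 10 packed operations * 2 doubles per iteration
    for (int i = 0; i < loops; i++) {
        mul1 = _mm_mul_pd(mul1, vmul); sum1 = _mm_add_pd(sum1, vadd);
        mul2 = _mm_mul_pd(mul2, vmul); sum2 = _mm_add_pd(sum2, vadd);
        mul3 = _mm_mul_pd(mul3, vmul); sum3 = _mm_add_pd(sum3, vadd);
        mul4 = _mm_mul_pd(mul4, vmul); sum4 = _mm_add_pd(sum4, vadd);
        mul5 = _mm_mul_pd(mul5, vmul); sum5 = _mm_add_pd(sum5, vadd);
    }
    // Reduce everything to a single scalar so the work can't be optimised away.
    __m128d r = _mm_add_pd(_mm_add_pd(sum1, sum2), _mm_add_pd(sum3, sum4));
    r = _mm_add_pd(r, _mm_add_pd(mul1, mul2));
    r = _mm_add_pd(r, _mm_add_pd(mul3, mul4));
    r = _mm_add_pd(r, _mm_add_pd(sum5, mul5));
    double out[2];
    _mm_storeu_pd(out, r);
    return out[0] + out[1];
}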
Nice little programme by Mysticial; here are my results (run just for a few seconds though):
gcc -O2 -march=nocona: 5.6 Gflops out of 10.66 Gflops (2.1 flops/cycle)
cl /O2, openmp removed: 10.1 Gflops out of 10.66 Gflops (3.8 flops/cycle)
It all seems a bit complex, but my conclusions so far:
- gcc -O2 changes the order of independent floating point operations with the aim of alternating addpd and mulpd's if possible. The same applies to gcc-4.6.2 -O2 -march=core2.
- gcc -O2 -march=nocona seems to keep the order of floating point operations as defined in the C++ source.
- cl /O2, the 64-bit compiler from the SDK for Windows 7, does loop-unrolling automatically and seems to try and arrange operations so that groups of three addpd's alternate with three mulpd's (well, at least on my system and for my simple programme).
- My Core i5 750 (Nehalem architecture) doesn't like alternating add's and mul's and seems unable to run both operations in parallel. However, if grouped in 3's it suddenly works like magic.
- Other architectures (possibly Sandy Bridge and others) appear to be able to execute add/mul in parallel without problems if they alternate in the assembly code.
- Although it's difficult to admit, on my system cl /O2 does a much better job at low-level optimising operations and achieves close to peak performance for the little C++ example above. I measured between 1.85-2.01 flops/cycle (I used clock() in Windows, which is not that precise. I guess I need to use a better timer - thanks Mackie Messer).
- The best I managed with gcc was to manually unroll the loop and arrange additions and multiplications in groups of three. With g++ -O2 -march=nocona addmul_unroll.cpp I get at best 0.207s, 4.825 Gflops, which corresponds to 1.8 flops/cycle, which I'm quite happy with now.
In the C++ code I've replaced the for loop with:
for (int i=0; i<loops/3; i++) {
    mul1*=mul; mul2*=mul; mul3*=mul;
    sum1+=add; sum2+=add; sum3+=add;
    mul4*=mul; mul5*=mul; mul1*=mul;
    sum4+=add; sum5+=add; sum1+=add;
    mul2*=mul; mul3*=mul; mul4*=mul;
    sum2+=add; sum3+=add; sum4+=add;
    mul5*=mul; mul1*=mul; mul2*=mul;
    sum5+=add; sum1+=add; sum2+=add;
    mul3*=mul; mul4*=mul; mul5*=mul;
    sum3+=add; sum4+=add; sum5+=add;
}
And the assembly now looks like
.L4:
mulsd xmm8, xmm3
mulsd xmm7, xmm3
mulsd xmm6, xmm3
addsd xmm13, xmm2
addsd xmm12, xmm2
addsd xmm11, xmm2
mulsd xmm5, xmm3
mulsd xmm1, xmm3
mulsd xmm8, xmm3
addsd xmm10, xmm2
addsd xmm9, xmm2
addsd xmm13, xmm2
...
Accepted answer by Mysticial
I've done this exact task before. But it was mainly to measure power consumption and CPU temperatures. The following code (which is fairly long) achieves close to optimal on my Core i7 2600K.
The key thing to note here is the massive amount of manual loop-unrolling as well as interleaving of multiplies and adds...
The full project can be found on my GitHub: https://github.com/Mysticial/Flops
Warning:
If you decide to compile and run this, pay attention to your CPU temperatures!!!
Make sure you don't overheat it. And make sure CPU-throttling doesn't affect your results!
Furthermore, I take no responsibility for whatever damage that may result from running this code.
Notes:
- This code is optimized for x64. x86 doesn't have enough registers for this to compile well.
- This code has been tested to work well on Visual Studio 2010/2012 and GCC 4.6. ICC 11 (Intel Compiler 11) surprisingly has trouble compiling it well.
- These are for pre-FMA processors. In order to achieve peak FLOPS on Intel Haswell and AMD Bulldozer processors (and later), FMA (Fused Multiply Add) instructions will be needed. These are beyond the scope of this benchmark (a tiny FMA sketch follows these notes for reference).
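For reference only (as the note above says, this is outside the scope of the benchmark below), on an FMA-capable CPU the corresponding kernel operation would be a fused multiply-add. A minimal sketch, assuming FMA3 intrinsics and a compiler flag such as -mfma (GCC) or /arch:AVX2 (MSVC):

#include <immintrin.h>

// One 256-bit FMA counts as 8 double-precision flops (4 lanes * 2 flops each).
static __m256d fma_step(__m256d acc, __m256d x, __m256d y) {
    return _mm256_fmadd_pd(x, y, acc);   // acc = x*y + acc
}

The SSE version of the benchmark follows.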
#include <emmintrin.h>
#include <omp.h>
#include <iostream>
using namespace std;
typedef unsigned long long uint64;
double test_dp_mac_SSE(double x,double y,uint64 iterations){
register __m128d r0,r1,r2,r3,r4,r5,r6,r7,r8,r9,rA,rB,rC,rD,rE,rF;
// Generate starting data.
r0 = _mm_set1_pd(x);
r1 = _mm_set1_pd(y);
r8 = _mm_set1_pd(-0.0);
r2 = _mm_xor_pd(r0,r8);
r3 = _mm_or_pd(r0,r8);
r4 = _mm_andnot_pd(r8,r0);
r5 = _mm_mul_pd(r1,_mm_set1_pd(0.37796447300922722721));
r6 = _mm_mul_pd(r1,_mm_set1_pd(0.24253562503633297352));
r7 = _mm_mul_pd(r1,_mm_set1_pd(4.1231056256176605498));
r8 = _mm_add_pd(r0,_mm_set1_pd(0.37796447300922722721));
r9 = _mm_add_pd(r1,_mm_set1_pd(0.24253562503633297352));
rA = _mm_sub_pd(r0,_mm_set1_pd(4.1231056256176605498));
rB = _mm_sub_pd(r1,_mm_set1_pd(4.1231056256176605498));
rC = _mm_set1_pd(1.4142135623730950488);
rD = _mm_set1_pd(1.7320508075688772935);
rE = _mm_set1_pd(0.57735026918962576451);
rF = _mm_set1_pd(0.70710678118654752440);
uint64 iMASK = 0x800fffffffffffffull;
__m128d MASK = _mm_set1_pd(*(double*)&iMASK);
__m128d vONE = _mm_set1_pd(1.0);
uint64 c = 0;
while (c < iterations){
size_t i = 0;
while (i < 1000){
// Here's the meat - the part that really matters.
r0 = _mm_mul_pd(r0,rC);
r1 = _mm_add_pd(r1,rD);
r2 = _mm_mul_pd(r2,rE);
r3 = _mm_sub_pd(r3,rF);
r4 = _mm_mul_pd(r4,rC);
r5 = _mm_add_pd(r5,rD);
r6 = _mm_mul_pd(r6,rE);
r7 = _mm_sub_pd(r7,rF);
r8 = _mm_mul_pd(r8,rC);
r9 = _mm_add_pd(r9,rD);
rA = _mm_mul_pd(rA,rE);
rB = _mm_sub_pd(rB,rF);
r0 = _mm_add_pd(r0,rF);
r1 = _mm_mul_pd(r1,rE);
r2 = _mm_sub_pd(r2,rD);
r3 = _mm_mul_pd(r3,rC);
r4 = _mm_add_pd(r4,rF);
r5 = _mm_mul_pd(r5,rE);
r6 = _mm_sub_pd(r6,rD);
r7 = _mm_mul_pd(r7,rC);
r8 = _mm_add_pd(r8,rF);
r9 = _mm_mul_pd(r9,rE);
rA = _mm_sub_pd(rA,rD);
rB = _mm_mul_pd(rB,rC);
r0 = _mm_mul_pd(r0,rC);
r1 = _mm_add_pd(r1,rD);
r2 = _mm_mul_pd(r2,rE);
r3 = _mm_sub_pd(r3,rF);
r4 = _mm_mul_pd(r4,rC);
r5 = _mm_add_pd(r5,rD);
r6 = _mm_mul_pd(r6,rE);
r7 = _mm_sub_pd(r7,rF);
r8 = _mm_mul_pd(r8,rC);
r9 = _mm_add_pd(r9,rD);
rA = _mm_mul_pd(rA,rE);
rB = _mm_sub_pd(rB,rF);
r0 = _mm_add_pd(r0,rF);
r1 = _mm_mul_pd(r1,rE);
r2 = _mm_sub_pd(r2,rD);
r3 = _mm_mul_pd(r3,rC);
r4 = _mm_add_pd(r4,rF);
r5 = _mm_mul_pd(r5,rE);
r6 = _mm_sub_pd(r6,rD);
r7 = _mm_mul_pd(r7,rC);
r8 = _mm_add_pd(r8,rF);
r9 = _mm_mul_pd(r9,rE);
rA = _mm_sub_pd(rA,rD);
rB = _mm_mul_pd(rB,rC);
i++;
}
// Need to renormalize to prevent denormal/overflow.
r0 = _mm_and_pd(r0,MASK);
r1 = _mm_and_pd(r1,MASK);
r2 = _mm_and_pd(r2,MASK);
r3 = _mm_and_pd(r3,MASK);
r4 = _mm_and_pd(r4,MASK);
r5 = _mm_and_pd(r5,MASK);
r6 = _mm_and_pd(r6,MASK);
r7 = _mm_and_pd(r7,MASK);
r8 = _mm_and_pd(r8,MASK);
r9 = _mm_and_pd(r9,MASK);
rA = _mm_and_pd(rA,MASK);
rB = _mm_and_pd(rB,MASK);
r0 = _mm_or_pd(r0,vONE);
r1 = _mm_or_pd(r1,vONE);
r2 = _mm_or_pd(r2,vONE);
r3 = _mm_or_pd(r3,vONE);
r4 = _mm_or_pd(r4,vONE);
r5 = _mm_or_pd(r5,vONE);
r6 = _mm_or_pd(r6,vONE);
r7 = _mm_or_pd(r7,vONE);
r8 = _mm_or_pd(r8,vONE);
r9 = _mm_or_pd(r9,vONE);
rA = _mm_or_pd(rA,vONE);
rB = _mm_or_pd(rB,vONE);
c++;
}
r0 = _mm_add_pd(r0,r1);
r2 = _mm_add_pd(r2,r3);
r4 = _mm_add_pd(r4,r5);
r6 = _mm_add_pd(r6,r7);
r8 = _mm_add_pd(r8,r9);
rA = _mm_add_pd(rA,rB);
r0 = _mm_add_pd(r0,r2);
r4 = _mm_add_pd(r4,r6);
r8 = _mm_add_pd(r8,rA);
r0 = _mm_add_pd(r0,r4);
r0 = _mm_add_pd(r0,r8);
// Prevent Dead Code Elimination
double out = 0;
__m128d temp = r0;
out += ((double*)&temp)[0];
out += ((double*)&temp)[1];
return out;
}
void test_dp_mac_SSE(int tds,uint64 iterations){
double *sum = (double*)malloc(tds * sizeof(double));
double start = omp_get_wtime();
#pragma omp parallel num_threads(tds)
{
double ret = test_dp_mac_SSE(1.1,2.1,iterations);
sum[omp_get_thread_num()] = ret;
}
double secs = omp_get_wtime() - start;
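// FLOP count: 48 SSE instructions per inner-loop body * 1000 inner iterations
// * 'iterations' outer passes * 'tds' threads; each __m128d op covers 2 doubles.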
uint64 ops = 48 * 1000 * iterations * tds * 2;
cout << "Seconds = " << secs << endl;
cout << "FP Ops = " << ops << endl;
cout << "FLOPs = " << ops / secs << endl;
double out = 0;
int c = 0;
while (c < tds){
out += sum[c++];
}
cout << "sum = " << out << endl;
cout << endl;
free(sum);
}
int main(){
// (threads, iterations)
test_dp_mac_SSE(8,10000000);
system("pause");
}
Output (1 thread, 10000000 iterations) - Compiled with Visual Studio 2010 SP1 - x64 Release:
Seconds = 55.5104
FP Ops = 960000000000
FLOPs = 1.7294e+010
sum = 2.22652
The machine is a Core i7 2600K @ 4.4 GHz. Theoretical SSE peak is 4 flops * 4.4 GHz = 17.6 GFlops. This code achieves 17.3 GFlops - not bad.
Output (8 threads, 10000000 iterations) - Compiled with Visual Studio 2010 SP1 - x64 Release:
Seconds = 117.202
FP Ops = 7680000000000
FLOPs = 6.55279e+010
sum = 17.8122
Theoretical SSE peak is 4 flops * 4 cores * 4.4 GHz = 70.4 GFlops. Actual is 65.5 GFlops.
Let's take this one step further. AVX...
#include <immintrin.h>
#include <omp.h>
#include <iostream>
using namespace std;
typedef unsigned long long uint64;
double test_dp_mac_AVX(double x,double y,uint64 iterations){
register __m256d r0,r1,r2,r3,r4,r5,r6,r7,r8,r9,rA,rB,rC,rD,rE,rF;
// Generate starting data.
r0 = _mm256_set1_pd(x);
r1 = _mm256_set1_pd(y);
r8 = _mm256_set1_pd(-0.0);
r2 = _mm256_xor_pd(r0,r8);
r3 = _mm256_or_pd(r0,r8);
r4 = _mm256_andnot_pd(r8,r0);
r5 = _mm256_mul_pd(r1,_mm256_set1_pd(0.37796447300922722721));
r6 = _mm256_mul_pd(r1,_mm256_set1_pd(0.24253562503633297352));
r7 = _mm256_mul_pd(r1,_mm256_set1_pd(4.1231056256176605498));
r8 = _mm256_add_pd(r0,_mm256_set1_pd(0.37796447300922722721));
r9 = _mm256_add_pd(r1,_mm256_set1_pd(0.24253562503633297352));
rA = _mm256_sub_pd(r0,_mm256_set1_pd(4.1231056256176605498));
rB = _mm256_sub_pd(r1,_mm256_set1_pd(4.1231056256176605498));
rC = _mm256_set1_pd(1.4142135623730950488);
rD = _mm256_set1_pd(1.7320508075688772935);
rE = _mm256_set1_pd(0.57735026918962576451);
rF = _mm256_set1_pd(0.70710678118654752440);
uint64 iMASK = 0x800fffffffffffffull;
__m256d MASK = _mm256_set1_pd(*(double*)&iMASK);
__m256d vONE = _mm256_set1_pd(1.0);
uint64 c = 0;
while (c < iterations){
size_t i = 0;
while (i < 1000){
// Here's the meat - the part that really matters.
r0 = _mm256_mul_pd(r0,rC);
r1 = _mm256_add_pd(r1,rD);
r2 = _mm256_mul_pd(r2,rE);
r3 = _mm256_sub_pd(r3,rF);
r4 = _mm256_mul_pd(r4,rC);
r5 = _mm256_add_pd(r5,rD);
r6 = _mm256_mul_pd(r6,rE);
r7 = _mm256_sub_pd(r7,rF);
r8 = _mm256_mul_pd(r8,rC);
r9 = _mm256_add_pd(r9,rD);
rA = _mm256_mul_pd(rA,rE);
rB = _mm256_sub_pd(rB,rF);
r0 = _mm256_add_pd(r0,rF);
r1 = _mm256_mul_pd(r1,rE);
r2 = _mm256_sub_pd(r2,rD);
r3 = _mm256_mul_pd(r3,rC);
r4 = _mm256_add_pd(r4,rF);
r5 = _mm256_mul_pd(r5,rE);
r6 = _mm256_sub_pd(r6,rD);
r7 = _mm256_mul_pd(r7,rC);
r8 = _mm256_add_pd(r8,rF);
r9 = _mm256_mul_pd(r9,rE);
rA = _mm256_sub_pd(rA,rD);
rB = _mm256_mul_pd(rB,rC);
r0 = _mm256_mul_pd(r0,rC);
r1 = _mm256_add_pd(r1,rD);
r2 = _mm256_mul_pd(r2,rE);
r3 = _mm256_sub_pd(r3,rF);
r4 = _mm256_mul_pd(r4,rC);
r5 = _mm256_add_pd(r5,rD);
r6 = _mm256_mul_pd(r6,rE);
r7 = _mm256_sub_pd(r7,rF);
r8 = _mm256_mul_pd(r8,rC);
r9 = _mm256_add_pd(r9,rD);
rA = _mm256_mul_pd(rA,rE);
rB = _mm256_sub_pd(rB,rF);
r0 = _mm256_add_pd(r0,rF);
r1 = _mm256_mul_pd(r1,rE);
r2 = _mm256_sub_pd(r2,rD);
r3 = _mm256_mul_pd(r3,rC);
r4 = _mm256_add_pd(r4,rF);
r5 = _mm256_mul_pd(r5,rE);
r6 = _mm256_sub_pd(r6,rD);
r7 = _mm256_mul_pd(r7,rC);
r8 = _mm256_add_pd(r8,rF);
r9 = _mm256_mul_pd(r9,rE);
rA = _mm256_sub_pd(rA,rD);
rB = _mm256_mul_pd(rB,rC);
i++;
}
// Need to renormalize to prevent denormal/overflow.
r0 = _mm256_and_pd(r0,MASK);
r1 = _mm256_and_pd(r1,MASK);
r2 = _mm256_and_pd(r2,MASK);
r3 = _mm256_and_pd(r3,MASK);
r4 = _mm256_and_pd(r4,MASK);
r5 = _mm256_and_pd(r5,MASK);
r6 = _mm256_and_pd(r6,MASK);
r7 = _mm256_and_pd(r7,MASK);
r8 = _mm256_and_pd(r8,MASK);
r9 = _mm256_and_pd(r9,MASK);
rA = _mm256_and_pd(rA,MASK);
rB = _mm256_and_pd(rB,MASK);
r0 = _mm256_or_pd(r0,vONE);
r1 = _mm256_or_pd(r1,vONE);
r2 = _mm256_or_pd(r2,vONE);
r3 = _mm256_or_pd(r3,vONE);
r4 = _mm256_or_pd(r4,vONE);
r5 = _mm256_or_pd(r5,vONE);
r6 = _mm256_or_pd(r6,vONE);
r7 = _mm256_or_pd(r7,vONE);
r8 = _mm256_or_pd(r8,vONE);
r9 = _mm256_or_pd(r9,vONE);
rA = _mm256_or_pd(rA,vONE);
rB = _mm256_or_pd(rB,vONE);
c++;
}
r0 = _mm256_add_pd(r0,r1);
r2 = _mm256_add_pd(r2,r3);
r4 = _mm256_add_pd(r4,r5);
r6 = _mm256_add_pd(r6,r7);
r8 = _mm256_add_pd(r8,r9);
rA = _mm256_add_pd(rA,rB);
r0 = _mm256_add_pd(r0,r2);
r4 = _mm256_add_pd(r4,r6);
r8 = _mm256_add_pd(r8,rA);
r0 = _mm256_add_pd(r0,r4);
r0 = _mm256_add_pd(r0,r8);
// Prevent Dead Code Elimination
double out = 0;
__m256d temp = r0;
out += ((double*)&temp)[0];
out += ((double*)&temp)[1];
out += ((double*)&temp)[2];
out += ((double*)&temp)[3];
return out;
}
void test_dp_mac_AVX(int tds,uint64 iterations){
double *sum = (double*)malloc(tds * sizeof(double));
double start = omp_get_wtime();
#pragma omp parallel num_threads(tds)
{
double ret = test_dp_mac_AVX(1.1,2.1,iterations);
sum[omp_get_thread_num()] = ret;
}
double secs = omp_get_wtime() - start;
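// FLOP count: 48 AVX instructions per inner-loop body * 1000 inner iterations
// * 'iterations' outer passes * 'tds' threads; each __m256d op covers 4 doubles.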
uint64 ops = 48 * 1000 * iterations * tds * 4;
cout << "Seconds = " << secs << endl;
cout << "FP Ops = " << ops << endl;
cout << "FLOPs = " << ops / secs << endl;
double out = 0;
int c = 0;
while (c < tds){
out += sum[c++];
}
cout << "sum = " << out << endl;
cout << endl;
free(sum);
}
int main(){
// (threads, iterations)
test_dp_mac_AVX(8,10000000);
system("pause");
}
Output (1 thread, 10000000 iterations) - Compiled with Visual Studio 2010 SP1 - x64 Release:
Seconds = 57.4679
FP Ops = 1920000000000
FLOPs = 3.34099e+010
sum = 4.45305
Theoretical AVX peak is 8 flops * 4.4 GHz = 35.2 GFlops. Actual is 33.4 GFlops.
Output (8 threads, 10000000 iterations) - Compiled with Visual Studio 2010 SP1 - x64 Release:
Seconds = 111.119
FP Ops = 15360000000000
FLOPs = 1.3823e+011
sum = 35.6244
Theoretical AVX peak is 8 flops * 4 cores * 4.4 GHz = 140.8 GFlops. Actual is 138.2 GFlops.
Now for some explanations:
The performance critical part is obviously the 48 instructions inside the inner loop. You'll notice that it's broken into 4 blocks of 12 instructions each. Each of these 12-instruction blocks is completely independent from the others - and takes on average 6 cycles to execute.
So there's 12 instructions and 6 cycles between issue-to-use. The latency of multiplication is 5 cycles, so it's just enough to avoid latency stalls.
The normalization step is needed to keep the data from over/underflowing. This is needed since the do-nothing code will slowly increase/decrease the magnitude of the data.
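To make that bit trick concrete, here is a small scalar illustration of what the MASK/vONE step above does (my own sketch, not part of the benchmark): the AND keeps the sign and mantissa but clears the exponent, and the OR with the bits of 1.0 forces the magnitude back into [1, 2).

#include <cstdint>
#include <cstdio>
#include <cstring>

// Scalar equivalent of the _mm_and_pd / _mm_or_pd renormalization above.
static double renormalize(double x) {
    const uint64_t kMask = 0x800fffffffffffffULL;  // sign + mantissa, exponent cleared
    const double   kOne  = 1.0;                    // provides the exponent bits of 1.0
    uint64_t bits, one_bits;
    std::memcpy(&bits, &x, sizeof bits);
    std::memcpy(&one_bits, &kOne, sizeof one_bits);
    bits = (bits & kMask) | one_bits;              // magnitude now in [1, 2), sign kept
    std::memcpy(&x, &bits, sizeof x);
    return x;
}

int main() {
    std::printf("%g -> %g\n",  1e300, renormalize(1e300));    // huge value pulled back
    std::printf("%g -> %g\n", -3e-200, renormalize(-3e-200)); // tiny value, sign kept
}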
So it's actually possible to do better than this if you just use all zeros and get rid of the normalization step. However, since I wrote the benchmark to measure power consumption and temperature, I had to make sure the flops were on "real" data, rather than zeros - as the execution units may very well have special case-handling for zeros that use less power and produce less heat.
More Results:
- Intel Core i7 920 @ 3.5 GHz
- Windows 7 Ultimate x64
- Visual Studio 2010 SP1 - x64 Release
Threads: 1
Seconds = 72.1116
FP Ops = 960000000000
FLOPs = 1.33127e+010
sum = 2.22652
Theoretical SSE Peak: 4 flops * 3.5 GHz = 14.0 GFlops. Actual is 13.3 GFlops.
Threads: 8
Seconds = 149.576
FP Ops = 7680000000000
FLOPs = 5.13452e+010
sum = 17.8122
Theoretical SSE Peak: 4 flops * 4 cores * 3.5 GHz = 56.0 GFlops. Actual is 51.3 GFlops.
My processor temps hit 76C on the multi-threaded run! If you run these, be sure the results aren't affected by CPU throttling.
- 2 x Intel Xeon X5482 Harpertown @ 3.2 GHz
- Ubuntu Linux 10 x64
- GCC 4.5.2 x64 - (-O2 -msse3 -fopenmp)
Threads: 1
Seconds = 78.3357
FP Ops = 960000000000
FLOPs = 1.22549e+10
sum = 2.22652
Theoretical SSE Peak: 4 flops * 3.2 GHz = 12.8 GFlops. Actual is 12.3 GFlops.
Threads: 8
Seconds = 78.4733
FP Ops = 7680000000000
FLOPs = 9.78676e+10
sum = 17.8122
Theoretical SSE Peak: 4 flops * 8 cores * 3.2 GHz = 102.4 GFlops. Actual is 97.9 GFlops.
Answer by Patrick Schlüter
There's a point in the Intel architecture that people often forget: the dispatch ports are shared between Int and FP/SIMD. This means that you will only get a certain amount of bursts of FP/SIMD before the loop logic creates bubbles in your floating point stream. Mysticial got more flops out of his code because he used longer strides in his unrolled loop.
If you look at the Nehalem/Sandy Bridge architecture here http://www.realworldtech.com/page.cfm?ArticleID=RWT091810191937&p=6 it's quite clear what happens.
In contrast, it should be easier to reach peak performance on AMD (Bulldozer) as the INT and FP/SIMD pipes have separate issue ports with their own scheduler.
This is only theoretical as I have neither of these processors to test.
Answer by TJD
Branches can definitely keep you from sustaining peak theoretical performance. Do you see a difference if you manually do some loop-unrolling? For example, if you put 5 or 10 times as many ops per loop iteration:
for(int i=0; i<loops/5; i++) {
    mul1*=mul; mul2*=mul; mul3*=mul; mul4*=mul; mul5*=mul;
    sum1+=add; sum2+=add; sum3+=add; sum4+=add; sum5+=add;
    mul1*=mul; mul2*=mul; mul3*=mul; mul4*=mul; mul5*=mul;
    sum1+=add; sum2+=add; sum3+=add; sum4+=add; sum5+=add;
    mul1*=mul; mul2*=mul; mul3*=mul; mul4*=mul; mul5*=mul;
    sum1+=add; sum2+=add; sum3+=add; sum4+=add; sum5+=add;
    mul1*=mul; mul2*=mul; mul3*=mul; mul4*=mul; mul5*=mul;
    sum1+=add; sum2+=add; sum3+=add; sum4+=add; sum5+=add;
    mul1*=mul; mul2*=mul; mul3*=mul; mul4*=mul; mul5*=mul;
    sum1+=add; sum2+=add; sum3+=add; sum4+=add; sum5+=add;
}
Answer by Mackie Messer
Using Intel's icc Version 11.1 on a 2.4 GHz Intel Core 2 Duo I get
Macintosh:~ mackie$ icc -O3 -mssse3 -oaddmul addmul.cc && ./addmul 1000
addmul: 0.105 s, 9.525 Gflops, res=0.000000
Macintosh:~ mackie$ icc -v
Version 11.1
That is very close to the ideal 9.6 Gflops.
EDIT:
Oops, looking at the assembly code it seems that icc not only vectorized the multiplication, but also pulled the additions out of the loop. Forcing stricter fp semantics, the code is no longer vectorized:
Macintosh:~ mackie$ icc -O3 -mssse3 -oaddmul addmul.cc -fp-model precise && ./addmul 1000
addmul: 0.516 s, 1.938 Gflops, res=1.326463
EDIT2:
As requested:
Macintosh:~ mackie$ clang -O3 -mssse3 -oaddmul addmul.cc && ./addmul 1000
addmul: 0.209 s, 4.786 Gflops, res=1.326463
Macintosh:~ mackie$ clang -v
Apple clang version 3.0 (tags/Apple/clang-211.10.1) (based on LLVM 3.0svn)
Target: x86_64-apple-darwin11.2.0
Thread model: posix
The inner loop of clang's code looks like this:
.align 4, 0x90
LBB2_4: ## =>This Inner Loop Header: Depth=1
addsd %xmm2, %xmm3
addsd %xmm2, %xmm14
addsd %xmm2, %xmm5
addsd %xmm2, %xmm1
addsd %xmm2, %xmm4
mulsd %xmm2, %xmm0
mulsd %xmm2, %xmm6
mulsd %xmm2, %xmm7
mulsd %xmm2, %xmm11
mulsd %xmm2, %xmm13
incl %eax
cmpl %r14d, %eax
jl LBB2_4
EDIT3:
Finally, two suggestions: First, if you like this type of benchmarking, consider using the rdtsc instruction instead of gettimeofday(2). It is much more accurate and delivers the time in cycles, which is usually what you are interested in anyway. For gcc and friends you can define it like this:
#include <stdint.h>

static __inline__ uint64_t rdtsc(void)
{
    // rdtsc returns the counter split across edx:eax; combine the two halves
    // explicitly (the "=A" constraint does not do this correctly on x86-64).
    uint32_t lo, hi;
    __asm__ volatile ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}
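One possible way to use it with the programme from the question - a sketch that assumes the TSC ticks at a fixed rate (worth verifying on your own machine) - is to replace the stoptime() calls in main() with something like:

uint64_t c0 = rdtsc();
x = addmul(x, y, n);                 // the kernel from the question
uint64_t cycles = rdtsc() - c0;
printf("%.2f flops/cycle, res=%f\n", (double)n / (double)cycles, x);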
Second, you should run your benchmark program several times and use the best performance only. In modern operating systems many things happen in parallel; the CPU may be in a low-frequency power-saving mode, etc. Running the program repeatedly gives you a result that is closer to the ideal case.
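A minimal sketch of that advice (my own addition, not part of the original answer), reusing stoptime() and addmul() from the question:

double best = 1e30;                  // smallest wall-clock time seen, in seconds
for (int run = 0; run < 10; run++) {
    double t = stoptime();
    x = addmul(x, y, n);
    t = stoptime() - t;
    if (t < best) best = t;
}
printf("best of 10: %.3f s, %.3f Gflops\n", best, (double)n / best / 1e9);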