GCC SSE code optimization in C

Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/7919304/

GCC SSE code optimization

Tags: c, optimization, sse, compiler-optimization, hpc

Asked by Genís

This post is closely related to another one I posted some days ago. This time, I wrote a simple piece of code that just adds two arrays element by element, multiplies the result by the values in another array and stores it in a fourth array; all variables are double-precision floating point.

I made two versions of that code: one with SSE instructions, using calls to SSE intrinsics, and another one without them. I then compiled both with gcc at the -O0 optimization level. I show them below:

// SSE VERSION

#define N 10000
#define NTIMES 100000
#include <time.h>
#include <stdio.h>
#include <xmmintrin.h>
#include <pmmintrin.h>

double a[N] __attribute__((aligned(16)));
double b[N] __attribute__((aligned(16)));
double c[N] __attribute__((aligned(16)));
double r[N] __attribute__((aligned(16)));

int main(void){
  int i, times;
  for( times = 0; times < NTIMES; times++ ){
    for( i = 0; i < N; i += 2 ){
      /* process two doubles per iteration; prefetch the next pair */
      __m128d mm_a = _mm_load_pd( &a[i] );
      _mm_prefetch( (const char *)&a[i+4], _MM_HINT_T0 );
      __m128d mm_b = _mm_load_pd( &b[i] );
      _mm_prefetch( (const char *)&b[i+4], _MM_HINT_T0 );
      __m128d mm_c = _mm_load_pd( &c[i] );
      _mm_prefetch( (const char *)&c[i+4], _MM_HINT_T0 );
      __m128d mm_r;
      mm_r = _mm_add_pd( mm_a, mm_b );
      mm_a = _mm_mul_pd( mm_r, mm_c );
      _mm_store_pd( &r[i], mm_a );
    }
  }
}

//NO SSE VERSION
//same definitions as before
int main(void){
  int i, times;
  for( times = 0; times < NTIMES; times++ ){
    for( i = 0; i < N; i++ ){
      r[i] = (a[i]+b[i])*c[i];
    }
  }
}

When compiling them with -O0, gcc makes use of XMM/MMX registers and SSE instructions, unless it is specifically given the -mno-sse (and related) options. I inspected the assembly code generated for the second version and noticed that it makes use of movsd, addsd and mulsd instructions. So it uses SSE instructions, but only those that operate on the lowest part of the registers, if I am not wrong. The assembly code generated for the first version made use, as expected, of the addpd and mulpd instructions, though a considerably larger amount of assembly code was generated.

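In intrinsic terms, those scalar instructions correspond to the _sd intrinsics, which touch only the low 64 bits of an XMM register, while the _pd intrinsics used in the first version operate on both lanes at once. A minimal sketch of the contrast (my own illustration, assuming SSE2):

#include <emmintrin.h>

/* packed: compiles to addpd, adding two doubles at once */
__m128d add_packed(const double *x, const double *y){
    return _mm_add_pd( _mm_load_pd(x), _mm_load_pd(y) );
}

/* scalar: compiles to addsd, adding only the low lane, as in the -O0 output */
__m128d add_scalar(const double *x, const double *y){
    return _mm_add_sd( _mm_load_sd(x), _mm_load_sd(y) );
}
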
Anyway, the first code should, as far as I know, take better advantage of the SIMD paradigm, since two result values are computed in every iteration. Even so, the second code performs some 25 per cent faster than the first one. I also ran a test with single-precision values and got similar results. What's the reason for that?

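The post does not show how the comparison was timed; a minimal harness around the scalar version, assuming clock() resolution is good enough for runs of this length, could look like this:

#include <stdio.h>
#include <time.h>

#define N 10000
#define NTIMES 100000

double a[N] __attribute__((aligned(16)));
double b[N] __attribute__((aligned(16)));
double c[N] __attribute__((aligned(16)));
double r[N] __attribute__((aligned(16)));

int main(void){
  int i, times;
  clock_t t0, t1;
  t0 = clock();                           /* start timestamp */
  for( times = 0; times < NTIMES; times++ )
    for( i = 0; i < N; i++ )
      r[i] = (a[i]+b[i])*c[i];
  t1 = clock();                           /* end timestamp */
  printf("elapsed: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
  return 0;
}
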
Answered by chill

Vectorization in GCC is enabled at -O3. That's why at -O0, you see only the ordinary scalar SSE2 instructions (movsd, addsd, etc). Using GCC 4.6.1 and your second example:

#define N 10000
#define NTIMES 100000

double a[N] __attribute__ ((aligned (16)));
double b[N] __attribute__ ((aligned (16)));
double c[N] __attribute__ ((aligned (16)));
double r[N] __attribute__ ((aligned (16)));

int
main (void)
{
  int i, times;
  for (times = 0; times < NTIMES; times++)
    {
      for (i = 0; i < N; ++i)
        r[i] = (a[i] + b[i]) * c[i];
    }

  return 0;
}

and compiling with gcc -S -O3 -msse2 sse.c produces the following instructions for the inner loop, which is pretty good:

.L3:
    movapd  a(%eax), %xmm0
    addpd   b(%eax), %xmm0
    mulpd   c(%eax), %xmm0
    movapd  %xmm0, r(%eax)
    addl    $16, %eax
    cmpl    $80000, %eax
    jne .L3

As you can see, with vectorization enabled GCC emits code to perform two loop iterations in parallel. It can be improved, though - this code uses the lower 128 bits of the SSE registers, but it can use the full 256-bit YMM registers by enabling the AVX encoding of SSE instructions (if available on the machine). So, compiling the same program with gcc -S -O3 -msse2 -mavx sse.c gives for the inner loop:

.L3:
    vmovapd a(%eax), %ymm0
    vaddpd  b(%eax), %ymm0, %ymm0
    vmulpd  c(%eax), %ymm0, %ymm0
    vmovapd %ymm0, r(%eax)
    addl    $32, %eax
    cmpl    $80000, %eax
    jne .L3

Note the v in front of each instruction, and that the instructions use the 256-bit YMM registers: four iterations of the original loop are executed in parallel.

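You can also ask the compiler to confirm what the vectorizer did instead of reading the assembly. Assuming GCC 4.9 or newer (older releases used the -ftree-vectorizer-verbose option instead):

gcc -S -O3 -mavx -fopt-info-vec sse.c

prints a note for every loop it managed to vectorize; the exact wording of the note varies between GCC versions.
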
Answered by Luca Citi

I would like to extend chill's answer and draw your attention to the fact that GCC seems unable to make the same smart use of AVX instructions when iterating backwards.

Just replace the inner loop in chill's sample code with:

for (i = N-1; i >= 0; --i)
    r[i] = (a[i] + b[i]) * c[i];

GCC (4.8.4) with options -S -O3 -mavx produces:

.L5:
    vmovsd  a+79992(%rax), %xmm0
    subq    $8, %rax
    vaddsd  b+80000(%rax), %xmm0, %xmm0
    vmulsd  c+80000(%rax), %xmm0, %xmm0
    vmovsd  %xmm0, r+80000(%rax)
    cmpq    $-80000, %rax
    jne     .L5
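
Since the loop body has no dependence between iterations, the backward loop computes exactly the same values as the forward one, so a simple workaround (my suggestion, not part of the original answer) is to iterate forward and let GCC emit the vectorized code shown above:

for (i = 0; i < N; ++i)   /* same results, but GCC vectorizes this form */
    r[i] = (a[i] + b[i]) * c[i];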