GCC SSE code optimization in C

Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/7919304/

GCC SSE code optimization

Tags: c, optimization, sse, compiler-optimization, hpc

Asked by Genís

This post is closely related to another one I posted some days ago. This time, I wrote a simple piece of code that just adds two arrays element by element, multiplies the result by the values in another array and stores it in a fourth array; all variables are double-precision floating point.

I made two versions of that code: one with SSE instructions, using calls to SSE intrinsics, and another one without them. I then compiled both with gcc at the -O0 optimization level. I show them below:

// SSE VERSION

#define N 10000
#define NTIMES 100000
#include <time.h>
#include <stdio.h>
#include <xmmintrin.h>
#include <pmmintrin.h>

double a[N] __attribute__((aligned(16)));
double b[N] __attribute__((aligned(16)));
double c[N] __attribute__((aligned(16)));
double r[N] __attribute__((aligned(16)));

int main(void){
  int i, times;
  for( times = 0; times < NTIMES; times++ ){
    for( i = 0; i < N; i += 2 ){
      /* process two doubles per iteration; prefetch the next pair */
      __m128d mm_a = _mm_load_pd( &a[i] );
      _mm_prefetch( (const char *)&a[i+4], _MM_HINT_T0 );
      __m128d mm_b = _mm_load_pd( &b[i] );
      _mm_prefetch( (const char *)&b[i+4], _MM_HINT_T0 );
      __m128d mm_c = _mm_load_pd( &c[i] );
      _mm_prefetch( (const char *)&c[i+4], _MM_HINT_T0 );
      __m128d mm_r;
      mm_r = _mm_add_pd( mm_a, mm_b );
      mm_a = _mm_mul_pd( mm_r, mm_c );
      _mm_store_pd( &r[i], mm_a );
    }
  }
}

//NO SSE VERSION
//same definitions as before
int main(void){
  int i, times;
  for( times = 0; times < NTIMES; times++ ){
    for( i = 0; i < N; i++ ){
      r[i] = (a[i]+b[i])*c[i];
    }
  }
}

When compiling them with -O0, gcc makes use of XMM/MMX registers and SSE instructions, unless it is specifically given the -mno-sse (and related) options. I inspected the assembly code generated for the second version and noticed that it makes use of movsd, addsd and mulsd instructions. So it uses SSE instructions, but only those that operate on the lowest part of the registers, if I am not wrong. The assembly code generated for the first version made use, as expected, of the addpd and mulpd instructions, though a considerably larger amount of assembly code was generated.

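In intrinsic terms, those scalar instructions correspond to the _sd intrinsics, which touch only the low 64 bits of an XMM register, while the _pd intrinsics used in the first version operate on both lanes at once. A minimal sketch of the contrast (my own illustration, assuming SSE2):

#include <emmintrin.h>

/* packed: compiles to addpd, adding two doubles at once */
__m128d add_packed(const double *x, const double *y){
    return _mm_add_pd( _mm_load_pd(x), _mm_load_pd(y) );
}

/* scalar: compiles to addsd, adding only the low lane, as in the -O0 output */
__m128d add_scalar(const double *x, const double *y){
    return _mm_add_sd( _mm_load_sd(x), _mm_load_sd(y) );
}
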
Anyway, the first code should, as far as I know, take better advantage of the SIMD paradigm, since two result values are computed in every iteration. Even so, the second code performs some 25 per cent faster than the first one. I also ran a test with single-precision values and got similar results. What's the reason for that?

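The post does not show how the comparison was timed; a minimal harness around the scalar version, assuming clock() resolution is good enough for runs of this length, could look like this:

#include <stdio.h>
#include <time.h>

#define N 10000
#define NTIMES 100000

double a[N] __attribute__((aligned(16)));
double b[N] __attribute__((aligned(16)));
double c[N] __attribute__((aligned(16)));
double r[N] __attribute__((aligned(16)));

int main(void){
  int i, times;
  clock_t t0, t1;
  t0 = clock();                           /* start timestamp */
  for( times = 0; times < NTIMES; times++ )
    for( i = 0; i < N; i++ )
      r[i] = (a[i]+b[i])*c[i];
  t1 = clock();                           /* end timestamp */
  printf("elapsed: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
  return 0;
}
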
Answered by chill

Vectorization in GCC is enabled at -O3. That's why at -O0, you see only the ordinary scalar SSE2 instructions (movsd, addsd, etc). Using GCC 4.6.1 and your second example:

#define N 10000
#define NTIMES 100000

double a[N] __attribute__ ((aligned (16)));
double b[N] __attribute__ ((aligned (16)));
double c[N] __attribute__ ((aligned (16)));
double r[N] __attribute__ ((aligned (16)));

int
main (void)
{
  int i, times;
  for (times = 0; times < NTIMES; times++)
    {
      for (i = 0; i < N; ++i)
        r[i] = (a[i] + b[i]) * c[i];
    }

  return 0;
}

and compiling with gcc -S -O3 -msse2 sse.c produces the following instructions for the inner loop, which is pretty good:

.L3:
    movapd  a(%eax), %xmm0
    addpd   b(%eax), %xmm0
    mulpd   c(%eax), %xmm0
    movapd  %xmm0, r(%eax)
    addl    $16, %eax
    cmpl    $80000, %eax
    jne .L3

As you can see, with vectorization enabled GCC emits code to perform two loop iterations in parallel. It can be improved, though - this code uses the lower 128 bits of the SSE registers, but it can use the full 256-bit YMM registers by enabling the AVX encoding of SSE instructions (if available on the machine). So, compiling the same program with gcc -S -O3 -msse2 -mavx sse.c gives for the inner loop:

.L3:
    vmovapd a(%eax), %ymm0
    vaddpd  b(%eax), %ymm0, %ymm0
    vmulpd  c(%eax), %ymm0, %ymm0
    vmovapd %ymm0, r(%eax)
    addl    $32, %eax
    cmpl    $80000, %eax
    jne .L3

Note the v in front of each instruction, and that the instructions use the 256-bit YMM registers: four iterations of the original loop are executed in parallel.

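You can also ask the compiler to confirm what the vectorizer did instead of reading the assembly. Assuming GCC 4.9 or newer (older releases used the -ftree-vectorizer-verbose option instead):

gcc -S -O3 -mavx -fopt-info-vec sse.c

prints a note for every loop it managed to vectorize; the exact wording of the note varies between GCC versions.
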
Answered by Luca Citi

I would like to extend chill's answer and draw your attention to the fact that GCC seems unable to make the same smart use of AVX instructions when iterating backwards.

Just replace the inner loop in chill's sample code with:

for (i = N-1; i >= 0; --i)
    r[i] = (a[i] + b[i]) * c[i];

GCC (4.8.4) with options -S -O3 -mavx produces:

.L5:
    vmovsd  a+79992(%rax), %xmm0
    subq    $8, %rax
    vaddsd  b+80000(%rax), %xmm0, %xmm0
    vmulsd  c+80000(%rax), %xmm0, %xmm0
    vmovsd  %xmm0, r+80000(%rax)
    cmpq    $-80000, %rax
    jne     .L5
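
Since the loop body has no dependence between iterations, the backward loop computes exactly the same values as the forward one, so a simple workaround (my suggestion, not part of the original answer) is to iterate forward and let GCC emit the vectorized code shown above:

for (i = 0; i < N; ++i)   /* same results, but GCC vectorizes this form */
    r[i] = (a[i] + b[i]) * c[i];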