Using AVX intrinsics instead of SSE does not improve speed -- why?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/8924729/

Date: 2020-08-28 19:20:32  Source: igfitidea

Using AVX intrinsics instead of SSE does not improve speed -- why?

c++ · performance · gcc · sse · avx

Asked by user1158218

I've been using Intel's SSE intrinsics for quite some time with good performance gains. Hence, I expected the AVX intrinsics to further speed up my programs. Unfortunately, this has not been the case so far. Probably I am making a stupid mistake, so I would be very grateful if somebody could help me out.


I use Ubuntu 11.10 with g++ 4.6.1. I compiled my program (see below) with


g++ simpleExample.cpp -O3 -march=native -o simpleExample

The test system has a Intel i7-2600 CPU.


Here is the code which exemplifies my problem. On my system, I get the output


98.715 ms, b[42] = 0.900038 // Naive
24.457 ms, b[42] = 0.900038 // SSE
24.646 ms, b[42] = 0.900038 // AVX

Note that the computation sqrt(sqrt(sqrt(x))) was only chosen to ensure that memory bandwidth does not limit execution speed; it is just an example.


simpleExample.cpp:


#include <immintrin.h>
#include <iostream>
#include <math.h> 
#include <sys/time.h>

using namespace std;

// -----------------------------------------------------------------------------
// This function returns the current time, expressed as seconds since the Epoch
// -----------------------------------------------------------------------------
double getCurrentTime(){
  struct timeval curr;
  struct timezone tz;
  gettimeofday(&curr, &tz);
  double tmp = static_cast<double>(curr.tv_sec) * static_cast<double>(1000000)
             + static_cast<double>(curr.tv_usec);
  return tmp*1e-6;
}

// -----------------------------------------------------------------------------
// Main routine
// -----------------------------------------------------------------------------
int main() {

  srand48(0);            // seed PRNG
  double e,s;            // timestamp variables
  float *a, *b;          // data pointers
  float *pA,*pB;         // work pointer
  __m128 rA,rB;          // variables for SSE
  __m256 rA_AVX, rB_AVX; // variables for AVX

  // define vector size 
  const int vector_size = 10000000;

  // allocate memory 
  a = (float*) _mm_malloc (vector_size*sizeof(float),32);
  b = (float*) _mm_malloc (vector_size*sizeof(float),32);

  // initialize vectors //
  for(int i=0;i<vector_size;i++) {
    a[i]=fabs(drand48());
    b[i]=0.0f;
  }

// +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
// Naive implementation
// +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  s = getCurrentTime();
  for (int i=0; i<vector_size; i++){
    b[i] = sqrtf(sqrtf(sqrtf(a[i])));
  }
  e = getCurrentTime();
  cout << (e-s)*1000 << " ms" << ", b[42] = " << b[42] << endl;

// -----------------------------------------------------------------------------
  for(int i=0;i<vector_size;i++) {
    b[i]=0.0f;
  }
// -----------------------------------------------------------------------------

// +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
// SSE implementation
// +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  pA = a; pB = b;

  s = getCurrentTime();
  for (int i=0; i<vector_size; i+=4){
    rA   = _mm_load_ps(pA);
    rB   = _mm_sqrt_ps(_mm_sqrt_ps(_mm_sqrt_ps(rA)));
    _mm_store_ps(pB,rB);
    pA += 4;
    pB += 4;
  }
  e = getCurrentTime();
  cout << (e-s)*1000 << " ms" << ", b[42] = " << b[42] << endl;

// -----------------------------------------------------------------------------
  for(int i=0;i<vector_size;i++) {
    b[i]=0.0f;
  }
// -----------------------------------------------------------------------------

// +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
// AVX implementation
// +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  pA = a; pB = b;

  s = getCurrentTime();
  for (int i=0; i<vector_size; i+=8){
    rA_AVX   = _mm256_load_ps(pA);
    rB_AVX   = _mm256_sqrt_ps(_mm256_sqrt_ps(_mm256_sqrt_ps(rA_AVX)));
    _mm256_store_ps(pB,rB_AVX);
    pA += 8;
    pB += 8;
  }
  e = getCurrentTime();
  cout << (e-s)*1000 << " ms" << ", b[42] = " << b[42] << endl;

  _mm_free(a);
  _mm_free(b);

  return 0;
}

Any help is appreciated!


Answered by Norbert P.

This is because VSQRTPS (the AVX instruction) takes exactly twice as many cycles as SQRTPS (the SSE instruction) on a Sandy Bridge processor. See Agner Fog's optimization guide: instruction tables, page 88.


Instructions like square root and division don't benefit from AVX. On the other hand, additions, multiplications, etc., do.


Answered by Evgeny Kluev

If you are interested in increasing square root performance, then instead of VSQRTPS you can use VRSQRTPS together with the Newton-Raphson formula:


x0 = vrsqrtps(a)
x1 = 0.5 * x0 * (3 - (a * x0) * x0)

VRSQRTPS itself doesn't benefit from AVX, but other calculations do.


The raw estimate is only accurate to about 12 bits; one Newton-Raphson step brings it to roughly 23 bits. Use it if 23 bits of precision is enough for you.


Answered by Salah Saleh

Just for completeness. A Newton-Raphson (NR) implementation of operations like division or square root is only beneficial if you have a limited number of those operations in your code, because the alternative method generates more pressure on other execution ports, such as the multiplication and addition ports. That is basically why x86 architectures have dedicated hardware units for these operations instead of relying on software alternatives (like NR). I quote from the Intel 64 and IA-32 Architectures Optimization Reference Manual, p. 556:


"In some cases, when the divide or square root operations are part of a larger algorithm that hides some of the latency of these operations, the approximation with Newton-Raphson can slow down execution."


So be careful when using NR in large algorithms. Actually, I wrote my master's thesis on this point, and I will leave a link to it here for future reference once it is published.


Also, for people who wonder about the throughput and latency of particular instructions, have a look at IACA. It is a very useful tool provided by Intel for statically analyzing the in-core execution performance of code.


Edit: here is a link to the thesis for those who are interested: thesis


Answered by SoapBox

Depending on your processor hardware, the AVX instructions may be emulated in hardware as SSE instructions. You'd need to look up your processor's part number to get exact specs on it, but this is one of the main differences between low-end and high-end Intel processors: the number of specialized execution units versus hardware emulation.
