C++ 使用 AVX CPU 指令:没有“/arch:AVX”的性能不佳
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/7839925/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Using AVX CPU instructions: Poor performance without "/arch:AVX"
提问by Mike
My C++ code uses SSE and now I want to improve it to support AVX when it is available. So I detect when AVX is available and call a function that uses AVX commands. I use Win7 SP1 + VS2010 SP1 and a CPU with AVX.
我的 C++ 代码使用 SSE,现在我想改进它以在可用时支持 AVX。所以我检测 AVX 何时可用并调用一个使用 AVX 命令的函数。我使用 Win7 SP1 + VS2010 SP1 和带有 AVX 的 CPU。
To use AVX, it is necessary to include this:
要使用 AVX,必须包含以下内容:
#include "immintrin.h"
and then you can use intrinsics AVX functions like _mm256_mul_ps
, _mm256_add_ps
etc.
The problem is that by default, VS2010 produces code that works very slowly and shows the warning:
然后你可以使用内在的 AVX 函数,比如_mm256_mul_ps
,_mm256_add_ps
等等。问题是,默认情况下,VS2010 生成的代码运行速度非常慢,并显示警告:
warning C4752: found Intel(R) Advanced Vector Extensions; consider using /arch:AVX
警告 C4752:发现英特尔(R) 高级矢量扩展;考虑使用 /arch:AVX
It seems VS2010 actually does not use AVX instructions, but instead, emulates them. I added /arch:AVX
to the compiler options and got good results. But this option tells the compiler to use AVX commands everywhere when possible. So my code may crash on CPU that does not support AVX!
看来 VS2010 实际上不使用 AVX 指令,而是模拟它们。我添加/arch:AVX
到编译器选项并得到了很好的结果。但是这个选项告诉编译器在可能的情况下在任何地方使用 AVX 命令。所以我的代码可能会在不支持 AVX 的 CPU 上崩溃!
So the question is how to make VS2010 compiler to produce AVX code but only when I specify AVX intrinsics directly. For SSE it works, I just use SSE intrinsics functions and it produce SSE code without any compiler options like /arch:SSE
. But for AVX it does not work for some reason.
所以问题是如何让 VS2010 编译器生成 AVX 代码,但只有当我直接指定 AVX 内在函数时。对于 SSE,它可以工作,我只使用 SSE 内在函数,它生成 SSE 代码而没有任何编译器选项,如/arch:SSE
. 但是对于 AVX,由于某种原因它不起作用。
回答by Mysticial
The behavior that you are seeing is the result of expensive state-switching.
您所看到的行为是昂贵的状态切换的结果。
See page 102 of Agner Fog's manual:
请参阅 Agner Fog 手册的第 102 页:
http://www.agner.org/optimize/microarchitecture.pdf
http://www.agner.org/optimize/microarchitecture.pdf
Every time you improperly switch back and forth between SSE and AVX instructions, you will pay an extremely high (~70) cycle penalty.
每次您在 SSE 和 AVX 指令之间不正确地来回切换时,您都将付出极高的(~70)周期损失。
When you compile without /arch:AVX
, VS2010 will generate SSE instructions, but will still use AVX wherever you have AVX intrinsics. Therefore, you'll get code that has both SSE and AVX instructions - which will have those state-switching penalties. (VS2010 knows this, so it emits that warning you're seeing.)
当您不使用 进行编译时/arch:AVX
,VS2010 将生成 SSE 指令,但在您拥有 AVX 内在函数的任何地方仍将使用 AVX。因此,您将获得同时具有 SSE 和 AVX 指令的代码 - 这将具有那些状态切换惩罚。(VS2010 知道这一点,所以它会发出您看到的警告。)
Therefore, you should use either all SSE, or all AVX. Specifying /arch:AVX
tells the compiler to use all AVX.
因此,您应该使用全部 SSE 或全部 AVX。指定/arch:AVX
告诉编译器使用所有 AVX。
It sounds like you're trying to make multiple code paths: one for SSE, and one for AVX.
For this, I suggest you separate your SSE and AVX code into two different compilation units. (one compiled with /arch:AVX
and one without) Then link them together and make a dispatcher to choose based on the what hardware it's running on.
听起来您正在尝试创建多个代码路径:一个用于 SSE,另一个用于 AVX。为此,我建议您将 SSE 和 AVX 代码分成两个不同的编译单元。(一个编译,/arch:AVX
一个没有编译)然后将它们链接在一起并根据它运行的硬件进行选择。
If you needto mix SSE and AVX, be sure to use _mm256_zeroupper()
or _mm256_zeroall()
appropriately to avoid the state-switching penalties.
如果您需要混合使用 SSE 和 AVX,请务必使用_mm256_zeroupper()
或_mm256_zeroall()
适当地避免状态切换惩罚。
回答by chappjc
tl;dr
tl;博士
Use _mm256_zeroupper();
or _mm256_zeroall();
around sections of code using AVX (before or after depending on function arguments). Only use option /arch:AVX
for source files with AVX rather than for an entire project to avoid breaking support for legacy-encoded SSE-only code paths.
使用_mm256_zeroupper();
或_mm256_zeroall();
围绕使用 AVX 的代码部分(之前或之后,取决于函数参数)。仅对/arch:AVX
带有 AVX 的源文件而不是整个项目使用选项,以避免破坏对遗留编码的仅 SSE 代码路径的支持。
Cause
原因
I think the best explanation is in the Intel article, "Avoiding AVX-SSE Transition Penalties"(PDF). The abstract states:
我认为最好的解释是英特尔的文章“避免 AVX-SSE 转换惩罚”(PDF)。摘要说:
Transitioning between 256-bit Intel? AVX instructions and legacy Intel? SSE instructions within a program may cause performance penalties because the hardware must save and restore the upper 128 bits of the YMM registers.
在 256 位 Intel 之间转换?AVX 指令和传统英特尔?程序中的 SSE 指令可能会导致性能下降,因为硬件必须保存和恢复 YMM 寄存器的高 128 位。
Separating your AVX and SSE code into different compilation units may NOT helpif you switch between calling code from both SSE-enabled and AVX-enabled object files, because the transition may occur when AVX instructions or assembly are mixed with any of (from the Intel paper):
如果您在从启用 SSE 和启用 AVX 的目标文件调用代码之间切换,将 AVX 和 SSE 代码分离到不同的编译单元可能无济于事,因为当 AVX 指令或程序集与任何(来自 Intel纸):
- 128-bit intrinsic instructions
- SSE inline assembly
- C/C++ floating point code that is compiled to Intel? SSE
- Calls to functions or libraries that include any of the above
- 128 位内部指令
- SSE 内联汇编
- 编译为 Intel 的 C/C++ 浮点代码?上证所
- 调用包含上述任何内容的函数或库
This means there may even be penalties when linking with external codeusing SSE.
这意味着在使用 SSE与外部代码链接时甚至可能会受到惩罚。
Details
细节
There are 3 processor states defined by the AVX instructions, and one of the states is where all of the YMMregisters are split, allowing the lower half to be used by SSE instructions. The Intel document "Intel? AVX State Transitions: Migrating SSE Code to AVX" provides a diagram of these states:
AVX 指令定义了 3 种处理器状态,其中一种状态是所有YMM寄存器都被拆分,允许SSE 指令使用下半部分。Intel 文档“ Intel? AVX State Transitions: Migrating SSE Code to AVX”提供了这些状态的图表:
When in state B (AVX-256 mode), all bits of the YMM registers are in use. When an SSE instruction is called, a transition to state C must occur, and this is where there is a penalty. The upper half of all YMM registers must be saved into an internal buffer before SSE can start, even if they happen to be zeros. The cost of the transitions is on the "order of 50-80 clock cycles on Sandy Bridge hardware". There is also a penalty going from C -> A, as diagrammed in Figure 2.
当处于状态 B(AVX-256 模式)时,YMM 寄存器的所有位都在使用中。当调用 SSE 指令时,必须发生到状态 C 的转换,这就是惩罚的地方。在 SSE 启动之前,所有 YMM 寄存器的上半部分必须保存到内部缓冲区中,即使它们碰巧为零。转换的成本是“Sandy Bridge 硬件上 50-80 个时钟周期的数量级”。还有一个惩罚是从 C -> A,如图 2 所示。
You can also find details about the state switching penalty causing this slowdown on page 130, Section 9.12, "Transitions between VEXand non-VEX modes" in Agner Fog's optimization guide(of version updated 2014-08-07), referenced in Mystical's answer. According to his guide, any transition to/from this state takes "about 70 clock cycles on Sandy Bridge". Just as the Intel document states, this is an avoidable transition penalty.
您还可以在Agner Fog 的优化指南(2014 年 8 月 7 日更新的版本)中的第 130 页第 9.12 节“ VEX和非 VEX 模式之间的转换”中找到有关导致这种减速的状态切换惩罚的详细信息,在Mystical 的回答中引用. 根据他的指南,任何到/从这个状态的转换都需要“在 Sandy Bridge 上大约 70 个时钟周期”。正如英特尔文件所述,这是一种可以避免的过渡惩罚。
Resolution
解析度
To avoid the transition penalties you can either remove all legacy SSE code, instruct the compiler to convert all SSE instructions to their VEX encoded form of 128-bit instructions (if compiler is capable), or put the YMM registers in a known zero state before transitioning between AVX and SSE code. Essentially, to maintain the separate SSE code path, you must zero out the upper 128-bits of all 16 YMM registers (issuing a VZEROUPPER
instruction) after any code that uses AVX instructions. Zeroing these bits manually forces a transition to state A, and avoids the expensive penalty since the YMM values do not need to be stored in an internal buffer by hardware. The intrinsic that performs this instruction is _mm256_zeroupper
. The description for this intrinsic is very informative:
为避免转换惩罚,您可以删除所有旧的 SSE 代码,指示编译器将所有 SSE 指令转换为其 VEX 编码的 128 位指令形式(如果编译器有能力),或者将 YMM 寄存器置于已知的零状态之前在 AVX 和 SSE 代码之间转换。本质上,为了维护单独的 SSE 代码路径,您必须在任何使用 AVX 指令的代码之后将所有 16 个 YMM 寄存器(发出VZEROUPPER
指令)的高 128 位清零。手动将这些位清零会强制转换到状态 A,并避免昂贵的惩罚,因为 YMM 值不需要由硬件存储在内部缓冲区中。执行此指令的内在函数是_mm256_zeroupper
。此内在函数的描述非常有用:
This intrinsic is useful to clear the upper bits of the YMM registers when transitioning between Intel? Advanced Vector Extensions (Intel? AVX) instructions and legacy Intel? Supplemental SIMD Extensions (Intel? SSE) instructions. There is no transition penalty if an application clears the upper bits of all YMM registers(sets to ‘0') via
VZEROUPPER
, the corresponding instruction for this intrinsic, before transitioning between Intel? Advanced Vector Extensions (Intel? AVX) instructions and legacy Intel? Supplemental SIMD Extensions (Intel? SSE) instructions.
在 Intel? 之间转换时,这个内在函数对于清除 YMM 寄存器的高位很有用 高级矢量扩展(英特尔?AVX)指令和传统英特尔?补充 SIMD 扩展(英特尔?SSE)说明。有没有如果应用程序清除所有ymm寄存器的高位过渡惩罚(套到“0”)通过
VZEROUPPER
,该固有的相应的指令,英特尔之间转变前?高级矢量扩展(英特尔?AVX)指令和传统英特尔?补充 SIMD 扩展(英特尔?SSE)说明。
In Visual Studio 2010+ (maybe even older), you get this intrinsicwith immintrin.h.
在 Visual Studio 2010+(可能更老)中,您可以通过 immintrin.h获得这个内在函数。
Note that zeroing out the bits with other methods does not eliminate the penalty - the VZEROUPPER
or VZEROALL
instructions must be used.
请注意,使用其他方法将位清零并不能消除惩罚 -必须使用VZEROUPPER
orVZEROALL
指令。
One automatic solution implemented by the Intel Compiler is to insert a VZEROUPPER
at the beginningof each function containing Intel AVX code if none of the arguments are a YMM register or __m256
/__m256d
/__m256i
datatype, and at the endof functions if the returned value is not a YMM register or __m256
/__m256d
/__m256i
datatype.
由英特尔编译器实现的一个自动解决方案是插入一个VZEROUPPER
在开始时包含英特尔AVX代码中的每个函数的,如果没有的参数是一个YMM寄存器或__m256
/ __m256d
/__m256i
数据类型,并且在端部的功能,如果返回值不是YMM寄存器或__m256
/ __m256d
/__m256i
数据类型。
In the wild
在野外
This VZEROUPPER
solution is used by FFTW to generate a library with both SSE and AVX support. See simd-avx.h:
VZEROUPPER
FFTW 使用此解决方案生成具有 SSE 和 AVX 支持的库。见simd-avx.h:
/* Use VZEROUPPER to avoid the penalty of switching from AVX to SSE.
See Intel Optimization Manual (April 2011, version 248966), Section
11.3 */
#define VLEAVE _mm256_zeroupper
Then VLEAVE();
is called at the end of everyfunction using intrinsics for AVX instructions.
然后VLEAVE();
在每个函数结束时使用 AVX 指令的内在函数调用。