C++ 使用 AVX CPU 指令:没有“/arch:AVX”的性能不佳

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/7839925/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-28 17:34:52  来源:igfitidea点击:

Using AVX CPU instructions: Poor performance without "/arch:AVX"

c++performancevisual-studio-2010sseavx

提问by Mike

My C++ code uses SSE and now I want to improve it to support AVX when it is available. So I detect when AVX is available and call a function that uses AVX commands. I use Win7 SP1 + VS2010 SP1 and a CPU with AVX.

我的 C++ 代码使用 SSE,现在我想改进它以在可用时支持 AVX。所以我检测 AVX 何时可用并调用一个使用 AVX 命令的函数。我使用 Win7 SP1 + VS2010 SP1 和带有 AVX 的 CPU。

To use AVX, it is necessary to include this:

要使用 AVX,必须包含以下内容:

#include "immintrin.h"

and then you can use intrinsics AVX functions like _mm256_mul_ps, _mm256_add_psetc. The problem is that by default, VS2010 produces code that works very slowly and shows the warning:

然后你可以使用内在的 AVX 函数,比如_mm256_mul_ps_mm256_add_ps等等。问题是,默认情况下,VS2010 生成的代码运行速度非常慢,并显示警告:

warning C4752: found Intel(R) Advanced Vector Extensions; consider using /arch:AVX

警告 C4752:发现英特尔(R) 高级矢量扩展;考虑使用 /arch:AVX

It seems VS2010 actually does not use AVX instructions, but instead, emulates them. I added /arch:AVXto the compiler options and got good results. But this option tells the compiler to use AVX commands everywhere when possible. So my code may crash on CPU that does not support AVX!

看来 VS2010 实际上不使用 AVX 指令,而是模拟它们。我添加/arch:AVX到编译器选项并得到了很好的结果。但是这个选项告诉编译器在可能的情况下在任何地方使用 AVX 命令。所以我的代码可能会在不支持 AVX 的 CPU 上崩溃!

So the question is how to make VS2010 compiler to produce AVX code but only when I specify AVX intrinsics directly. For SSE it works, I just use SSE intrinsics functions and it produce SSE code without any compiler options like /arch:SSE. But for AVX it does not work for some reason.

所以问题是如何让 VS2010 编译器生成 AVX 代码,但只有当我直接指定 AVX 内在函数时。对于 SSE,它可以工作,我只使用 SSE 内在函数,它生成 SSE 代码而没有任何编译器选项,如/arch:SSE. 但是对于 AVX,由于某种原因它不起作用。

回答by Mysticial

The behavior that you are seeing is the result of expensive state-switching.

您所看到的行为是昂贵的状态切换的结果。

See page 102 of Agner Fog's manual:

请参阅 Agner Fog 手册的第 102 页:

http://www.agner.org/optimize/microarchitecture.pdf

http://www.agner.org/optimize/microarchitecture.pdf

Every time you improperly switch back and forth between SSE and AVX instructions, you will pay an extremely high (~70) cycle penalty.

每次您在 SSE 和 AVX 指令之间不正确地来回切换时,您都将付出极高的(~70)周期损失。

When you compile without /arch:AVX, VS2010 will generate SSE instructions, but will still use AVX wherever you have AVX intrinsics. Therefore, you'll get code that has both SSE and AVX instructions - which will have those state-switching penalties. (VS2010 knows this, so it emits that warning you're seeing.)

当您不使用 进行编译时/arch:AVX,VS2010 将生成 SSE 指令,但在您拥有 AVX 内在函数的任何地方仍将使用 AVX。因此,您将获得同时具有 SSE 和 AVX 指令的代码 - 这将具有那些状态切换惩罚。(VS2010 知道这一点,所以它会发出您看到的警告。)

Therefore, you should use either all SSE, or all AVX. Specifying /arch:AVXtells the compiler to use all AVX.

因此,您应该使用全部 SSE 或全部 AVX。指定/arch:AVX告诉编译器使用所有 AVX。

It sounds like you're trying to make multiple code paths: one for SSE, and one for AVX. For this, I suggest you separate your SSE and AVX code into two different compilation units. (one compiled with /arch:AVXand one without) Then link them together and make a dispatcher to choose based on the what hardware it's running on.

听起来您正在尝试创建多个代码路径:一个用于 SSE,另一个用于 AVX。为此,我建议您将 SSE 和 AVX 代码分成两个不同的编译单元。(一个编译,/arch:AVX一个没有编译)然后将它们链接在一起并根据它运行的硬件进行选择。

If you needto mix SSE and AVX, be sure to use _mm256_zeroupper()or _mm256_zeroall()appropriately to avoid the state-switching penalties.

如果您需要混合使用 SSE 和 AVX,请务必使用_mm256_zeroupper()_mm256_zeroall()适当地避免状态切换惩罚。

回答by chappjc

tl;dr

tl;博士

Use _mm256_zeroupper();or _mm256_zeroall();around sections of code using AVX (before or after depending on function arguments). Only use option /arch:AVXfor source files with AVX rather than for an entire project to avoid breaking support for legacy-encoded SSE-only code paths.

使用_mm256_zeroupper();_mm256_zeroall();围绕使用 AVX 的代码部分(之前或之后,取决于函数参数)。仅对/arch:AVX带有 AVX 的源文件而不是整个项目使用选项,以避免破坏对遗留编码的仅 SSE 代码路径的支持。

Cause

原因

I think the best explanation is in the Intel article, "Avoiding AVX-SSE Transition Penalties"(PDF). The abstract states:

我认为最好的解释是英特尔的文章“避免 AVX-SSE 转换惩罚”PDF)。摘要说:

Transitioning between 256-bit Intel? AVX instructions and legacy Intel? SSE instructions within a program may cause performance penalties because the hardware must save and restore the upper 128 bits of the YMM registers.

在 256 位 Intel 之间转换?AVX 指令和传统英特尔?程序中的 SSE 指令可能会导致性能下降,因为硬件必须保存和恢复 YMM 寄存器的高 128 位。

Separating your AVX and SSE code into different compilation units may NOT helpif you switch between calling code from both SSE-enabled and AVX-enabled object files, because the transition may occur when AVX instructions or assembly are mixed with any of (from the Intel paper):

如果您在从启用 SSE 和启用 AVX 的目标文件调用代码之间切换,将 AVX 和 SSE 代码分离到不同的编译单元可能无济于事,因为当 AVX 指令或程序集与任何(来自 Intel纸):

  • 128-bit intrinsic instructions
  • SSE inline assembly
  • C/C++ floating point code that is compiled to Intel? SSE
  • Calls to functions or libraries that include any of the above
  • 128 位内部指令
  • SSE 内联汇编
  • 编译为 Intel 的 C/C++ 浮点代码?上证所
  • 调用包含上述任何内容的函数或库

This means there may even be penalties when linking with external codeusing SSE.

这意味着在使用 SSE与外部代码链接时甚至可能会受到惩罚。

Details

细节

There are 3 processor states defined by the AVX instructions, and one of the states is where all of the YMMregisters are split, allowing the lower half to be used by SSE instructions. The Intel document "Intel? AVX State Transitions: Migrating SSE Code to AVX" provides a diagram of these states:

AVX 指令定义了 3 种处理器状态,其中一种状态是所有YMM寄存器都被拆分,允许SSE 指令使用下半部分。Intel 文档“ Intel? AVX State Transitions: Migrating SSE Code to AVX”提供了这些状态的图表:

enter image description here

enter image description here

When in state B (AVX-256 mode), all bits of the YMM registers are in use. When an SSE instruction is called, a transition to state C must occur, and this is where there is a penalty. The upper half of all YMM registers must be saved into an internal buffer before SSE can start, even if they happen to be zeros. The cost of the transitions is on the "order of 50-80 clock cycles on Sandy Bridge hardware". There is also a penalty going from C -> A, as diagrammed in Figure 2.

当处于状态 B(AVX-256 模式)时,YMM 寄存器的所有位都在使用中。当调用 SSE 指令时,必须发生到状态 C 的转换,这就是惩罚的地方。在 SSE 启动之前,所有 YMM 寄存器的上半部分必须保存到内部缓冲区中,即使它们碰巧为零。转换的成本是“Sandy Bridge 硬件上 50-80 个时钟周期的数量级”。还有一个惩罚是从 C -> A,如图 2 所示。

You can also find details about the state switching penalty causing this slowdown on page 130, Section 9.12, "Transitions between VEXand non-VEX modes" in Agner Fog's optimization guide(of version updated 2014-08-07), referenced in Mystical's answer. According to his guide, any transition to/from this state takes "about 70 clock cycles on Sandy Bridge". Just as the Intel document states, this is an avoidable transition penalty.

您还可以在Agner Fog 的优化指南2014 年 8 月 7 日更新的版本)中的第 130 页第 9.12 节“ VEX和非 VEX 模式之间的转换”中找到有关导致这种减速的状态切换惩罚的详细信息,在Mystical 的回答中引用. 根据他的指南,任何到/从这个状态的转换都需要“在 Sandy Bridge 上大约 70 个时钟周期”。正如英特尔文件所述,这是一种可以避免的过渡惩罚。

Resolution

解析度

To avoid the transition penalties you can either remove all legacy SSE code, instruct the compiler to convert all SSE instructions to their VEX encoded form of 128-bit instructions (if compiler is capable), or put the YMM registers in a known zero state before transitioning between AVX and SSE code. Essentially, to maintain the separate SSE code path, you must zero out the upper 128-bits of all 16 YMM registers (issuing a VZEROUPPERinstruction) after any code that uses AVX instructions. Zeroing these bits manually forces a transition to state A, and avoids the expensive penalty since the YMM values do not need to be stored in an internal buffer by hardware. The intrinsic that performs this instruction is _mm256_zeroupper. The description for this intrinsic is very informative:

为避免转换惩罚,您可以删除所有旧的 SSE 代码,指示编译器将所有 SSE 指令转换为其 VEX 编码的 128 位指令形式(如果编译器有能力),或者将 YMM 寄存器置于已知的零状态之前在 AVX 和 SSE 代码之间转换。本质上,为了维护单独的 SSE 代码路径,您必须在任何使用 AVX 指令的代码之后将所有 16 个 YMM 寄存器(发出VZEROUPPER指令)的高 128 位清零。手动将这些位清零会强制转换到状态 A,并避免昂贵的惩罚,因为 YMM 值不需要由硬件存储在内部缓冲区中。执行此指令的内在函数是_mm256_zeroupper。此内在函数的描述非常有用:

This intrinsic is useful to clear the upper bits of the YMM registers when transitioning between Intel? Advanced Vector Extensions (Intel? AVX) instructions and legacy Intel? Supplemental SIMD Extensions (Intel? SSE) instructions. There is no transition penalty if an application clears the upper bits of all YMM registers(sets to ‘0') via VZEROUPPER, the corresponding instruction for this intrinsic, before transitioning between Intel? Advanced Vector Extensions (Intel? AVX) instructions and legacy Intel? Supplemental SIMD Extensions (Intel? SSE) instructions.

在 Intel? 之间转换时,这个内在函数对于清除 YMM 寄存器的高位很有用 高级矢量扩展(英特尔?AVX)指令和传统英特尔?补充 SIMD 扩展(英特尔?SSE)说明。有没有如果应用程序清除所有ymm寄存器的高位过渡惩罚(套到“0”)通过VZEROUPPER,该固有的相应的指令,英特尔之间转变前?高级矢量扩展(英特尔?AVX)指令和传统英特尔?补充 SIMD 扩展(英特尔?SSE)说明。

In Visual Studio 2010+ (maybe even older), you get this intrinsicwith immintrin.h.

在 Visual Studio 2010+(可能更老)中,您可以通过 immintrin.h获得这个内在函数。

Note that zeroing out the bits with other methods does not eliminate the penalty - the VZEROUPPERor VZEROALLinstructions must be used.

请注意,使用其他方法将位清零并不能消除惩罚 -必须使用VZEROUPPERorVZEROALL指令。

One automatic solution implemented by the Intel Compiler is to insert a VZEROUPPERat the beginningof each function containing Intel AVX code if none of the arguments are a YMM register or __m256/__m256d/__m256idatatype, and at the endof functions if the returned value is not a YMM register or __m256/__m256d/__m256idatatype.

由英特尔编译器实现的一个自动解决方案是插入一个VZEROUPPER在开始时包含英特尔AVX代码中的每个函数的,如果没有的参数是一个YMM寄存器或__m256/ __m256d/__m256i数据类型,并且在端部的功能,如果返回值不是YMM寄存器或__m256/ __m256d/__m256i数据类型。

In the wild

在野外

This VZEROUPPERsolution is used by FFTW to generate a library with both SSE and AVX support. See simd-avx.h:

VZEROUPPERFFTW 使用此解决方案生成具有 SSE 和 AVX 支持的库。见simd-avx.h

/* Use VZEROUPPER to avoid the penalty of switching from AVX to SSE.
   See Intel Optimization Manual (April 2011, version 248966), Section
   11.3 */
#define VLEAVE _mm256_zeroupper

Then VLEAVE();is called at the end of everyfunction using intrinsics for AVX instructions.

然后VLEAVE();每个函数结束时使用 AVX 指令的内在函数调用。