How to use Fused Multiply-Add (FMA) instructions with SSE/AVX

Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/15933100/

Tags: c, sse, cpu-architecture, avx, fma

Question

I have learned that some Intel/AMD CPUs can do simultaneous multiply and add with SSE/AVX (see "FLOPS per cycle for Sandy Bridge and Haswell SSE2/AVX/AVX2").

I would like to know how best to do this in code, and I also want to know how it's done internally in the CPU, that is, with the superscalar architecture. Let's say I want to compute a long sum such as the following in SSE:

//sum = a1*b1 + a2*b2 + a3*b3 +... where a is a scalar and b is a SIMD vector (e.g. from matrix multiplication)
sum = _mm_set1_ps(0.0f);
a1  = _mm_set1_ps(a[0]); 
b1  = _mm_load_ps(&b[0]);
sum = _mm_add_ps(sum, _mm_mul_ps(a1, b1));

a2  = _mm_set1_ps(a[1]); 
b2  = _mm_load_ps(&b[4]);
sum = _mm_add_ps(sum, _mm_mul_ps(a2, b2));

a3  = _mm_set1_ps(a[2]); 
b3  = _mm_load_ps(&b[8]);
sum = _mm_add_ps(sum, _mm_mul_ps(a3, b3));
...

My question is: how does this get converted to simultaneous multiply and add? Can the data be dependent? I mean, can the CPU do _mm_add_ps(sum, _mm_mul_ps(a1, b1)) simultaneously, or do the registers used in the multiplication and the add have to be independent?

Lastly, how does this apply to FMA (with Haswell)? Is _mm_add_ps(sum, _mm_mul_ps(a1, b1)) automatically converted to a single FMA instruction or micro-operation?

Accepted answer by Mysticial

The compiler is allowed to fuse a separated add and multiply, even though this changes the final result (by making it more accurate).

An FMA has only one rounding (it effectively keeps infinite precision for the internal temporary multiply result), while an ADD + MUL has two.

The IEEE and C standards allow this when #pragma STDC FP_CONTRACT ON is in effect, and compilers are allowed to have it ON by default (but not all do). GCC contracts into FMA by default (with the default -std=gnu*, but not with -std=c*, e.g. -std=c++14). For Clang, it's only enabled with -ffp-contract=fast. (With just the #pragma enabled, contraction happens only within a single expression like a+b*c, not across separate C++ statements.)
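
As a sketch of how the pragma is written in source (the function names here are hypothetical, chosen for illustration): the pragma may appear at file scope or at the start of a compound statement. Whether a given compiler honors it is an assumption to verify against its documentation; GCC, for instance, is generally reported to ignore this pragma and to rely on -ffp-contract instead.

```c
/* Permit the compiler to contract a*b + c into one FMA inside this function.
   Permission only: the compiler may still emit a separate multiply and add. */
double mul_add_contract_on(double a, double b, double c) {
    #pragma STDC FP_CONTRACT ON
    return a * b + c;
}

/* Forbid contraction here: the product is rounded before the addition
   (on compilers that implement the pragma). */
double mul_add_contract_off(double a, double b, double c) {
    #pragma STDC FP_CONTRACT OFF
    return a * b + c;
}
```

Inspecting the generated assembly (for example with -S) is the reliable way to see whether a vfmadd instruction was actually produced.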

This is different from strict vs. relaxed floating point (or in GCC terms, -ffast-math vs. -fno-fast-math), which would allow other kinds of optimizations that could increase the rounding error depending on the input values. Contraction is special because of the infinite precision of the FMA internal temporary; if there were any rounding at all in the internal temporary, this wouldn't be allowed in strict FP.

Even if you enable relaxed floating-point, the compiler might still choose not to fuse since it might expect you to know what you're doing if you're already using intrinsics.

So the best way to make sure you actually get the FMA instructions you want is to use the provided intrinsics for them:

FMA3 Intrinsics: (AVX2 - Intel Haswell)

  • _mm_fmadd_pd(), _mm256_fmadd_pd()
  • _mm_fmadd_ps(), _mm256_fmadd_ps()
  • and about a gazillion other variations...
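
As a minimal sketch of how the sum from the question might look with an FMA3 intrinsic: the helper name scaled_sum4 and its parameters are hypothetical, and the #ifdef fallback is an addition so that the code still builds when FMA code generation is not enabled (for GCC/Clang, -mfma or a -march that implies it).

```c
#include <stddef.h>
#include <immintrin.h>

/* Hypothetical helper: out[0..3] = sum over k of a[k] * b[4*k .. 4*k+3].
   Uses unaligned loads; the question's _mm_load_ps would instead require
   b to be 16-byte aligned. */
void scaled_sum4(const float *a, const float *b, size_t n, float *out) {
    __m128 sum = _mm_setzero_ps();
    for (size_t k = 0; k < n; ++k) {
        __m128 ak = _mm_set1_ps(a[k]);       /* broadcast the scalar a[k] */
        __m128 bk = _mm_loadu_ps(&b[4 * k]); /* load four floats of b */
#ifdef __FMA__
        sum = _mm_fmadd_ps(ak, bk, sum);     /* sum = ak*bk + sum, one rounding */
#else
        sum = _mm_add_ps(sum, _mm_mul_ps(ak, bk)); /* separate mul and add */
#endif
    }
    _mm_storeu_ps(out, sum);
}
```

Note that each iteration depends on the previous value of sum, so a single accumulator is limited by FMA latency rather than throughput; using several independent accumulators is a common way around that, though it is not shown here.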

FMA4 Intrinsics: (XOP - AMD Bulldozer)

  • _mm_macc_pd(), _mm256_macc_pd()
  • _mm_macc_ps(), _mm256_macc_ps()
  • and about a gazillion other variations...

Answer by Z boson

I tested the following code in GCC 5.3, Clang 3.7, ICC 13.0.1 and MSVC 2015 (compiler version 19.00).

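#include <immintrin.h>  // AVX intrinsics (__m256, _mm256_add_ps, _mm256_mul_ps)

// Scalar multiply-add: a candidate for contraction into vfmadd213ss.
float mul_add(float a, float b, float c) {
    return a*b + c;
}

// Vector multiply-add written as separate intrinsics: a candidate for vfmadd213ps.
__m256 mul_addv(__m256 a, __m256 b, __m256 c) {
    return _mm256_add_ps(_mm256_mul_ps(a, b), c);
}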

With the right compiler options (see below), every compiler will generate a vfmadd instruction (e.g. vfmadd213ss) from mul_add. However, MSVC alone fails to contract mul_addv to a single vfmadd instruction (e.g. vfmadd213ps).

The following compiler options are sufficient to generate vfmadd instructions (except for mul_addv with MSVC).

GCC:   -O2 -mavx2 -mfma
Clang: -O1 -mavx2 -mfma -ffp-contract=fast
ICC:   -O1 -march=core-avx2
MSVC:  /O1 /arch:AVX2 /fp:fast

GCC 4.9 will not contract mul_addv to a single FMA instruction, but since at least GCC 5.1 it does. I don't know when the other compilers started doing this.
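
To check from within the code whether options like the ones above took effect, a compile-time test on predefined macros is one possibility. This is a sketch under stated assumptions: GCC and Clang define __FMA__ when -mfma (or a -march that implies it) is in effect; MSVC does not define that macro, so __AVX2__ (set by /arch:AVX2) is used here as a proxy, which is an assumption rather than a guarantee. The function name compiled_with_fma is hypothetical.

```c
/* Returns 1 if this translation unit was compiled with FMA code generation
   enabled (as far as the predefined macros can tell), 0 otherwise. */
int compiled_with_fma(void) {
#if defined(__FMA__)
    return 1;   /* GCC/Clang: set by -mfma or a -march that implies it */
#elif defined(_MSC_VER) && defined(__AVX2__)
    return 1;   /* assumption: MSVC /arch:AVX2 treated as a proxy for FMA */
#else
    return 0;
#endif
}
```

This only reports what the compiler was allowed to emit; it does not tell you whether the CPU running the program supports FMA, nor whether a particular expression was actually contracted.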
