Original question: http://stackoverflow.com/questions/4120681/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How to Calculate single-vector Dot Product using SSE intrinsic functions in C
Asked by Sam
I am trying to multiply two vectors together, where each element of one vector is multiplied by the element at the same index in the other vector. I then want to sum all the elements of the resulting vector to obtain one number. For instance, the calculation would look like this for the vectors {1,2,3,4} and {5,6,7,8}:
1*5 + 2*6 + 3*7 + 4*8
Essentially, I am taking the dot product of the two vectors. I know there is an SSE instruction to do this, but the instruction doesn't have an intrinsic function associated with it. At this point, I don't want to write inline assembly in my C code, so I want to use only intrinsic functions. This seems like a common calculation, so I am surprised that I couldn't find the answer on Google.
Note: I am optimizing for a specific microarchitecture which supports up to SSE 4.2.
Answered by caf
If you're doing a dot product of longer vectors, use multiply and regular _mm_add_ps (or FMA) inside the inner loop. Save the horizontal sum until the end.
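As a rough sketch of that advice, assuming plain float arrays of arbitrary length and alignment: the function name dot_product_sse and its signature are made up for illustration and do not come from the answer. The loop does only vertical multiplies and adds, and the horizontal sum runs once after it.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE1: __m128, _mm_mul_ps, _mm_add_ps, ... */

/* Hypothetical helper (name and signature are assumptions): dot product of
   two float arrays of length n. Unaligned loads are used, so no particular
   alignment is assumed. */
float dot_product_sse(const float *a, const float *b, size_t n)
{
    __m128 acc = _mm_setzero_ps();  /* four running partial sums */
    size_t i = 0;

    /* Inner loop: vertical multiply and add only, no horizontal work. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }

    /* One horizontal sum after the loop, using SSE1-only shuffles. */
    __m128 shuf = _mm_shuffle_ps(acc, acc, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 sums = _mm_add_ps(acc, shuf);
    shuf = _mm_movehl_ps(shuf, sums);
    sums = _mm_add_ss(sums, shuf);
    float result = _mm_cvtss_f32(sums);

    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; ++i)
        result += a[i] * b[i];

    return result;
}
```

Note that the partial sums are added in a different order than in a simple scalar loop, so results for non-integer data can differ slightly in the last bits.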
But if you are doing a dot product of just a single pair of SIMD vectors:
GCC (at least version 4.3) includes <smmintrin.h> with SSE4.1-level intrinsics, including the single- and double-precision dot products:
__m128 _mm_dp_ps (__m128 __X, __m128 __Y, const int __M);
__m128d _mm_dp_pd (__m128d __X, __m128d __Y, const int __M);
On Intel mainstream CPUs (not Atom/Silvermont) these are somewhat faster than doing it manually with multiple instructions.
But on AMD (including Ryzen), dpps is significantly slower. (See Agner Fog's instruction tables.)
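A minimal sketch of calling _mm_dp_ps for one pair of 4-float vectors follows. The wrapper name dot4_dpps is an assumption for illustration, and the target attribute is a GCC/Clang extension used here so the function builds without extra flags; compiling with -msse4.1 and dropping the attribute should work as well.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_dp_ps */

/* Hypothetical wrapper (name is an assumption). The target attribute is a
   GCC/Clang extension; alternatively remove it and compile with -msse4.1. */
__attribute__((target("sse4.1")))
float dot4_dpps(const float *a, const float *b)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    /* Mask 0xF1: the high nibble 0xF includes all four products in the sum,
       the low nibble 0x1 writes the sum to element 0 only (the other
       elements are set to zero). */
    __m128 dp = _mm_dp_ps(va, vb, 0xF1);
    return _mm_cvtss_f32(dp);
}
```

The mask must be a compile-time constant. A mask such as 0x71 would leave the fourth element out of the sum, which is handy for 3-component vectors stored in a 4-float register.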
As a fallback for older processors, you can use this algorithm to create the dot product of the vectors a and b:
__m128 r1 = _mm_mul_ps(a, b);
and then horizontally sum r1 using the method from "Fastest way to do horizontal float vector sum on x86" (see there for a commented version of this, and an explanation of why it's faster):
__m128 shuf = _mm_shuffle_ps(r1, r1, _MM_SHUFFLE(2, 3, 0, 1));
__m128 sums = _mm_add_ps(r1, shuf);
shuf = _mm_movehl_ps(shuf, sums);
sums = _mm_add_ss(sums, shuf);
float result = _mm_cvtss_f32(sums);
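Wrapped into one function, the multiply and the shuffle-based sum might look like the sketch below. The name dot4_sse1 is an assumption for illustration, and the comments trace the four products p0..p3 through each step.

```c
#include <xmmintrin.h>  /* SSE1 only */

/* Hypothetical helper (name is an assumption): dot product of one pair of
   4-float vectors. Element order in the comments is [0, 1, 2, 3]. */
float dot4_sse1(__m128 a, __m128 b)
{
    __m128 r1 = _mm_mul_ps(a, b);      /* [p0, p1, p2, p3] */
    __m128 shuf = _mm_shuffle_ps(r1, r1, _MM_SHUFFLE(2, 3, 0, 1));
                                       /* [p1, p0, p3, p2] */
    __m128 sums = _mm_add_ps(r1, shuf);/* [p0+p1, p0+p1, p2+p3, p2+p3] */
    shuf = _mm_movehl_ps(shuf, sums);  /* element 0 = p2+p3 */
    sums = _mm_add_ss(sums, shuf);     /* element 0 = p0+p1+p2+p3 */
    return _mm_cvtss_f32(sums);
}
```

For the vectors from the question, dot4_sse1(_mm_setr_ps(1, 2, 3, 4), _mm_setr_ps(5, 6, 7, 8)) computes 5 + 12 + 21 + 32 = 70.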
A slower alternative uses hadd, which costs 2 shuffles per hadd and will easily bottleneck on shuffle throughput, especially on Intel CPUs:
__m128 r2 = _mm_hadd_ps(r1, r1); /* _mm_hadd_ps needs SSE3, <pmmintrin.h> */
__m128 r3 = _mm_hadd_ps(r2, r2);
float result;
_mm_store_ss(&result, r3);
Answered by Royi
I'd say the fastest SSE method would be:
static inline float CalcDotProductSse(__m128 x, __m128 y) {
    __m128 mulRes, shufReg, sumsReg;
    mulRes = _mm_mul_ps(x, y);
    // Calculates the sum of SSE Register - https://stackoverflow.com/a/35270026/195787
    shufReg = _mm_movehdup_ps(mulRes); // Broadcast elements 3,1 to 2,0
    sumsReg = _mm_add_ps(mulRes, shufReg);
    shufReg = _mm_movehl_ps(shufReg, sumsReg); // High Half -> Low Half
    sumsReg = _mm_add_ss(sumsReg, shufReg);
    return _mm_cvtss_f32(sumsReg); // Result in the lower part of the SSE Register
}
I followed the answer to "Fastest way to do horizontal float vector sum on x86".
Answered by Ben Hymanson
I wrote this and compiled it with gcc -O3 -S -ftree-vectorize -ftree-vectorizer-verbose=2 sse.c:
void f(int * __restrict__ a, int * __restrict__ b, int * __restrict__ c, int * __restrict__ d,
       int * __restrict__ e, int * __restrict__ f, int * __restrict__ g, int * __restrict__ h,
       int * __restrict__ o)
{
    int i;
    for (i = 0; i < 8; ++i)
        o[i] = a[i]*e[i] + b[i]*f[i] + c[i]*g[i] + d[i]*h[i];
}
And GCC 4.3.0 auto-vectorized it:
sse.c:5: note: LOOP VECTORIZED.
sse.c:2: note: vectorized 1 loops in function.
However, it would only do that if I used a loop with enough iterations; otherwise the verbose output would report that vectorization was unprofitable or the loop was too small. Without the __restrict__ keywords it has to generate separate, non-vectorized versions to deal with cases where the output o may point into one of the inputs.
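To illustrate the aliasing point, here is a smaller sketch in the same spirit, using the standard C99 spelling restrict instead of GCC's __restrict__. The function name mul_add_pairs, its signature, and the runtime length n are assumptions for illustration. Whether a given compiler actually vectorizes it depends on its version and flags.

```c
#include <stddef.h>

/* Hypothetical function (name and signature are assumptions): element-wise
   products of two pairs of int arrays, summed into o. */
void mul_add_pairs(const int *restrict a, const int *restrict b,
                   const int *restrict c, const int *restrict d,
                   int *restrict o, size_t n)
{
    /* restrict promises that o does not overlap any of the inputs, so a
       store to o[i] cannot change a later a[i+1], b[i+1], and so on. That
       lets the compiler process several elements per iteration without
       emitting a separate path for the overlapping case. */
    for (size_t i = 0; i < n; ++i)
        o[i] = a[i] * b[i] + c[i] * d[i];
}
```

Calling this with an output buffer that overlaps one of the inputs breaks the promise made by restrict and is undefined behavior.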
I would paste the instructions as an example, but since part of the vectorization unrolled the loop it's not very readable.

