Disclaimer: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/4120681/

Time: 2020-09-02 06:56:23  Source: igfitidea

How to Calculate single-vector Dot Product using SSE intrinsic functions in C

c optimization vectorization sse simd

Asked by Sam

I am trying to multiply two vectors together where each element of one vector is multiplied by the element in the same index at the other vector. I then want to sum all the elements of the resulting vector to obtain one number. For instance, the calculation would look like this for the vectors {1,2,3,4} and {5,6,7,8}:

1*5 + 2*6 + 3*7 + 4*8

Essentially, I am taking the dot product of the two vectors. I know there is an SSE instruction to do this, but the instruction doesn't have an intrinsic function associated with it. At this point, I don't want to write inline assembly in my C code, so I want to use only intrinsic functions. This seems like a common calculation, so I am surprised that I couldn't find the answer on Google.

Note: I am optimizing for a specific micro architecture which supports up to SSE 4.2.

Answered by caf

If you're doing a dot product of longer vectors, use multiply and regular _mm_add_ps (or FMA) inside the inner loop. Save the horizontal sum until the end.
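
As a rough illustration of that advice, here is a minimal sketch for two float arrays. The function name dot_sse, the unaligned loads, and the scalar tail loop are my own assumptions, not part of the answer. It keeps four partial sums in one register and does a single horizontal sum after the loop:

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_mul_ps, _mm_add_ps, ... */

/* Hypothetical helper: dot product of two float arrays of length n. */
static float dot_sse(const float *a, const float *b, size_t n)
{
    __m128 acc = _mm_setzero_ps();          /* four running partial sums */
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);    /* unaligned loads (assumption) */
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));  /* vertical multiply-add */
    }

    /* One horizontal sum after the loop, using SSE1-only shuffles. */
    __m128 shuf = _mm_shuffle_ps(acc, acc, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 sums = _mm_add_ps(acc, shuf);    /* lane 0: a0+a1, lane 2: a2+a3 */
    shuf = _mm_movehl_ps(shuf, sums);       /* move lane 2 down to lane 0 */
    sums = _mm_add_ss(sums, shuf);          /* lane 0: a0+a1+a2+a3 */
    float result = _mm_cvtss_f32(sums);

    for (; i < n; ++i)                      /* scalar tail for n % 4 leftovers */
        result += a[i] * b[i];
    return result;
}
```

Note that the partial sums are added in a different order than a plain scalar loop would add them, so floating-point results can differ slightly in the last bits for general inputs.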

But if you are doing a dot product of just a single pair of SIMD vectors:

但是,如果您只计算一对 SIMD 向量的点积:

GCC (at least version 4.3) includes <smmintrin.h> with SSE4.1-level intrinsics, including the single- and double-precision dot products:

__m128  _mm_dp_ps (__m128 __X, __m128 __Y, const int __M);
__m128d _mm_dp_pd (__m128d __X, __m128d __Y, const int __M);

On Intel mainstream CPUs (not Atom/Silvermont) these are somewhat faster than doing it manually with multiple instructions.

But on AMD (including Ryzen), dpps is significantly slower. (See Agner Fog's instruction tables.)
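
For the question's four-element case, a call to _mm_dp_ps might look like the sketch below. The wrapper name dot4_dpps and the unaligned loads are assumptions of mine. In the mask 0xF1, the high four bits select which lanes are multiplied (all four here) and the low four bits select which output lanes receive the sum (lane 0 only):

```c
#include <smmintrin.h>  /* SSE4.1: _mm_dp_ps */

/* Hypothetical wrapper. The target attribute is a GCC/Clang extension that
   enables SSE4.1 for this one function; compiling the whole file with
   -msse4.1 should work as well. */
__attribute__((target("sse4.1")))
static float dot4_dpps(const float *a, const float *b)
{
    __m128 va = _mm_loadu_ps(a);            /* a[0..3] */
    __m128 vb = _mm_loadu_ps(b);            /* b[0..3] */
    __m128 dp = _mm_dp_ps(va, vb, 0xF1);    /* sum of 4 products in lane 0 */
    return _mm_cvtss_f32(dp);               /* extract lane 0 */
}
```

With the vectors from the question, {1,2,3,4} and {5,6,7,8}, this returns 70. The code needs a CPU that actually supports SSE4.1 at run time.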

As a fallback for older processors, you can use this algorithm to compute the dot product of the vectors a and b:

__m128 r1 = _mm_mul_ps(a, b);

and then horizontally sum r1 using the approach from "Fastest way to do horizontal float vector sum on x86" (see there for a commented version of this, and why it's faster):

__m128 shuf   = _mm_shuffle_ps(r1, r1, _MM_SHUFFLE(2, 3, 0, 1));
__m128 sums   = _mm_add_ps(r1, shuf);
shuf          = _mm_movehl_ps(shuf, sums);
sums          = _mm_add_ss(sums, shuf);
float result =  _mm_cvtss_f32(sums);

A slow alternative costs 2 shuffles per hadd, which will easily bottleneck on shuffle throughput, especially on Intel CPUs.

__m128 r2 = _mm_hadd_ps(r1, r1);   // SSE3, needs <pmmintrin.h>
__m128 r3 = _mm_hadd_ps(r2, r2);   // every lane now holds the full sum
float result;
_mm_store_ss(&result, r3);

Answered by Royi

I'd say the fastest SSE method would be:

static inline float CalcDotProductSse(__m128 x, __m128 y) {
    __m128 mulRes, shufReg, sumsReg;
    mulRes = _mm_mul_ps(x, y);

    // Calculates the sum of SSE Register - https://stackoverflow.com/a/35270026/195787
    shufReg = _mm_movehdup_ps(mulRes);        // Broadcast elements 3,1 to 2,0
    sumsReg = _mm_add_ps(mulRes, shufReg);
    shufReg = _mm_movehl_ps(shufReg, sumsReg); // High Half -> Low Half
    sumsReg = _mm_add_ss(sumsReg, shufReg);
    return  _mm_cvtss_f32(sumsReg); // Result in the lower part of the SSE Register
}

I followed "Fastest Way to Do Horizontal Float Vector Sum On x86".

Answered by Ben Hymanson

I wrote this and compiled it with gcc -O3 -S -ftree-vectorize -ftree-vectorizer-verbose=2 sse.c

void f(int * __restrict__ a, int * __restrict__ b, int * __restrict__ c, int * __restrict__ d,
       int * __restrict__ e, int * __restrict__ f, int * __restrict__ g, int * __restrict__ h,
       int * __restrict__ o)
{
    int i;

    for (i = 0; i < 8; ++i)
        o[i] = a[i]*e[i] + b[i]*f[i] + c[i]*g[i] + d[i]*h[i];
}

And GCC 4.3.0 auto-vectorized it:

sse.c:5: note: LOOP VECTORIZED.
sse.c:2: note: vectorized 1 loops in function.

However, it would only do that if I used a loop with enough iterations; otherwise the verbose output would report that vectorization was unprofitable or that the loop was too small. Without the __restrict__ keywords, it has to generate separate, non-vectorized versions to deal with cases where the output o may point into one of the inputs.
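
To make the aliasing point concrete, here is a small sketch with two otherwise identical loops. The function names are hypothetical, and whether a given compiler vectorizes either one depends on its version and flags. As far as I know, newer GCC releases report this through -fopt-info-vec rather than -ftree-vectorizer-verbose.

```c
#include <stddef.h>

/* Without __restrict__: o may overlap a or b, so the compiler has to allow
   for a store to o[i] changing a value that is read later. */
void mul_may_alias(const int *a, const int *b, int *o, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        o[i] = a[i] * b[i];
}

/* With __restrict__ (GCC/Clang spelling, as used in the answer): the caller
   promises that o does not overlap a or b, so no overlap handling is needed. */
void mul_no_alias(const int *__restrict__ a, const int *__restrict__ b,
                  int *__restrict__ o, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        o[i] = a[i] * b[i];
}
```

Both functions give the same results when the arrays do not overlap. Calling mul_no_alias with overlapping arrays breaks the promise and is undefined behavior.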

I would paste the instructions as an example, but since part of the vectorization unrolled the loop it's not very readable.

Answered by DennyRolling

There is an article by Intel here which touches on dot-product implementations.
