Using SSE instructions in C++
Note: this page is a translation of a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not the translator): Stack Overflow.
Original question: http://stackoverflow.com/questions/586609/
Using SSE instructions
Asked by Naveen
I have a loop written in C++ which is executed for each element of a big integer array. Inside the loop, I mask some bits of the integer and then find the min and max values. I heard that if I use SSE instructions for these operations it will run much faster compared to a normal loop written using bitwise AND , and if-else conditions. My question is should I go for these SSE instructions? Also, what happens if my code runs on a different processor? Will it still work or these instructions are processor specific?
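(The question does not include code; a scalar loop of the kind described might look roughly like the sketch below, with the names values, n, and mask standing in for whatever the real ones are.)

#include <climits>

// Hypothetical reconstruction of the loop being asked about: mask some bits,
// then track the minimum and maximum with if-style comparisons.
void scan_min_max(const int *values, int n, int mask, int &lo, int &hi)
{
    lo = INT_MAX;
    hi = INT_MIN;
    for (int i = 0; i < n; ++i)
    {
        int v = values[i] & mask;   // bitwise AND to mask some bits
        if (v < lo) lo = v;         // running minimum
        if (v > hi) hi = v;         // running maximum
    }
}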
Accepted answer by Niki
- SSE instructions are processor specific. You can look up which processor supports which SSE version on Wikipedia (a runtime check is sketched after this list).
- Whether SSE code will be faster depends on many factors: the first is of course whether the problem is memory-bound or CPU-bound. If the memory bus is the bottleneck, SSE will not help much. Try simplifying your integer calculations; if that makes the code faster, it's probably CPU-bound, and you have a good chance of speeding it up.
- Be aware that writing SIMD code is a lot harder than writing C++ code, and that the resulting code is much harder to change. Always keep the C++ code up to date; you'll want it as a comment and to check the correctness of your assembler code.
- Think about using a library like Intel's IPP, which implements common low-level SIMD operations optimized for various processors.
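As a concrete sketch of the first point (this uses GCC/Clang's __builtin_cpu_supports, which is not part of the original answer; MSVC has __cpuid in <intrin.h> instead), you can check for SSE support at run time and fall back to plain C++ when it is missing:

#include <cstdio>

int main()
{
    // Available in GCC 4.8+ and Clang; choose the SIMD or scalar code path at run time.
    if (__builtin_cpu_supports("sse2"))
        std::puts("SSE2 available: take the SIMD path");
    else
        std::puts("no SSE2: fall back to the plain C++ loop");
}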
Answered by Skizz
SIMD, of which SSE is an example, allows you to do the same operation on multiple chunks of data. So you won't get any advantage from using SSE as a straight replacement for the integer operations; you will only see gains if you can do the operations on multiple data items at once. This involves loading some data values that are contiguous in memory, doing the required processing, and then stepping to the next set of values in the array.
Problems:
1. If the code path is dependent on the data being processed, SIMD becomes much harder to implement. For example:
a = array [index];
a &= mask;
a >>= shift;
if (a < somevalue)
{
    a += 2;
    array [index] = a;
}
++index;
is not easy to do as SIMD:
a1 = array [index]        a2 = array [index+1]      a3 = array [index+2]      a4 = array [index+3]
a1 &= mask                a2 &= mask                a3 &= mask                a4 &= mask
a1 >>= shift              a2 >>= shift              a3 >>= shift              a4 >>= shift
if (a1<somevalue)         if (a2<somevalue)         if (a3<somevalue)         if (a4<somevalue)
    // help! can't conditionally perform this on each column, all columns must do the same thing
index += 4
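With SSE2 intrinsics the branch can be turned into a per-lane mask and a blend, so all four columns really do execute the same instructions. A minimal sketch (the function and parameter names are made up; it assumes the array length is a multiple of 4 and the masked values are non-negative, so a logical shift matches the scalar >>):

#include <emmintrin.h>   // SSE2 intrinsics

void masked_conditional_add(int *array, int n, int mask, int shift, int somevalue)
{
    const __m128i vmask  = _mm_set1_epi32(mask);
    const __m128i vsome  = _mm_set1_epi32(somevalue);
    const __m128i vtwo   = _mm_set1_epi32(2);
    const __m128i vcount = _mm_cvtsi32_si128(shift);

    for (int index = 0; index < n; index += 4)
    {
        __m128i orig = _mm_loadu_si128((const __m128i *)(array + index));
        __m128i a    = _mm_srl_epi32(_mm_and_si128(orig, vmask), vcount);

        // Lanes where (a < somevalue) become all ones, the rest all zeros.
        __m128i lt      = _mm_cmplt_epi32(a, vsome);
        __m128i updated = _mm_add_epi32(a, vtwo);

        // Blend: take 'updated' in the lanes that passed the test, keep the
        // original array value everywhere else, matching the scalar code.
        __m128i result = _mm_or_si128(_mm_and_si128(lt, updated),
                                      _mm_andnot_si128(lt, orig));
        _mm_storeu_si128((__m128i *)(array + index), result);
    }
}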
2. If the data is not contiguous, then loading the data into the SIMD registers is cumbersome.
3. The code is processor specific. SSE is only on IA32 (Intel/AMD), and not all IA32 CPUs support SSE.
You need to analyse the algorithm and the data to see if it can be SSE'd and that requires knowing how SSE works. There's plenty of documentation on Intel's website.
Answered by Peter Jeffery
This kind of problem is a perfect example of where a good low-level profiler (something like VTune) is essential. It can give you a much more informed idea of where your hotspots lie.
My guess, from what you describe, is that your hotspot will probably be branch prediction failures resulting from the min/max calculations using if/else. Using SIMD intrinsics should allow you to use the min/max instructions; however, it might be worth first trying a branchless min/max calculation instead. That might achieve most of the gains with less pain.
Something like this:
inline int minimum(int a, int b)
{
    int mask = (a - b) >> 31;           // all ones when a < b (arithmetic shift of a negative value)
    return ((a & mask) | (b & ~mask));  // selects a when a < b, b otherwise
}
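As a follow-up (not part of the original answer): the same select-by-mask idea extends to whole SSE registers. SSE2 has no 32-bit integer min instruction (that arrived with SSE4.1's _mm_min_epi32), but a compare-and-blend handles four lanes at a time. Note that the scalar trick above also assumes a - b does not overflow.

#include <emmintrin.h>

// Per-lane minimum of four signed 32-bit integers, SSE2 only.
static inline __m128i minimum4(__m128i a, __m128i b)
{
    __m128i lt = _mm_cmplt_epi32(a, b);            // all ones where a < b
    return _mm_or_si128(_mm_and_si128(lt, a),      // take a in those lanes
                        _mm_andnot_si128(lt, b));  // take b everywhere else
}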
Answered by jalf
If you use SSE instructions, you're obviously limited to processors that support these. That means x86, dating back to the Pentium 2 or so (can't remember exactly when they were introduced, but it's a long time ago)
SSE2, which, as far as I can recall, is the one that offers integer operations, is somewhat more recent (Pentium 3? Although the first AMD Athlon processors didn't support them)
In any case, you have two options for using these instructions. Either write the entire block of code in assembly (probably a bad idea: it makes it virtually impossible for the compiler to optimize your code, and it's very hard for a human to write efficient assembler).
Alternatively, use the intrinsics available with your compiler (if memory serves, they're usually defined in xmmintrin.h)
But again, the performance may not improve. SSE code poses additional requirements on the data it processes. Mainly, the one to keep in mind is that data must be aligned on 128-bit boundaries. There should also be few or no dependencies between the values loaded into the same register (a 128-bit SSE register can hold 4 ints; adding the first and the second one together is not optimal, but adding all four ints to the corresponding 4 ints in another register will be fast).
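For illustration (a sketch, not from the original answer): _mm_load_si128 requires a 16-byte-aligned address and will fault on anything else, while _mm_loadu_si128 accepts any address, historically at a speed penalty.

#include <emmintrin.h>

// alignas is C++11; older compilers use __attribute__((aligned(16))) or __declspec(align(16))
alignas(16) static int aligned_data[4] = { 1, 2, 3, 4 };

__m128i load_both(const int *unaligned_ptr)
{
    __m128i a = _mm_load_si128((const __m128i *)aligned_data);    // aligned load
    __m128i b = _mm_loadu_si128((const __m128i *)unaligned_ptr);  // unaligned load
    return _mm_add_epi32(a, b);                                   // one instruction adds all four ints
}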
It may be tempting to use a library that wraps all the low-level SSE fiddling, but that might also ruin any potential performance benefit.
I don't know how good SSE's integer operation support is, so that may also be a factor that can limit performance. SSE is mainly targeted at speeding up floating point operations.
Answered by Migol
If you intend to use Microsoft Visual C++, you should read this:
Answered by Quonux
I can tell from my experience that SSE brings a huge (4x and up) speedup over a plain C version of the code (no inline asm, no intrinsics used), but hand-optimized assembler can beat compiler-generated assembly if the compiler can't figure out what the programmer intended (believe me, compilers don't cover all possible code combinations, and they never will). Also, the compiler can't always lay out the data so that it runs at the fastest possible speed. But you need a lot of experience to beat an Intel compiler (if that is possible at all).
Answered by Dani van der Meer
We have implemented some image processing code, similar to what you describe but on a byte array, in SSE. The speedup compared to C code is considerable: depending on the exact algorithm, more than a factor of 4, even against the Intel compiler. However, as you already mentioned, you have the following drawbacks:
- Portability. The code will run on every Intel-like CPU, including AMD, but not on other CPUs. That is not a problem for us because we control the target hardware. Switching compilers, or even moving to a 64-bit OS, can also be a problem.
- There is a steep learning curve, but I found that after you grasp the principles, writing new algorithms is not that hard.
- Maintainability. Most C or C++ programmers have no knowledge of assembly/SSE.
My advice would be to go for it only if you really need the performance improvement, you can't find a function for your problem in a library like Intel's IPP, and you can live with the portability issues.
Answered by Mike
SSE instructions were originally just on Intel chips, but more recently (since the Athlon?) AMD supports them as well, so if you code against the SSE instruction set, you should be portable to most x86 processors.
That being said, it may not be worth your time to learn SSE coding unless you're already familiar with assembler on x86's - an easier option might be to check your compiler docs and see if there are options to allow the compiler to autogenerate SSE code for you. Some compilers do very well vectorizing loops in this way. (You're probably not surprised to hear that the Intel compilers do a good job of this :)
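As a rough example of that route (the flags shown are GCC's; other compilers have their own equivalents), a simple branch-free loop over contiguous data gives the auto-vectorizer its best chance:

// Build with something like:  g++ -O3 -msse2 masks.cpp
// (-O3 enables GCC's auto-vectorizer; -fopt-info-vec reports which loops were vectorized)
void mask_and_shift(int *a, int n, int mask, int shift)
{
    for (int i = 0; i < n; ++i)
        a[i] = (a[i] & mask) >> shift;   // no branches, contiguous data: easy to vectorize
}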
Answered by LiraNuna
Write code that helps the compiler understand what you are doing. GCC will understand and optimize SSE code such as this:
union Vector4f
{
    // Easy constructor, defaulted to black/0 vector
    Vector4f(float a = 0, float b = 0, float c = 0, float d = 1.0f):
        X(a), Y(b), Z(c), W(d) { }

    // Cast operator, for []
    inline operator float* ()
    {
        return (float*)this;
    }

    // Const cast operator, for const []
    inline operator const float* () const
    {
        return (const float*)this;
    }

    // ---------------------------------------- //

    inline Vector4f operator += (const Vector4f &v)
    {
        for(int i=0; i<4; ++i)
            (*this)[i] += v[i];
        return *this;
    }

    inline Vector4f operator += (float t)
    {
        for(int i=0; i<4; ++i)
            (*this)[i] += t;
        return *this;
    }

    // Vertex / Vector
    // Lower case xyzw components
    struct {
        float x, y, z;
        float w;
    };

    // Upper case XYZW components
    struct {
        float X, Y, Z;
        float W;
    };
};
Just don't forget to have -msse -msse2 in your build parameters!
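A small usage sketch (assuming the union above): with optimization and those flags enabled, GCC can turn the four-iteration loops in operator += into single packed SSE additions (addps).

void advance(Vector4f &position, const Vector4f &velocity)
{
    position += velocity;   // the element-wise loop can compile down to one addps
    position += 0.5f;       // scalar overload: broadcast the constant, then one packed add
}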
Answered by Ogan Ocali
I agree with the previous posters. The benefits can be quite large, but getting them may require a lot of work. Intel's documentation on these instructions runs to over 4K pages. You may want to check out EasySSE (a C++ wrapper library over the intrinsics, plus examples), free from Ocali Inc.
I assume my affiliation with EasySSE is clear.