C++: What does vectorization mean?
Original URL: http://stackoverflow.com/questions/1516622/
Notice: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must comply with the same license, cite the original URL and authors, and attribute the content to the original authors (not me): Stack Overflow
Asked by vehomzzz
Is it a good idea to vectorize the code? What are good practices in terms of when to do it? What happens underneath?
Answered by Zed
Vectorization means that the compiler detects that your independent instructions can be executed as one SIMD instruction. The usual example is a loop like this:
for(i=0; i<N; i++){
    a[i] = a[i] + b[i];
}
It will be vectorized as (using vector notation)
for (i=0; i<(N-N%VF); i+=VF){
    a[i:i+VF] = a[i:i+VF] + b[i:i+VF];
}
Basically the compiler picks one operation that can be done on VF elements of the array at the same time and does this N/VF times instead of doing the single operation N times.
It increases performance, but it places more requirements on the architecture: the hardware must provide SIMD instructions.
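The vector-notation loop above stops at N - N%VF, so up to VF - 1 elements are left over. A minimal C++ sketch of the full shape, written in scalar code, might look like the following. The name add_in_blocks and the choice VF = 4 are assumptions for illustration only; the inner loop merely stands in for the single SIMD instruction a compiler would emit.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical vectorization factor: how many elements one SIMD
// instruction would process. 4 matches four 32-bit floats in a
// 128-bit register, but the real value depends on the target.
constexpr std::size_t VF = 4;

// Adds b into a element-wise, in the strip-mined shape a vectorizer
// produces: a main loop over full blocks of VF elements, then a
// scalar loop for the leftover n % VF elements.
void add_in_blocks(std::vector<float>& a, const std::vector<float>& b) {
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    const std::size_t main_end = n - n % VF;

    for (std::size_t i = 0; i < main_end; i += VF) {
        // Stands in for one SIMD add over a[i:i+VF].
        for (std::size_t lane = 0; lane < VF; ++lane) {
            a[i + lane] += b[i + lane];
        }
    }
    // Remainder loop for the elements that do not fill a whole block.
    for (std::size_t i = main_end; i < n; ++i) {
        a[i] += b[i];
    }
}
```

With seven elements and VF = 4, the main loop handles indices 0 to 3 in one block and the remainder loop handles indices 4 to 6 one at a time.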
Answered by Gautham Ganapathy
As mentioned above, vectorization is used to make use of SIMD instructions, which can perform identical operations on different data packed into large registers.
A generic guideline for enabling a compiler to autovectorize a loop is to ensure that there are no flow dependencies or anti-dependencies between data elements in different iterations of the loop.
http://en.wikipedia.org/wiki/Data_dependency
Some compilers, such as the Intel C++/Fortran compilers, are capable of autovectorizing code. When it is not able to vectorize a loop, the Intel compiler can report why. These reports can be used to modify the code so that it becomes vectorizable (assuming that is possible).
Dependencies are covered in depth in the book 'Optimizing Compilers for Modern Architectures: A Dependence-based Approach'
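To make the dependency guideline concrete, here is a small C++ sketch contrasting a loop with no cross-iteration dependency against one with a flow (read-after-write) dependency. The function names are made up for this illustration, and both functions assume that b has at least as many elements as a.

```cpp
#include <cstddef>
#include <vector>

// No loop-carried dependency: iteration i touches only a[i] and b[i],
// so the iterations could run in any order or several at a time.
void independent_add(std::vector<int>& a, const std::vector<int>& b) {
    for (std::size_t i = 0; i < a.size(); ++i) {
        a[i] = a[i] + b[i];
    }
}

// Flow dependency: iteration i reads a[i - 1], which iteration i - 1
// has just written. Computing several iterations at once would read
// stale values, so this loop cannot be vectorized as written.
void running_sum(std::vector<int>& a, const std::vector<int>& b) {
    for (std::size_t i = 1; i < a.size(); ++i) {
        a[i] = a[i - 1] + b[i];
    }
}
```

The first loop is the kind of candidate an autovectorizer looks for. The second would need to be restructured (for example as a dedicated prefix-sum algorithm) before SIMD instructions could help.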
Answered by Ganesh Gopalasubramanian
Vectorization need not be limited to a single wide register that holds several data elements, such as a 128-bit register holding four 32-bit values. It depends on the limitations of the architecture. Some architectures have several execution units, each with registers of its own. In that case, part of the data can be fed to each execution unit, and the result can be taken from a register belonging to that unit.
For example, consider the below case.
for(i=0; i < N; i++)
{
    a[i] = a[i] + b[i];
}
If I am working on an architecture that has two execution units, then my vector size is two. The loop above will be reframed as:
for(i=0; i<(N - N%2); i+=2)
{
    a[i] = a[i] + b[i];
    a[i+1] = a[i+1] + b[i+1];
}
NOTE: The 2 inside the for statement is derived from the vector size. The bound is N - N%2 rather than N/2 because i already advances by 2 each iteration; if N is odd, the last element must be handled separately.
Since I have two execution units, the two statements inside the loop will be fed to the two execution units, one each. The sums will be accumulated in the execution units separately. Finally, the accumulated values (from the two execution units) will be summed.
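The remark about accumulating separately and combining at the end applies most directly to a reduction, where the loop produces a single total. A minimal C++ sketch under that reading follows; the name sum_two_lanes is hypothetical, and the two local variables only model the two execution units.

```cpp
#include <cstddef>
#include <vector>

// Sums v using two independent partial sums, mirroring a hypothetical
// machine with two execution units (vector size 2).
int sum_two_lanes(const std::vector<int>& v) {
    int lane0 = 0;  // partial sum kept by the first unit
    int lane1 = 0;  // partial sum kept by the second unit
    const std::size_t n = v.size();
    const std::size_t main_end = n - n % 2;

    for (std::size_t i = 0; i < main_end; i += 2) {
        lane0 += v[i];
        lane1 += v[i + 1];
    }
    // Combine the two partial sums, then add the leftover element
    // when the size is odd.
    int total = lane0 + lane1;
    if (main_end < n) {
        total += v[main_end];
    }
    return total;
}
```

Integers are used here because regrouping additions does not change an integer result in the absence of overflow. With floating-point data the regrouping can change rounding, which is one reason compilers are cautious about vectorizing such loops on their own.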
The good practices are:
1. Constraints such as dependencies (between different iterations of the loop) need to be checked before vectorizing the loop.
2. Function calls inside the loop need to be avoided.
3. Pointer access can create aliasing, which needs to be prevented.
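As an illustration of the aliasing point, the sketch below uses __restrict to promise the compiler that the two pointers never refer to overlapping memory. Note that __restrict is a compiler extension (accepted by GCC, Clang, and MSVC), not standard C++, and the function name add_arrays is made up for this example. The loop body also contains no function calls, in line with the second practice.

```cpp
#include <cstddef>

// dst and src are declared __restrict (a non-standard extension):
// the caller promises the two ranges do not overlap, so the compiler
// does not have to assume that a store to dst[i] could change a
// later src[j].
void add_arrays(float* __restrict dst, const float* __restrict src,
                std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = dst[i] + src[i];
    }
}
```

Without such a promise, a compiler typically either emits a runtime overlap check in front of the vectorized loop or falls back to scalar code. Passing overlapping ranges to a function declared this way breaks the promise, so the qualifier should only be used when the caller can guarantee it.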
Answered by toto
It's SSE code generation.
You have a loop containing float matrix code such as matrix1[i][j] + matrix2[i][j], and the compiler generates SSE code for it.
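For a sense of what such generated code does, here is a hand-written sketch using SSE intrinsics from <xmmintrin.h>. It assumes the matrices are stored contiguously so they can be treated as flat float arrays; the function name add_sse and the macro EXAMPLE_HAVE_SSE are invented for this example, and a plain scalar loop is used when SSE is not available. Actual compiler output will differ in detail.

```cpp
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_add_ps, ...
#define EXAMPLE_HAVE_SSE 1
#endif

// Adds b into a element-wise. With SSE, four floats are processed
// per iteration in a 128-bit register; any leftover elements (or all
// elements when SSE is unavailable) go through the scalar loop.
void add_sse(float* a, const float* b, std::size_t n) {
    std::size_t i = 0;
#ifdef EXAMPLE_HAVE_SSE
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  // unaligned load of 4 floats
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(a + i, _mm_add_ps(va, vb));
    }
#endif
    for (; i < n; ++i) {
        a[i] += b[i];
    }
}
```

The unaligned load and store variants are used so the sketch works for any pointer; code that can guarantee 16-byte alignment could use the aligned forms instead.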
Answered by toto
Maybe also have a look at libSIMDx86 (source code).
A nice, well-explained example is: