C++ parallel for vs omp simd: when to use each?

Disclaimer: This page is a translated copy of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me), citing the original: http://stackoverflow.com/questions/14674049/

Date: 2020-08-27 18:34:28 · Source: igfitidea

Parallel for vs omp simd: when to use each?

Tags: c++, c, performance, openmp, simd

Asked by zr.

OpenMP 4.0 introduces a new construct called "omp simd". What is the benefit of using this construct over the old "parallel for"? When would each be a better choice over the other?

EDIT: Here is an interesting paper related to the SIMD directive.

Accepted answer by Jonathan Dursi

The linked-to standard is relatively clear (p. 13, lines 19-20):

When any thread encounters a simd construct, the iterations of the loop associated with the construct can be executed by the SIMD lanes that are available to the thread.

SIMD is a sub-thread thing. To make it more concrete, on a CPU you could imagine using simd directives to specifically request vectorization of chunks of loop iterations that individually belong to the same thread. It's exposing the multiple levels of parallelism that exist within a single multicore processor, in a platform-independent way. See for instance the discussion (along with the accelerator stuff) on this Intel blog post.

So basically, you'll want to use omp parallel to distribute work onto different threads, which can then migrate to multiple cores; and you'll want to use omp simd to make use of vector pipelines (say) within each core. Normally omp parallel would go on the "outside" to deal with coarser-grained parallel distribution of work, and omp simd would go around tight loops inside of that to exploit fine-grained parallelism.

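As a rough sketch of that division of labor, consider a row-major matrix-vector product (a hypothetical example, not from the original answer; the names nrows, ncols, a, x, and y are illustrative):

#pragma omp parallel for              // coarse-grained: rows are distributed across threads/cores
for (int i = 0; i < nrows; ++i) {
  float sum = 0.0f;
  #pragma omp simd reduction(+:sum)   // fine-grained: vector lanes within one thread
  for (int j = 0; j < ncols; ++j)
    sum += a[i*ncols + j] * x[j];
  y[i] = sum;
}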

Answer by minjang

A simple answer:

OpenMP used to be only about exploiting multiple threads for multiple cores. This new simd extension allows you to explicitly use SIMD instructions on modern CPUs, such as Intel's AVX/SSE and ARM's NEON.

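In its simplest form, that might look like this (a minimal sketch, assuming float arrays A, B, and C of length N):

#pragma omp simd   // ask the compiler to vectorize this loop within the current thread
for (int i = 0; i < N; ++i)
  A[i] = B[i] + C[i];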

(Note that a SIMD instruction is executed in a single thread and on a single core, by design. However, the meaning of SIMD can be expanded considerably for GPGPU. But I don't think you need to consider GPGPU for OpenMP 4.0.)

So, once you are familiar with SIMD instructions, you can use this new construct.



In a modern CPU, there are roughly three types of parallelism: (1) instruction-level parallelism (ILP), (2) thread-level parallelism (TLP), and (3) SIMD instructions (we could call this vector-level parallelism).

ILP is done automatically by your out-of-order CPUs, or compilers. You can exploit TLP using OpenMP's parallel for and other threading libraries. So, what about SIMD? Intrinsics were a way to use them (as well as compilers' automatic vectorization). OpenMP's simd is a new way to use SIMD.

Take a very simple example:

for (int i = 0; i < N; ++i)
  A[i] = B[i] + C[i];

The above code computes a sum of two N-dimensional vectors. As you can easily see, there is no (loop-carried) data dependency on the array A[]. This loop is embarrassingly parallel.

There could be multiple ways to parallelize this loop. For example, before OpenMP 4.0, this could be parallelized using only the parallel for construct. Each thread will perform N/#threads iterations on multiple cores.

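That thread-only version might look like this (a sketch; the actual scheduling of iterations is up to the implementation):

#pragma omp parallel for   // iterations are split across threads, one or more per core
for (int i = 0; i < N; ++i)
  A[i] = B[i] + C[i];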

However, you might think that using multiple threads for such a simple addition would be overkill. That is why vectorization exists, which is mostly implemented by SIMD instructions.

Using SIMD would look like this:

for (int i = 0; i < N; i += 8)
  VECTOR_ADD(A + i, B + i, C + i);  /* hypothetical 8-wide vector add */

This code assumes that (1) the SIMD instruction (VECTOR_ADD) is 256-bit, or 8-way (8 * 32 bits); and (2) N is a multiple of 8.

An 8-way SIMD instruction means that 8 items in a vector can be processed by a single machine instruction. Note that Intel's latest AVX provides such 8-way (32-bit * 8 = 256-bit) vector instructions.

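For illustration, one way the hypothetical VECTOR_ADD above could be realized is with AVX intrinsics (a sketch assuming float elements and N a multiple of 8):

#include <immintrin.h>

for (int i = 0; i < N; i += 8) {
  __m256 vb = _mm256_loadu_ps(B + i);              // load 8 floats from B
  __m256 vc = _mm256_loadu_ps(C + i);              // load 8 floats from C
  _mm256_storeu_ps(A + i, _mm256_add_ps(vb, vc));  // A[i..i+7] = B[i..i+7] + C[i..i+7]
}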

In SIMD, you still use a single core (again, this is only for conventional CPUs, not GPUs). But, you can use a hidden parallelism in hardware. Modern CPUs dedicate hardware resources for SIMD instructions, where each SIMD lane can be executed in parallel.

You can use thread-level parallelism at the same time. The above example can be further parallelized with parallel for, as sketched below.

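With OpenMP 4.0, the two levels can be combined in one composite construct (a sketch):

#pragma omp parallel for simd   // iterations are divided among threads,
for (int i = 0; i < N; ++i)     // and each thread's chunk is vectorized
  A[i] = B[i] + C[i];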

(However, I doubt how many loops can really be transformed into SIMDized loops. The OpenMP 4.0 specification seems a bit unclear on this. So, real performance and practical restrictions would depend on actual compilers' implementations.)



To summarize, the simd construct allows you to use SIMD instructions; in turn, more parallelism can be exploited on top of thread-level parallelism. However, I think actual implementations would matter.

Answer by tim18

Compilers aren't required to make simd optimization in a parallel region conditional on the presence of the simd clause. Compilers I'm familiar with continue to support nested loops (parallel outer, vector inner) in the same way as before.
In the past, OpenMP directives were usually taken to prevent loop-switching optimizations involving the outer parallelized loop (multiple loops with a collapse clause). This seems to have changed in a few compilers. OpenMP 4 opens up new possibilities, including optimization of a parallel outer loop with a non-vectorizable inner loop by a sort of strip mining, when omp parallel do [for] simd is set. ifort sometimes reports this as outer-loop vectorization when it is done without the simd clause. It may then be optimized for a smaller number of threads than the omp parallel do simd, which seems to need more threads than the simd vector width to pay off. Such a distinction might be inferred, as, without the simd clause, the compiler is implicitly asked to optimize for a loop count such as 100 or 300, while the simd clause requests unconditional simd optimization. gcc 4.9's omp parallel for simd looked quite effective when I had a 24-core platform.

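As a rough illustration of the strip mining described above, a compiler might conceptually transform omp parallel for simd into something like the following (a hypothetical sketch; CHUNK and the exact chunking strategy are implementation details):

#pragma omp parallel for                   // chunks are distributed across threads
for (int lo = 0; lo < N; lo += CHUNK) {
  int hi = (lo + CHUNK < N) ? lo + CHUNK : N;
  #pragma omp simd                         // each chunk is vectorized within its thread
  for (int i = lo; i < hi; ++i)
    A[i] = B[i] + C[i];
}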