What is the difference between the "static" and "dynamic" schedules in OpenMP in C++?

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, link to the original, and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/10850155/

What's the difference between "static" and "dynamic" schedule in OpenMP?

c++ multithreading openmp

Asked by Irineu Licks Filho

I started working with OpenMP using C++.

I have two questions:

  1. What is #pragma omp for schedule?
  2. What is the difference between dynamic and static?
Please, explain with examples.

Answered by Hristo Iliev

Others have since answered most of the question but I would like to point to some specific cases where a particular scheduling type is more suited than the others. Schedule controls how loop iterations are divided among threads. Choosing the right schedule can have great impact on the speed of the application.

The static schedule means that iteration blocks are mapped statically to the execution threads in a round-robin fashion. The nice thing with static scheduling is that the OpenMP run-time guarantees that if you have two separate loops with the same number of iterations and execute them with the same number of threads using static scheduling, then each thread will receive exactly the same iteration range(s) in both parallel regions. This is quite important on NUMA systems: if you touch some memory in the first loop, it will reside on the NUMA node where the executing thread was. Then in the second loop the same thread can access the same memory location faster since it will reside on the same NUMA node.

Imagine there are two NUMA nodes: node 0 and node 1, e.g. a two-socket Intel Nehalem board with 4-core CPUs in both sockets. Then threads 0, 1, 2, and 3 will reside on node 0 and threads 4, 5, 6, and 7 will reside on node 1:

|             | core 0 | thread 0 |
| socket 0    | core 1 | thread 1 |
| NUMA node 0 | core 2 | thread 2 |
|             | core 3 | thread 3 |

|             | core 4 | thread 4 |
| socket 1    | core 5 | thread 5 |
| NUMA node 1 | core 6 | thread 6 |
|             | core 7 | thread 7 |

Each core can access memory from each NUMA node, but remote access is slower (1.5x to 1.9x slower on Intel) than local node access. Suppose you run something like this:

char *a = (char *)malloc(8*4096);

#pragma omp parallel for schedule(static,1) num_threads(8)
for (int i = 0; i < 8; i++)
   memset(&a[i*4096], 0, 4096);

4096 bytes in this case is the standard size of one memory page on Linux on x86 if huge pages are not used. This code will zero the whole 32 KiB array a. The malloc() call just reserves virtual address space but does not actually "touch" the physical memory (this is the default behaviour unless some other version of malloc is used, e.g. one that zeroes the memory like calloc() does). Now this array is contiguous, but only in virtual memory. In physical memory half of it would lie in the memory attached to socket 0 and half in the memory attached to socket 1. This is so because different parts are zeroed by different threads, those threads reside on different cores, and there is something called the first touch NUMA policy, which means that memory pages are allocated on the NUMA node on which the thread that first "touched" the page resides.

|             | core 0 | thread 0 | a[0]     ... a[4095]
| socket 0    | core 1 | thread 1 | a[4096]  ... a[8191]
| NUMA node 0 | core 2 | thread 2 | a[8192]  ... a[12287]
|             | core 3 | thread 3 | a[12288] ... a[16383]

|             | core 4 | thread 4 | a[16384] ... a[20479]
| socket 1    | core 5 | thread 5 | a[20480] ... a[24575]
| NUMA node 1 | core 6 | thread 6 | a[24576] ... a[28671]
|             | core 7 | thread 7 | a[28672] ... a[32767]

Now let's run another loop like this:

#pragma omp parallel for schedule(static,1) num_threads(8)
for (int i = 0; i < 8; i++)
   memset(&a[i*4096], 1, 4096);

Each thread will access the already mapped physical memory and it will have the same mapping of thread to memory region as the one during the first loop. It means that threads will only access memory located in their local memory blocks which will be fast.

Now imagine that another scheduling scheme is used for the second loop: schedule(static,2). This will "chop" the iteration space into blocks of two iterations and there will be 4 such blocks in total. What will happen is that we will have the following thread to memory location mapping (through the iteration number):

|             | core 0 | thread 0 | a[0]     ... a[8191]  <- OK, same memory node
| socket 0    | core 1 | thread 1 | a[8192]  ... a[16383] <- OK, same memory node
| NUMA node 0 | core 2 | thread 2 | a[16384] ... a[24575] <- Not OK, remote memory
|             | core 3 | thread 3 | a[24576] ... a[32767] <- Not OK, remote memory

|             | core 4 | thread 4 | <idle>
| socket 1    | core 5 | thread 5 | <idle>
| NUMA node 1 | core 6 | thread 6 | <idle>
|             | core 7 | thread 7 | <idle>

Two bad things happen here:

  • threads 4 to 7 remain idle and half of the compute capability is lost;
  • threads 2 and 3 access non-local memory, and it will take them about twice as much time to finish, during which time threads 0 and 1 will remain idle.

So one of the advantages of using static scheduling is that it improves locality of memory access. The disadvantage is that a bad choice of scheduling parameters can ruin the performance.

Dynamic scheduling works on a "first come, first served" basis. Two runs with the same number of threads might (and most likely will) produce completely different "iteration space" -> "threads" mappings, as one can easily verify:

$ cat dyn.c
#include <stdio.h>
#include <omp.h>

int main (void)
{
  int i;

  #pragma omp parallel num_threads(8)
  {
    #pragma omp for schedule(dynamic,1)
    for (i = 0; i < 8; i++)
      printf("[1] iter %0d, tid %0d\n", i, omp_get_thread_num());

    #pragma omp for schedule(dynamic,1)
    for (i = 0; i < 8; i++)
      printf("[2] iter %0d, tid %0d\n", i, omp_get_thread_num());
  }

  return 0;
}

$ icc -openmp -o dyn.x dyn.c

$ OMP_NUM_THREADS=8 ./dyn.x | sort
[1] iter 0, tid 2
[1] iter 1, tid 0
[1] iter 2, tid 7
[1] iter 3, tid 3
[1] iter 4, tid 4
[1] iter 5, tid 1
[1] iter 6, tid 6
[1] iter 7, tid 5
[2] iter 0, tid 0
[2] iter 1, tid 2
[2] iter 2, tid 7
[2] iter 3, tid 3
[2] iter 4, tid 6
[2] iter 5, tid 1
[2] iter 6, tid 5
[2] iter 7, tid 4

(the same behaviour is observed when gcc is used instead)

If the sample code from the static section was run with dynamic scheduling instead, there would be only a 1/70 (1.4%) chance that the original locality is preserved and a 69/70 (98.6%) chance that remote access occurs (assuming each thread ends up with exactly one iteration, there are C(8,4) = 70 equally likely ways to choose which four iterations land on the threads of node 0, and only one of those choices reproduces the node-local mapping of the first loop). This fact is often overlooked and hence suboptimal performance is achieved.

There is another reason to choose between static and dynamic scheduling - workload balancing. If each iteration takes a time that differs vastly from the mean, then a high work imbalance might occur in the static case. Take as an example the case where the time to complete an iteration grows linearly with the iteration number. If the iteration space is divided statically between two threads, the second one will have three times more work than the first one, and hence for 2/3 of the compute time the first thread will be idle. Dynamic scheduling introduces some additional overhead but in that particular case will lead to a much better workload distribution. A special kind of dynamic scheduling is guided, where smaller and smaller iteration blocks are given to each task as the work progresses. The sketch below makes this imbalance concrete.

staticdynamic调度之间进行选择还有另一个原因- 工作负载平衡。如果每次迭代花费的时间与完成的平均时间有很大不同,那么在静态情况下可能会出现高度的工作不平衡。以完成一次迭代的时间随迭代次数线性增长的情况为例。如果迭代空间在两个线程之间静态划分,则第二个线程的工作量将是第一个线程的三倍,因此在 2/3 的计算时间中,第一个线程将处于空闲状态。动态调度引入了一些额外的开销,但在这种特殊情况下会导致更好的工作负载分配。一种特殊的dynamic调度是guided随着工作的进行,给每个任务分配越来越小的迭代块。

Since precompiled code could be run on various platforms it would be nice if the end user could control the scheduling. That's why OpenMP provides the special schedule(runtime) clause. With runtime scheduling the type is taken from the content of the environment variable OMP_SCHEDULE. This allows testing different scheduling types without recompiling the application and also lets the end user fine-tune the schedule for his or her platform.

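A minimal sketch of how this looks in practice (the file names rt.c and rt.x are illustrative):

#include <stdio.h>
#include <omp.h>

int main (void)
{
  int i;

  /* the schedule is not fixed at compile time; it is taken from the
     OMP_SCHEDULE environment variable when the loop is executed */
  #pragma omp parallel for schedule(runtime)
  for (i = 0; i < 8; i++)
    printf("iter %0d, tid %0d\n", i, omp_get_thread_num());

  return 0;
}

$ gcc -fopenmp -o rt.x rt.c
$ OMP_SCHEDULE="static,2" ./rt.x | sort
$ OMP_SCHEDULE="dynamic" ./rt.x | sort

The same binary can then be benchmarked with different schedules simply by changing the environment variable.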

Answered by Eugeniu Torica

I think the misunderstanding comes from the fact that you are missing the point about OpenMP. In a sentence, OpenMP allows you to execute your program faster by enabling parallelism. In a program, parallelism can be enabled in many ways and one of them is by using threads. Suppose you have an array:

[1,2,3,4,5,6,7,8,9,10]

and you want to increment all elements by 1 in this array.

If you are going to use

#pragma omp for schedule(static, 5)

it means that each of the threads will be assigned 5 contiguous iterations. In this case the first thread will take 5 numbers, the second one will take another 5, and so on until there is no more data to process or the maximum number of threads is reached (typically equal to the number of cores). The division of the workload is fixed up front rather than negotiated while the loop runs. A sketch of this scenario is shown below.

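A minimal, self-contained sketch of that scenario (the printout and the num_threads(2) clause are illustrative assumptions) could look like this; with two threads, thread 0 handles iterations 0-4 and thread 1 handles iterations 5-9:

#include <stdio.h>
#include <omp.h>

int main (void)
{
  int i;
  int a[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};

  /* each chunk of 5 consecutive iterations goes to one thread */
  #pragma omp parallel for schedule(static, 5) num_threads(2)
  for (i = 0; i < 10; i++) {
    a[i] += 1;
    printf("iter %0d, tid %0d\n", i, omp_get_thread_num());
  }

  return 0;
}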

In case of

#pragma omp for schedule(dynamic, 5)

The work will be shared among threads at runtime, which involves more overhead. The second parameter specifies the size of the chunk of data, as in the variant sketched below.

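Under the same assumptions as the sketch above, the only change needed is the schedule clause; which thread gets which chunk of 5 is then decided as threads become free, so the mapping can differ from run to run:

#pragma omp parallel for schedule(dynamic, 5) num_threads(2)
for (i = 0; i < 10; i++)
  a[i] += 1;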

Not being very familiar with OpenMP, I would venture that the dynamic type is more appropriate when the compiled code is going to run on a system that has a different configuration than the one on which the code was compiled.

I would recommend the page below, which discusses the techniques used for parallelizing code, along with their preconditions and limitations:

https://computing.llnl.gov/tutorials/parallel_comp/

Additional links:
http://en.wikipedia.org/wiki/OpenMP
Difference between static and dynamic schedule in openMP in C
http://openmp.blogspot.se/

Answered by Sebastian Mach

The loop partitioning scheme is different. The static scheduler would divide a loop over N elements into M subsets, and each subset would then contain strictly N/M elements.

The dynamic approach calculates the size of the subsets on the fly, which can be useful if the subsets' computation times vary.

The static approach should be used if computation times do not vary much.
