C++: Reducing on an array in OpenMP

Notice: this page is a translation of a popular StackOverflow Q&A, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same CC BY-SA terms and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/20413995/

Date: 2020-08-27 23:07:17 · Source: igfitidea

Reducing on array in OpenMP

Tags: c++, parallel-processing, openmp, reduction

Asked by user2891902

I am trying to parallelize the following program, but don't know how to reduce on an array. I know it is not possible to do so, but is there an alternative? Thanks. (I added reduction on m which is wrong but would like to have an advice on how to do it.)


#include <iostream>
#include <stdio.h>
#include <time.h>
#include <omp.h>
using namespace std;

int main ()
{
  int A [] = {84, 30, 95, 94, 36, 73, 52, 23, 2, 13};
  int S [10];

  time_t start_time = time(NULL);
  #pragma omp parallel for private(m) reduction(+:m)
  for (int n=0 ; n<10 ; ++n ){
    for (int m=0; m<=n; ++m){
      S[n] += A[m];
    }
  }
  time_t end_time = time(NULL);
  cout << end_time-start_time;

  return 0;
}

Answer by Z boson

Yes, it is possible to do an array reduction with OpenMP. Fortran even has a construct for this; in C/C++ you have to do it yourself. Here are two ways to do it.


The first method makes a private version of S for each thread, fills them in parallel, and then merges them into S in a critical section (see the code below). The second method makes an array with dimensions 10*nthreads, fills this array in parallel, and then merges it into S without using a critical section. The second method is much more complicated and can have cache issues, especially on multi-socket systems, if you are not careful. For more details see Fill histograms (array reduction) in parallel with OpenMP without using a critical section.


First method


int A [] = {84, 30, 95, 94, 36, 73, 52, 23, 2, 13};
int S [10] = {0};
#pragma omp parallel
{
    int S_private[10] = {0};
    #pragma omp for
    for (int n=0 ; n<10 ; ++n ) {
        for (int m=0; m<=n; ++m){
            S_private[n] += A[m];
        }
    }
    #pragma omp critical
    {
        for(int n=0; n<10; ++n) {
            S[n] += S_private[n];
        }
    }
}

Second method


int A [] = {84, 30, 95, 94, 36, 73, 52, 23, 2, 13};
int S [10] = {0};
int *S_private;
#pragma omp parallel
{
    const int nthreads = omp_get_num_threads();
    const int ithread = omp_get_thread_num();

    #pragma omp single 
    {
        S_private = new int[10*nthreads];
        for(int i=0; i<(10*nthreads); i++) S_private[i] = 0;
    }
    #pragma omp for
    for (int n=0 ; n<10 ; ++n )
    {
        for (int m=0; m<=n; ++m){
            S_private[ithread*10+n] += A[m];
        }
    }
    #pragma omp for
    for(int i=0; i<10; i++) {
        for(int t=0; t<nthreads; t++) {
            S[i] += S_private[10*t + i];
        }
    }
}
delete[] S_private;

Answer by NameOfTheRose

I have two remarks concerning Zboson's answer:
1. Method 1 is certainly correct, but the reduction loop actually runs serially because of the #pragma omp critical, which is of course necessary: the partial arrays are local to each thread, and the corresponding reduction has to be done by the thread owning the array.
2. Method 2: the initialization loop can be moved outside the single section and thus becomes parallelizable.


The following program implements array reduction using the OpenMP 4.0 user-defined reduction facility:


/* Compile with:
     gcc -Wall -fopenmp -o ar ar.c
   Run with:
     OMP_DISPLAY_ENV=TRUE OMP_NUM_THREADS=10 OMP_NESTED=TRUE ./ar
*/
#include <stdio.h>
#include <omp.h>
struct m10x1 {int v[10];};
int A [] =       {84, 30, 95, 94, 36, 73, 52, 23, 2, 13};  
struct m10x1 S = {{ 0,  0,  0,  0,  0,  0,  0,  0, 0,  0}};
int n,m=0;

void print_m10x1(struct m10x1 x){
  int i;
  for(i=0;i<10;i++) printf("%d ",x.v[i]);
  printf("\n");
}

struct m10x1 add_m10x1(struct m10x1 x,struct m10x1 y){
  struct m10x1 r ={{ 0,  0,  0,  0,  0,  0,  0,  0, 0,  0}};
  int i;
  for (i=0;i<10;i++) r.v[i]=x.v[i]+y.v[i];
  return r;
}

#pragma omp declare reduction(m10x1Add: struct m10x1: \
omp_out=add_m10x1(omp_out, omp_in)) initializer( \
omp_priv={{ 0,  0,  0,  0,  0,  0,  0,  0, 0,  0}} )

int main ()
{
  /* m is a file-scope variable, so it must be made private here,
     otherwise the inner loop races on it across threads */
  #pragma omp parallel for private(m) reduction(m10x1Add: S)
  for ( n=0 ; n<10 ; ++n )
    {
      for (m=0; m<=n; ++m){
        S.v[n] += A[m];
      }
    }
  print_m10x1(S);
}

This follows verbatim the complex-number reduction example on page 97 of OpenMP 4.0 features.


Although the parallel version works correctly, there are probably performance issues, which I have not investigated:


  1. add_m10x1 inputs and output are passed by value.
  2. The loop in add_m10x1 is run serially.

Said "performance issues" are of my own making and it is completely straightforward not to introduce them:


  1. Parameters to add_m10x1 should be passed by reference (via pointers in C, references in C++).
  2. The computation in add_m10x1 should be done in place.
  3. add_m10x1 should be declared void and the return statement deleted. The result is returned via the first parameter.
  4. The declare reduction pragma should be modified accordingly; the combiner should be just a function call and not an assignment (v4.0 specs p181 lines 9,10).
  5. The for loop in add_m10x1 can be parallelized via an omp parallel for pragma.
  6. Parallel nesting should be enabled (e.g. via OMP_NESTED=TRUE).

The modified part of the code then is:


void add_m10x1(struct m10x1 * x,struct m10x1 * y){
  int i;
  #pragma omp parallel for
  for (i=0;i<10;i++) x->v[i] += y->v[i];
}

#pragma omp declare reduction(m10x1Add: struct m10x1: \
add_m10x1(&omp_out, &omp_in)) initializer( \
omp_priv={{ 0,  0,  0,  0,  0,  0,  0,  0, 0,  0}} )

Answer by High Performance Mark

If translating your code to Fortran, which can use arrays in OpenMP reduction operations, doesn't appeal, you could use a bunch of temporary variables. For example


int S0, S1, S2, ..., S9;
...
#pragma omp parallel for private(...) shared(S0, S1, S2, ..., S9) \
            reduction(+:S0, S1, S2, ..., S9)
for ...

This leaves you with the unappealing prospect of having to write some kind of if or case statement to determine which of the temporaries is to be updated. If your code is just an example you want to use for learning, carry on.


But if your intention is genuinely to write a parallel prefix sum routine then search around. This is a good place to start.
