Parallel for loop in OpenMP (C++)

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must likewise follow the CC BY-SA license and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/11773115/

Date: 2020-08-27 15:31:01  Source: igfitidea

Parallel for loop in openmp

c++ · parallel-processing · openmp

Asked by dsign

I'm trying to parallelize a very simple for-loop, but this is my first attempt at using OpenMP in a long time. I'm baffled by the run times. Here is my code:


#include <cmath>     // cos, sin, sqrt
#include <iostream>  // cout, endl
#include <vector>
#include <algorithm>

using namespace std;

int main () 
{
    int n=400000,  m=1000;  
    double x=0,y=0;
    double s=0;
    vector< double > shifts(n,0);


    #pragma omp parallel for 
    for (int j=0; j<n; j++) {

        double r=0.0;
        for (int i=0; i < m; i++){

            double rand_g1 = cos(i/double(m));
            double rand_g2 = sin(i/double(m));     

            x += rand_g1;
            y += rand_g2;
            r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2);
        }
        shifts[j] = r / m;
    }

    cout << *std::max_element( shifts.begin(), shifts.end() ) << endl;
}

I compile it with


g++ -O3 testMP.cc -o testMP  -I /opt/boost_1_48_0/include

that is, no "-fopenmp", and I get these timings:


real    0m18.417s
user    0m18.357s
sys     0m0.004s

when I do use "-fopenmp",


g++ -O3 -fopenmp testMP.cc -o testMP  -I /opt/boost_1_48_0/include

I get these numbers for the times:


real    0m6.853s
user    0m52.007s
sys     0m0.008s

which doesn't make sense to me. How can using eight cores result in only a 3-fold performance increase? Am I coding the loop correctly?


Answered by Hristo Iliev

You should make use of the OpenMP reduction clause for x and y:


#pragma omp parallel for reduction(+:x,y)
for (int j=0; j<n; j++) {

    double r=0.0;
    for (int i=0; i < m; i++){

        double rand_g1 = cos(i/double(m));
        double rand_g2 = sin(i/double(m));     

        x += rand_g1;
        y += rand_g2;
        r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2);
    }
    shifts[j] = r / m;
}

With reduction, each thread accumulates its own partial sums in x and y, and at the end all partial values are summed together to obtain the final values.


Serial version:
25.05s user 0.01s system 99% cpu 25.059 total
OpenMP version w/ OMP_NUM_THREADS=16:
24.76s user 0.02s system 1590% cpu 1.559 total

See - superlinear speed-up :)


Answered by Basheer AL-MOMANI

Let's try to understand how to parallelize a simple for loop using OpenMP:


#pragma omp parallel
#pragma omp for
    for(i = 1; i < 13; i++)
    {
       c[i] = a[i] + b[i];
    }

Assume that we have 3 available threads; this is what will happen:


[figure: the 12 iterations divided among the 3 threads]


firstly


  • Threads are assigned an independent set of iterations

and finally


  • Threads must wait at the end of work-sharing construct

Answered by Nox

What you can achieve at most(!) is a linear speedup. I don't remember off-hand which of the Linux time fields is which, but I'd suggest you use time.h or (in C++11) chrono and measure the runtime directly from the program. Best to wrap the entire code in a loop, run it 10 times, and average to get an approximate runtime for the program.


Furthermore, you have, in my opinion, a problem with x and y: they are shared variables updated by every thread, which does not adhere to the data-locality paradigm of parallel programming.
