Parallel for loop in OpenMP (C++)

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must likewise follow the CC BY-SA license and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/11773115/

Date: 2020-08-27 15:31:01  Source: igfitidea

Parallel for loop in openmp

c++ · parallel-processing · openmp

Asked by dsign

I'm trying to parallelize a very simple for-loop, but this is my first attempt at using OpenMP in a long time. I'm baffled by the run times. Here is my code:


#include <cmath>     // cos, sin, sqrt
#include <iostream>  // cout, endl
#include <vector>
#include <algorithm>

using namespace std;

int main () 
{
    int n=400000,  m=1000;  
    double x=0,y=0;
    double s=0;
    vector< double > shifts(n,0);


    #pragma omp parallel for 
    for (int j=0; j<n; j++) {

        double r=0.0;
        for (int i=0; i < m; i++){

            double rand_g1 = cos(i/double(m));
            double rand_g2 = sin(i/double(m));     

            x += rand_g1;
            y += rand_g2;
            r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2);
        }
        shifts[j] = r / m;
    }

    cout << *std::max_element( shifts.begin(), shifts.end() ) << endl;
}

I compile it with


g++ -O3 testMP.cc -o testMP  -I /opt/boost_1_48_0/include

that is, no "-fopenmp", and I get these timings:


real    0m18.417s
user    0m18.357s
sys     0m0.004s

when I do use "-fopenmp",


g++ -O3 -fopenmp testMP.cc -o testMP  -I /opt/boost_1_48_0/include

I get these numbers for the times:


real    0m6.853s
user    0m52.007s
sys     0m0.008s

which doesn't make sense to me. How can using eight cores result in only a 3-fold performance increase? Am I coding the loop correctly?


Answered by Hristo Iliev

You should make use of the OpenMP reduction clause for x and y:


#pragma omp parallel for reduction(+:x,y)
for (int j=0; j<n; j++) {

    double r=0.0;
    for (int i=0; i < m; i++){

        double rand_g1 = cos(i/double(m));
        double rand_g2 = sin(i/double(m));     

        x += rand_g1;
        y += rand_g2;
        r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2);
    }
    shifts[j] = r / m;
}

With reduction, each thread accumulates its own partial sums in x and y, and at the end all partial values are summed together to obtain the final values.


Serial version:
25.05s user 0.01s system 99% cpu 25.059 total
OpenMP version w/ OMP_NUM_THREADS=16:
24.76s user 0.02s system 1590% cpu 1.559 total

See - superlinear speed-up :)


Answered by Basheer AL-MOMANI

Let's try to understand how to parallelize a simple for loop using OpenMP:


#pragma omp parallel
#pragma omp for
    for(i = 1; i < 13; i++)
    {
       c[i] = a[i] + b[i];
    }

Assume that we have 3 available threads; this is what will happen:


[figure: the 12 iterations divided among the 3 threads]


firstly


  • Threads are assigned an independent set of iterations

and finally


  • Threads must wait at the end of work-sharing construct

Answered by Nox

What you can achieve at most(!) is a linear speedup. I don't remember off-hand which of the Linux time fields is which, but I'd suggest you use time.h or (in C++11) chrono and measure the runtime directly from the program. Best to wrap the entire code in a loop, run it 10 times, and average to get an approximate runtime for the program.


Furthermore, you have, in my opinion, a problem with x and y: they are shared variables updated by every thread, which does not adhere to the data-locality paradigm of parallel programming.
