C++: Parallel for loop in OpenMP
Disclaimer: this page is a translation mirror of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/11773115/
Parallel for loop in openmp
Asked by dsign
I'm trying to parallelize a very simple for-loop, but this is my first attempt at using openMP in a long time. I'm getting baffled by the run times. Here is my code:
#include <vector>
#include <algorithm>
#include <cmath>     // cos, sin, sqrt
#include <iostream>  // cout, endl

using namespace std;

int main ()
{
    int n = 400000, m = 1000;
    double x = 0, y = 0;
    double s = 0;
    vector<double> shifts(n, 0);

    #pragma omp parallel for
    for (int j = 0; j < n; j++) {
        double r = 0.0;
        for (int i = 0; i < m; i++) {
            double rand_g1 = cos(i/double(m));
            double rand_g2 = sin(i/double(m));
            x += rand_g1;
            y += rand_g2;
            r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2);
        }
        shifts[j] = r / m;
    }

    cout << *std::max_element( shifts.begin(), shifts.end() ) << endl;
}
I compile it with
g++ -O3 testMP.cc -o testMP -I /opt/boost_1_48_0/include
that is, no "-fopenmp", and I get these timings:
real 0m18.417s
user 0m18.357s
sys 0m0.004s
when I do use "-fopenmp",
g++ -O3 -fopenmp testMP.cc -o testMP -I /opt/boost_1_48_0/include
I get these numbers for the times:
real 0m6.853s
user 0m52.007s
sys 0m0.008s
which doesn't make sense to me. How can using eight cores result in only a 3-fold performance increase? Am I coding the loop correctly?
Answered by Hristo Iliev
You should make use of the OpenMP reduction clause for x and y:
#pragma omp parallel for reduction(+:x,y)
for (int j = 0; j < n; j++) {
    double r = 0.0;
    for (int i = 0; i < m; i++) {
        double rand_g1 = cos(i/double(m));
        double rand_g2 = sin(i/double(m));
        x += rand_g1;
        y += rand_g2;
        r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2);
    }
    shifts[j] = r / m;
}
With reduction, each thread accumulates its own partial sums in x and y, and at the end all partial values are summed together to obtain the final values.
Serial version:
25.05s user 0.01s system 99% cpu 25.059 total
OpenMP version w/ OMP_NUM_THREADS=16:
24.76s user 0.02s system 1590% cpu 1.559 total
See - superlinear speed-up :)
Answered by Basheer AL-MOMANI
Let's try to understand how to parallelize a simple for loop using OpenMP:
#pragma omp parallel
#pragma omp for
for (i = 1; i < 13; i++)
{
    c[i] = a[i] + b[i];
}
Assume that we have 3 available threads; this is what will happen.
First:
- Threads are assigned an independent set of iterations

And finally:
- Threads must wait at the end of the work-sharing construct
Answered by Nox
The most you can achieve(!) is a linear speedup. I don't remember offhand which field is which in the Linux time output, but I'd suggest you use time.h or (in C++11) chrono and measure the runtime directly from the program. Best to pack the entire code into a loop, run it 10 times, and average to get an approximate runtime for the program.
Furthermore, IMO you've got a problem with x and y, which do not adhere to the data-locality paradigm of parallel programming.