Notice: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute the content to the original authors (not me). Original: http://stackoverflow.com/questions/16751445/

Parallelize nested for loop with openMP

Tags: c, for-loop, parallel-processing, openmp

Asked by seb

I am trying to optimize the nested for loop in the function generate_histogram() below with OpenMP. I have tried many different combinations of pragmas, based on what I read in this SE post.

The problem is that the nested for loop performs faster without openMP than with openMP!

If I try to parallelize my code with reduction instead of the atomic pragma, I end up with netchunk fails. Does anybody know a fancy tweak for this one? I am trying to bin data into a histogram. In the real code the histogram size is variable, unlike in the snippet below.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#define float_t float
#include <time.h>
#include <omp.h>

float_t generate_histogram(float_t **matrix, int *histogram, int mat_size, int hist_size)
{
int i,j,k,count;
float_t max = 0.;
float_t sum;

//set histogram to zero everywhere
for(i = 0; i < hist_size; i++)
    histogram[i] = 0;


//matrix computations
#pragma omp parallel for private(i) shared(histogram,j,k,max) schedule(dynamic)
//#pragma omp parallel for schedule(runtime)
for (i = 1; i < (mat_size-1); i++)
{
    #pragma omp parallel for private(j,k) shared(histogram,max) schedule(dynamic)
    //pragma omp prallel for schedule(dynamic)
    for(j = 1; j < (mat_size-1); j++)
    {

        //assign current matrix[i][j] to element in order to reduce memory access
        sum = fabs(matrix[i][j]-matrix[i-1][j]) + fabs(matrix[i][j] - matrix[i+1][j])
            + fabs(matrix[i][j]-matrix[i][j-1]) + fabs(matrix[i][j] - matrix[i][j+1]);

        //compute index of histogram bin
        k = (int)(sum * (float)mat_size);
        #pragma omp atomic
        histogram[k] += 1;

        //keep track of largest element
        if(sum > max)
            max = sum;

    }//end inner for
}//end outer for

return max;
}


int main(void)
{
int i,j,N,boxes;
N = 10000;
float_t **matrix;
int* histogram;
boxes = N / 2;

//allocate a matrix with some numbers
matrix = calloc(N, sizeof(float_t **));
for(i = 0; i < N; i++)
    matrix[i] = calloc(N, sizeof(float_t *));
for(i = 0; i < N; i++)
    for(j = 0; j < N; j++)
        matrix[i][j] = 1./(float_t) N * (float_t) i;


histogram = malloc(boxes * sizeof(int));

generate_histogram(matrix, histogram, N, boxes);

}

Answer

This is an interesting problem. I fixed your code. @KunHuang had the right idea but you have several more problems with private and shared variables.


Your old function is still called generate_histogram; in it I commented out the OpenMP pragmas. The new one, which uses OpenMP, is called generate_histogram_omp. The old code finishes in 0.67 seconds on my system (Ivy Bridge, dual core) and the new code finishes in 0.32 seconds.

Also, I tried fusing your loops, but that made the performance much worse (probably a cache issue), so I only parallelize the outer loop; I still get a 2x speedup on two cores with the current code. I left the fused code commented out in case you want to play with it.


Lastly, the initial values of your matrix barely fill the histogram, i.e. only a few bins receive any counts.

I compiled with

g++ hist.cpp -o hist -fopenmp -O3

The code:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#define float_t float
#include <time.h>
#include <omp.h>

float_t generate_histogram(float_t **matrix, int *histogram, int mat_size, int hist_size)
{
int i,j,k,count;
float_t max = 0.;
float_t sum;

//set histogram to zero everywhere
for(i = 0; i < hist_size; i++)
    histogram[i] = 0;


//matrix computations
//#pragma omp parallel for schedule(runtime)
for (i = 1; i < (mat_size-1); i++)
{
    //pragma omp prallel for schedule(dynamic)
    for(j = 1; j < (mat_size-1); j++)
    {

        //assign current matrix[i][j] to element in order to reduce memory access
        sum = fabs(matrix[i][j]-matrix[i-1][j]) + fabs(matrix[i][j] - matrix[i+1][j])
            + fabs(matrix[i][j]-matrix[i][j-1]) + fabs(matrix[i][j] - matrix[i][j+1]);

        //compute index of histogram bin
        k = (int)(sum * (float)mat_size);
        histogram[k] += 1;

        //keep track of largest element
        if(sum > max)
            max = sum;

    }//end inner for
}//end outer for

return max;
}

float_t generate_histogram_omp(float_t **matrix, int *histogram, int mat_size, int hist_size) {
    float_t max = 0.;
    //set histogram to zero everywhere
    int i;
    for(i = 0; i < hist_size; i++)
        histogram[i] = 0;

    //matrix computations
    #pragma omp parallel 
    {
        int *histogram_private = (int*)malloc(hist_size * sizeof(int));
        int i;
        for(i = 0; i < hist_size; i++)
            histogram_private[i] = 0;
        float_t max_private = 0.;
        int n;
        int j;
        #pragma omp for
        for (i = 1; i < (mat_size-1); i++) {
            for(j = 1; j < (mat_size-1); j++) {
         //   for (n=0; n < (mat_size-2)*(mat_size-2); n++) {
          //      int i = n/(mat_size-2)+1;
          //      int j = n%(mat_size-2)+1;

                float_t sum = fabs(matrix[i][j]-matrix[i-1][j]) + fabs(matrix[i][j] - matrix[i+1][j])
                    + fabs(matrix[i][j]-matrix[i][j-1]) + fabs(matrix[i][j] - matrix[i][j+1]);

                //compute index of histogram bin
                int k = (int)(sum * (float)mat_size);
                histogram_private[k] += 1;

                //keep track of largest element
                if(sum > max_private)
                    max_private = sum;
            }
        }
        #pragma omp critical
        {

            for(i = 0; i < hist_size; i++)
                histogram[i] += histogram_private[i];
            if(max_private>max)
                max = max_private;
        }

        free(histogram_private);
    }
    return max;
}

int compare_hists(int *hist1, int *hist2, int N) {
    int i;
    int diff = 0;
    for(i =0; i < N; i++) {
        int tmp = hist1[i] - hist2[i];
        //accumulate absolute differences so opposite errors cannot cancel
        diff += abs(tmp);
        if(tmp!=0) {
            printf("i %d, hist1 %d, hist2  %d\n", i, hist1[i], hist2[i]);
        }
    }
    return diff;
}

int main(void) {
    int i,j,N,boxes;
    N = 10000;
    float_t **matrix;
    int* histogram1;
    int* histogram2;
    boxes = N / 2;

    //allocate a matrix with some numbers
    //N row pointers, then N floats per row
    matrix = (float_t**)calloc(N, sizeof(float_t *));
    for(i = 0; i < N; i++)
        matrix[i] = (float_t*)calloc(N, sizeof(float_t));
    for(i = 0; i < N; i++)
        for(j = 0; j < N; j++)
            matrix[i][j] = 1./(float_t) N * (float_t) i;


    histogram1 = (int*)malloc(boxes * sizeof(int));
    histogram2 = (int*)malloc(boxes * sizeof(int));

    for(i = 0; i<boxes; i++) {
        histogram1[i] = 0;
        histogram2[i] = 0;
    }
    double dtime;
    dtime = omp_get_wtime();
    generate_histogram(matrix, histogram1, N, boxes);
    dtime = omp_get_wtime() - dtime;
    printf("time %f\n", dtime);

    dtime = omp_get_wtime();
    generate_histogram_omp(matrix, histogram2, N, boxes);
    dtime = omp_get_wtime() - dtime;
    printf("time %f\n", dtime);

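    int diff = compare_hists(histogram1, histogram2, boxes);
    printf("diff %d\n", diff);

    //release the matrix and both histograms
    for(i = 0; i < N; i++)
        free(matrix[i]);
    free(matrix);
    free(histogram1);
    free(histogram2);
    return 0;
}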

Answer by konjac

In OpenMP versions before 4.5, the reduction clause cannot be applied to an array or a struct in C/C++, as mentioned here: https://computing.llnl.gov/tutorials/openMP/#REDUCTION.


I think you can declare multiple copies of histogram, one for each thread. Then use another OpenMP loop to add them up.
