C++ CUDA: how to get grid, block, and thread size and parallelize a non-square matrix calculation

Note: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/5643178/

Time: 2020-08-28 18:34:46  Source: igfitidea

CUDA how to get grid, block, thread size and parallelize non-square matrix calculation

c++ visual-studio-2008 gpu cuda

Asked by user656210

I am new to CUDA and need help understanding some things. I need help parallelizing these two for loops, specifically how to set up dimBlock and dimGrid to make this run faster. I know this looks like the vector add example in the SDK, but that example is only for square matrices, and when I try to modify that code for my 128 x 1024 matrix it doesn't work properly.

__global__ void mAdd(float* A, float* B, float* C)
{
    for(int i = 0; i < 128; i++)
    {
        for(int j = 0; j < 1024; j++)
        {
            C[i * 1024 + j] = A[i * 1024 + j] + B[i * 1024 + j];
        }
    }
}

This code is part of a larger loop and is the simplest portion of the code, so I decided to try to parallelize this and learn CUDA at the same time. I have read the guides but still do not understand how to choose the proper number of grids/blocks/threads and use them effectively.

Answered by talonmies

As you have written it, that kernel is completely serial. Every thread launched to execute it is going to perform the same work.

The main idea behind CUDA (and OpenCL and other similar "single program, multiple data" type programming models) is that you take a "data parallel" operation - so one where the same, largely independent, operation must be performed many times - and write a kernel which performs that operation. A large number of (semi)autonomous threads are then launched to perform that operation across the input data set.

In your array addition example, the data parallel operation is

C[k] = A[k] + B[k];

for all k with 0 <= k < 128 * 1024. Each addition operation is completely independent and has no ordering requirements, and therefore can be performed by a different thread. To express this in CUDA, one might write the kernel like this:

__global__ void mAdd(float* A, float* B, float* C, int n)
{
    int k = threadIdx.x + blockIdx.x * blockDim.x;

    if (k < n)
        C[k] = A[k] + B[k];
}

[disclaimer: code written in browser, not tested, use at own risk]

Here, the inner and outer loop from the serial code are replaced by one CUDA thread per operation, and I have added a limit check in the code so that in cases where more threads are launched than required operations, no buffer overflow can occur. If the kernel is then launched like this:

const int n = 128 * 1024;
int blocksize = 512; // value usually chosen by tuning and hardware constraints
int nblocks = n / blocksize; // value determined by block size and total work

mAdd<<<nblocks, blocksize>>>(A, B, C, n);

Then 256 blocks, each containing 512 threads, will be launched onto the GPU hardware to perform the array addition operation in parallel. Note that if the input data size were not an exact multiple of the block size, the number of blocks would need to be rounded up to cover the full input data set.
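
The rounding up mentioned here is usually written as an integer ceiling division. The following is a minimal host-side sketch, not part of the original answer: blocksFor and emulateCoverage are hypothetical names, and the loop only emulates on the CPU the index arithmetic and bounds check that the mAdd kernel performs, so it builds without a GPU.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not from the original answer): smallest number of
// blocks such that nblocks * blocksize >= n (integer ceiling division).
int blocksFor(int n, int blocksize)
{
    return (n + blocksize - 1) / blocksize;
}

// Host-only emulation of the launch: visits every (block, thread) pair,
// applies the same index formula and bounds check as the mAdd kernel,
// and counts how many times each element would be written.
std::vector<int> emulateCoverage(int n, int blocksize)
{
    std::vector<int> writes(n, 0);
    const int nblocks = blocksFor(n, blocksize);
    for (int blockIdxX = 0; blockIdxX < nblocks; ++blockIdxX) {
        for (int threadIdxX = 0; threadIdxX < blocksize; ++threadIdxX) {
            int k = threadIdxX + blockIdxX * blocksize;
            if (k < n)          // surplus threads in the last block do nothing
                ++writes[k];
        }
    }
    return writes;
}
```

In a real launch the helper's result would take the place of n / blocksize, for example mAdd<<<blocksFor(n, blocksize), blocksize>>>(A, B, C, n);. The if (k < n) check in the kernel is what makes the surplus threads in the last block harmless.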

All of the above is a hugely simplified overview of the CUDA paradigm for a very trivial operation, but perhaps it gives enough insight for you to continue on your own. CUDA is rather mature these days and there is a lot of good, free educational material floating around the web that you can probably use to further illuminate many of the aspects of the programming model I have glossed over in this answer.
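
One aspect the flat-index kernel sidesteps is the matrix shape itself: because it treats the data as n contiguous elements, the 128 x 1024 dimensions never enter the launch configuration. If a two-dimensional launch is preferred, the same rounding-up rule can be applied per dimension. The sketch below is host-side only and rests on assumptions: Dim2 is a hypothetical stand-in for CUDA's dim3, gridFor and flatIndex are illustrative names, and the 16 x 16 block is just an example value, not a tuned choice.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's dim3 so the sketch builds without nvcc.
struct Dim2 { unsigned x, y; };

// Blocks needed in each dimension, rounded up.
// x covers the columns (fastest-varying index), y covers the rows.
Dim2 gridFor(unsigned cols, unsigned rows, Dim2 block)
{
    return { (cols + block.x - 1) / block.x,
             (rows + block.y - 1) / block.y };
}

// Row-major flat index, the same layout as C[i * 1024 + j] in the question.
unsigned flatIndex(unsigned row, unsigned col, unsigned cols)
{
    return row * cols + col;
}
```

For the question's matrix, gridFor(1024, 128, {16, 16}) gives a 64 x 8 grid. Inside a 2D kernel, the column would come from threadIdx.x + blockIdx.x * blockDim.x and the row from threadIdx.y + blockIdx.y * blockDim.y, with a row < rows && col < cols check before indexing with the row-major formula.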
