printf inside a C++ CUDA __global__ function

Note: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you reuse it, you must likewise follow CC BY-SA and attribute it to the original authors (not this site). Original question: http://stackoverflow.com/questions/2173771/


printf inside CUDA __global__ function

Tags: c++, c, cuda, gpu-programming

Asked by Jose Vega

I am currently writing a matrix multiplication kernel on a GPU and would like to debug my code, but since I cannot use printf inside a device function, is there something else I can do to see what is going on inside that function? This is my current function:

__global__ void MatrixMulKernel(Matrix Ad, Matrix Bd, Matrix Xd){

    int tx = threadIdx.x;
    int ty = threadIdx.y;

    int bx = blockIdx.x;
    int by = blockIdx.y;

    float sum = 0;

    for( int k = 0; k < Ad.width ; ++k){
        float Melement = Ad.elements[ty * Ad.width + k];
        float Nelement = Bd.elements[k * Bd.width + tx];
        sum += Melement * Nelement;
    }

    Xd.elements[ty * Xd.width + tx] = sum;
}

I would love to know if Ad and Bd are what I think they are, and to see whether the function is actually being called.

Accepted answer by Tom

EDIT

To avoid misleading people: as M. Tibbits points out, printf is available on any GPU of compute capability 2.0 and higher.

END OF EDIT

You have choices:

  • Use a GPU debugger, i.e. cuda-gdb on Linux or Nexus on Windows
  • Use cuprintf, which is available for registered developers (sign up here)
  • Manually copy the data that you want to see, then dump that buffer on the host after your kernel has completed (remember to synchronise)
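The third option can be sketched as follows: each thread writes the value it wants to inspect into a dedicated device buffer, which the host copies back and prints after the kernel completes. The kernel name, the buffer, and the computed value here are illustrative, not from the original code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread records the intermediate value it computed into a
// device-side debug buffer.
__global__ void DebugKernel(float *debug, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        debug[i] = 2.0f * i;  // stand-in for the value you want to inspect
}

int main() {
    const int n = 8;
    float h_debug[n];
    float *d_debug;
    cudaMalloc(&d_debug, n * sizeof(float));

    DebugKernel<<<1, n>>>(d_debug, n);
    cudaDeviceSynchronize();  // the "remember to synchronise" step

    cudaMemcpy(h_debug, d_debug, n * sizeof(float),
               cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        printf("debug[%d] = %f\n", i, h_debug[i]);

    cudaFree(d_debug);
    return 0;
}
```

This works on any compute capability, which made it the usual fallback before in-kernel printf existed.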

Regarding your code snippet:

  • Consider passing the Matrix structs in via pointer (i.e. cudaMemcpy them to the device, then pass in the device pointer). Right now you will have no problem, but if the function signature gets very large then you may hit the 256-byte limit
  • You have inefficient reads from Ad: you will have a 32-byte transaction to memory for each read into Melement - consider using shared memory as a staging area (c.f. the transposeNew sample in the SDK)
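The first suggestion can be sketched as below, assuming a Matrix struct shaped like the one implied by the question (the field names and the helper are guesses, not from the original code):

```cuda
#include <cuda_runtime.h>

// Assumed layout of the question's Matrix struct.
struct Matrix {
    int width;
    int height;
    float *elements;  // must point at device memory before copying over
};

// Kernel signature now takes device pointers instead of by-value
// structs, keeping the argument size small.
__global__ void MatrixMulKernel(const Matrix *Ad, const Matrix *Bd,
                                Matrix *Xd);

// Copy one Matrix descriptor to the device; `elements` should already
// have been allocated with cudaMalloc and filled with cudaMemcpy.
Matrix *ToDevice(const Matrix &h) {
    Matrix *d = nullptr;
    cudaMalloc(&d, sizeof(Matrix));
    cudaMemcpy(d, &h, sizeof(Matrix), cudaMemcpyHostToDevice);
    return d;
}
```

The kernel then receives three pointers rather than three whole structs, so the argument list stays well under the size limit no matter how large Matrix grows.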

Answered by M. Tibbits

CUDA now supports printf directly in the kernel. For a formal description, see Appendix B.16 of the CUDA C Programming Guide.


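On devices of compute capability 2.0 or later this looks just like host-side printf; a minimal sketch (compile with nvcc and an appropriate -arch flag):

```cuda
#include <cstdio>

// In-kernel printf: each thread prints its own block and thread index.
// Output is flushed to the host when it synchronises with the device.
__global__ void HelloKernel() {
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    HelloKernel<<<2, 4>>>();
    cudaDeviceSynchronize();  // required before the output appears
    return 0;
}
```

Note that the print order across threads is unspecified, and the device-side printf buffer has a fixed (configurable) size, so very chatty kernels can drop output.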
Answered by Juan Leni

By the way...

Answered by Andrei Pokrovsky

See the "Formatted output" section (currently B.17) of the CUDA C Programming Guide.

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
