printf inside a C++ CUDA __global__ function

Note: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you reuse it, you must likewise follow CC BY-SA and attribute it to the original authors (not this site). Original question: http://stackoverflow.com/questions/2173771/


printf inside CUDA __global__ function

Tags: c++, c, cuda, gpu-programming

Asked by Jose Vega

I am currently writing a matrix multiplication kernel on a GPU and would like to debug my code, but since I cannot use printf inside a device function, is there something else I can do to see what is going on inside that function? This is my current function:

__global__ void MatrixMulKernel(Matrix Ad, Matrix Bd, Matrix Xd){

    int tx = threadIdx.x;
    int ty = threadIdx.y;

    int bx = blockIdx.x;
    int by = blockIdx.y;

    float sum = 0;

    for( int k = 0; k < Ad.width ; ++k){
        float Melement = Ad.elements[ty * Ad.width + k];
        float Nelement = Bd.elements[k * Bd.width + tx];
        sum += Melement * Nelement;
    }

    Xd.elements[ty * Xd.width + tx] = sum;
}

I would love to know if Ad and Bd are what I think they are, and to see whether the function is actually being called.

Accepted answer by Tom

EDIT

To avoid misleading people: as M. Tibbits points out, printf is available on any GPU of compute capability 2.0 and higher.

END OF EDIT

You have choices:

  • Use a GPU debugger, i.e. cuda-gdb on Linux or Nexus on Windows
  • Use cuprintf, which is available for registered developers (sign up here)
  • Manually copy the data that you want to see, then dump that buffer on the host after your kernel has completed (remember to synchronise)
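The third option can be sketched as follows: each thread writes the value it wants to inspect into a dedicated device buffer, which the host copies back and prints after the kernel completes. The kernel name, the buffer, and the computed value here are illustrative, not from the original code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread records the intermediate value it computed into a
// device-side debug buffer.
__global__ void DebugKernel(float *debug, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        debug[i] = 2.0f * i;  // stand-in for the value you want to inspect
}

int main() {
    const int n = 8;
    float h_debug[n];
    float *d_debug;
    cudaMalloc(&d_debug, n * sizeof(float));

    DebugKernel<<<1, n>>>(d_debug, n);
    cudaDeviceSynchronize();  // the "remember to synchronise" step

    cudaMemcpy(h_debug, d_debug, n * sizeof(float),
               cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        printf("debug[%d] = %f\n", i, h_debug[i]);

    cudaFree(d_debug);
    return 0;
}
```

This works on any compute capability, which made it the usual fallback before in-kernel printf existed.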

Regarding your code snippet:

  • Consider passing the Matrix structs in via pointer (i.e. cudaMemcpy them to the device, then pass in the device pointer). Right now you will have no problem, but if the function signature gets very large then you may hit the 256-byte limit
  • You have inefficient reads from Ad: you will have a 32-byte transaction to memory for each read into Melement - consider using shared memory as a staging area (c.f. the transposeNew sample in the SDK)
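The first suggestion can be sketched as below, assuming a Matrix struct shaped like the one implied by the question (the field names and the helper are guesses, not from the original code):

```cuda
#include <cuda_runtime.h>

// Assumed layout of the question's Matrix struct.
struct Matrix {
    int width;
    int height;
    float *elements;  // must point at device memory before copying over
};

// Kernel signature now takes device pointers instead of by-value
// structs, keeping the argument size small.
__global__ void MatrixMulKernel(const Matrix *Ad, const Matrix *Bd,
                                Matrix *Xd);

// Copy one Matrix descriptor to the device; `elements` should already
// have been allocated with cudaMalloc and filled with cudaMemcpy.
Matrix *ToDevice(const Matrix &h) {
    Matrix *d = nullptr;
    cudaMalloc(&d, sizeof(Matrix));
    cudaMemcpy(d, &h, sizeof(Matrix), cudaMemcpyHostToDevice);
    return d;
}
```

The kernel then receives three pointers rather than three whole structs, so the argument list stays well under the size limit no matter how large Matrix grows.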

Answered by M. Tibbits

CUDA now supports printf directly in the kernel. For a formal description, see Appendix B.16 of the CUDA C Programming Guide.


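On devices of compute capability 2.0 or later this looks just like host-side printf; a minimal sketch (compile with nvcc and an appropriate -arch flag):

```cuda
#include <cstdio>

// In-kernel printf: each thread prints its own block and thread index.
// Output is flushed to the host when it synchronises with the device.
__global__ void HelloKernel() {
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    HelloKernel<<<2, 4>>>();
    cudaDeviceSynchronize();  // required before the output appears
    return 0;
}
```

Note that the print order across threads is unspecified, and the device-side printf buffer has a fixed (configurable) size, so very chatty kernels can drop output.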
Answered by Juan Leni

By the way...

Answered by Andrei Pokrovsky

See the "Formatted output" section (currently B.17) of the CUDA C Programming Guide.

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
