Note: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors. Original question: http://stackoverflow.com/questions/3677220/

Is it possible to put assembly instructions into CUDA code?

Tags: c, assembly, cuda, inline-assembly, ptx

Asked by superscalar

I want to use assembly code in CUDA C code to reduce expensive operations, as we do with asm in C programming.

Is it possible?

Accepted answer by Dr. Snoopy

No, you can't; CUDA has nothing like the asm constructs of C/C++. What you can do is tweak the generated PTX assembly and then use it with CUDA.

See this for an example.

But for GPUs, assembly-level optimizations are usually not necessary. You should do other optimizations first, such as improving memory coalescing and occupancy. See the CUDA Best Practices Guide for more information.

Answer by njuffa

Since CUDA 4.0, inline PTX is supported by the CUDA toolchain. There is a document in the toolkit that describes it: Using_Inline_PTX_Assembly_In_CUDA.pdf

Below is some code demonstrating the use of inline PTX in CUDA 4.0. Note that this code should not be used as a replacement for CUDA's built-in __clz() function; I merely wrote it to explore aspects of the new inline PTX capability.

__device__ __forceinline__ int my_clz (unsigned int x)
{
    int res;

    asm ("{\n"
         "        .reg .pred iszero, gezero;\n"
         "        .reg .u32 t1, t2;\n"
         "        mov.b32         t1, %1;\n"
         "        shr.u32         %0, t1, 16;\n"
         "        setp.eq.b32     iszero, %0, 0;\n"
         "        mov.b32         %0, 0;\n"
         "@iszero shl.b32         t1, t1, 16;\n"
         "@iszero or.b32          %0, %0, 16;\n"
         "        and.b32         t2, t1, 0xff000000;\n"
         "        setp.eq.b32     iszero, t2, 0;\n"
         "@iszero shl.b32         t1, t1, 8;\n"
         "@iszero or.b32          %0, %0, 8;\n"
         "        and.b32         t2, t1, 0xf0000000;\n"
         "        setp.eq.b32     iszero, t2, 0;\n"
         "@iszero shl.b32         t1, t1, 4;\n"
         "@iszero or.b32          %0, %0, 4;\n"
         "        and.b32         t2, t1, 0xc0000000;\n"
         "        setp.eq.b32     iszero, t2, 0;\n"
         "@iszero shl.b32         t1, t1, 2;\n"
         "@iszero or.b32          %0, %0, 2;\n"
         "        setp.ge.s32     gezero, t1, 0;\n"
         "        setp.eq.b32     iszero, t1, 0;\n"
         "@gezero or.b32          %0, %0, 1;\n"
         "@iszero add.u32         %0, %0, 1;\n\t"
         "}"
         : "=r"(res)
         : "r"(x));
    return res;
}
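
To make the PTX above easier to follow, here is a minimal host-side sketch of the same binary search for the leading zero count, written in plain C++ so it can be checked without a GPU. The names my_clz_reference and check_my_clz_reference are hypothetical and not part of the original answer; the sketch only mirrors the predicated steps of the asm block one by one and is not meant as a replacement for __clz().

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side model of the inline PTX in my_clz (an assumption for
// illustration, not code from the original answer). Each "if" corresponds to
// one setp + pair of predicated instructions in the asm block.
int my_clz_reference(std::uint32_t x)
{
    std::uint32_t t1 = x;   // mov.b32 t1, %1
    int res = 0;            // mov.b32 %0, 0

    // Upper 16 bits are zero: shift them out and count 16 leading zeros.
    if ((t1 >> 16) == 0)         { t1 <<= 16; res |= 16; }
    // Top 8, 4, then 2 bits are zero: same idea with smaller steps.
    if ((t1 & 0xff000000u) == 0) { t1 <<= 8;  res |= 8; }
    if ((t1 & 0xf0000000u) == 0) { t1 <<= 4;  res |= 4; }
    if ((t1 & 0xc0000000u) == 0) { t1 <<= 2;  res |= 2; }
    // setp.ge.s32 gezero, t1, 0: true when the sign (top) bit is clear.
    if ((t1 & 0x80000000u) == 0) { res |= 1; }
    // setp.eq.b32 iszero, t1, 0: only an all-zero input reaches 32.
    if (t1 == 0)                 { res += 1; }
    return res;
}

// Small self-check on hand-verified inputs.
void check_my_clz_reference()
{
    assert(my_clz_reference(0u) == 32);
    assert(my_clz_reference(1u) == 31);
    assert(my_clz_reference(0x0000ffffu) == 16);
    assert(my_clz_reference(0x00010000u) == 15);
    assert(my_clz_reference(0x80000000u) == 0);
}
```

On the device, the results of my_clz could be compared against __clz() over the same inputs; the host model is only a convenience for reasoning about the control flow.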