Is it possible to put assembly instructions into CUDA code?

Question

Asked by superscalar

I want to use assembly code in CUDA C code to speed up expensive operations, as we do with asm in C programming.


Is it possible?


Answer 1

Accepted answer by Dr. Snoopy

No, you can't; there is nothing like the asm construct from C/C++. What you can do is tweak the generated PTX assembly and then use it with CUDA.


See this for an example.
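A minimal sketch of the workflow described above, assuming the CUDA Driver API (cuda.h, link with -lcuda). In practice the PTX text would come from nvcc -ptx and then be edited by hand; here it is a small hand-written kernel. The kernel name add_one, the helper add_one_via_ptx, and the .version 6.0 / .target sm_50 header are illustrative assumptions, and error paths return early without cleanup to keep it short.

```cuda
#include <cuda.h>
#include <vector>

// Hypothetical hand-edited PTX: data[i] += 1.0f for i < n.
// (In practice: generate with `nvcc -ptx kernel.cu`, then edit.)
static const char *kPtx = R"(
.version 6.0
.target sm_50
.address_size 64

.visible .entry add_one(.param .u64 p_data, .param .u32 p_n)
{
    .reg .pred  p;
    .reg .b32   r<6>;
    .reg .b64   rd<4>;
    .reg .f32   f<2>;

    ld.param.u64        rd1, [p_data];
    ld.param.u32        r1, [p_n];
    cvta.to.global.u64  rd1, rd1;
    mov.u32             r2, %ctaid.x;
    mov.u32             r3, %ntid.x;
    mov.u32             r4, %tid.x;
    mad.lo.s32          r5, r2, r3, r4;      // global thread index i
    setp.ge.u32         p, r5, r1;           // i >= n ?
    @p bra              DONE;
    mul.wide.u32        rd2, r5, 4;          // byte offset of data[i]
    add.s64             rd3, rd1, rd2;
    ld.global.f32       f1, [rd3];
    add.f32             f1, f1, 0f3F800000;  // + 1.0f
    st.global.f32       [rd3], f1;
DONE:
    ret;
}
)";

// Return false as soon as any driver call fails (sketch: no cleanup on failure).
#define CK(call) do { if ((call) != CUDA_SUCCESS) return false; } while (0)

// Adds 1.0f to every element of v by JIT-loading the PTX above.
bool add_one_via_ptx(std::vector<float> &v) {
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;
    CUdeviceptr d_data;
    unsigned int n = static_cast<unsigned int>(v.size());
    size_t bytes = v.size() * sizeof(float);

    CK(cuInit(0));
    CK(cuDeviceGet(&dev, 0));
    CK(cuDevicePrimaryCtxRetain(&ctx, dev));
    CK(cuCtxSetCurrent(ctx));
    CK(cuModuleLoadData(&mod, kPtx));             // JIT-compiles the PTX for this GPU
    CK(cuModuleGetFunction(&fn, mod, "add_one"));
    CK(cuMemAlloc(&d_data, bytes));
    CK(cuMemcpyHtoD(d_data, v.data(), bytes));

    void *args[] = {&d_data, &n};                 // same order as the .param list
    unsigned int block = 128, grid = (n + block - 1) / block;
    CK(cuLaunchKernel(fn, grid, 1, 1, block, 1, 1, 0, nullptr, args, nullptr));
    CK(cuCtxSynchronize());
    CK(cuMemcpyDtoH(v.data(), d_data, bytes));

    CK(cuMemFree(d_data));
    CK(cuModuleUnload(mod));
    CK(cuDevicePrimaryCtxRelease(dev));
    return true;
}
```

cuModuleLoadData compiles the PTX for the installed GPU at load time, so the edited PTX only has to target an architecture the device supports (sm_50 or older here).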


But for GPUs, assembly-level optimizations are NOT necessary; you should do other optimizations first, such as improving memory coalescing and occupancy. See the CUDA Best Practices Guide for more information.
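A sketch of the two first-line optimizations named above, using hypothetical names (copy_coalesced, copy_strided, run_copies, suggested_block_size). The first kernel lets consecutive threads read consecutive addresses, the second spreads them stride elements apart, and cudaOccupancyMaxPotentialBlockSize asks the runtime for a block size that maximizes occupancy. Profiling the two kernels would show the bandwidth difference; the code itself only checks that both produce correct results.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Coalesced: thread i reads in[i], so a warp's 32 loads hit adjacent
// addresses and are served by a few wide memory transactions.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: thread i reads in[i * stride], so neighbouring threads touch
// addresses `stride` floats apart and each warp needs many more transactions.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(size_t)i * stride];
}

// Runs both kernels and checks their outputs; the input holds n * stride floats.
bool run_copies(int n, int stride) {
    std::vector<float> h_in((size_t)n * stride), h_out(n);
    for (size_t k = 0; k < h_in.size(); ++k) h_in[k] = (float)k;

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, h_in.size() * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(float), cudaMemcpyHostToDevice);

    int block = 256, grid = (n + block - 1) / block;
    bool ok = true;

    copy_coalesced<<<grid, block>>>(d_in, d_out, n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) ok = ok && h_out[i] == (float)i;

    copy_strided<<<grid, block>>>(d_in, d_out, n, stride);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) ok = ok && h_out[i] == (float)((size_t)i * stride);

    cudaFree(d_in);
    cudaFree(d_out);
    return ok && cudaGetLastError() == cudaSuccess;
}

// Asks the runtime for the block size that maximizes occupancy of copy_coalesced.
int suggested_block_size() {
    int min_grid = 0, block = 0;
    if (cudaOccupancyMaxPotentialBlockSize(&min_grid, &block, copy_coalesced, 0, 0) != cudaSuccess)
        return -1;
    return block;
}
```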


Answer 2

Answer by njuffa

Since CUDA 4.0, inline PTX is supported by the CUDA toolchain. There is a document in the toolkit that describes it: Using_Inline_PTX_Assembly_In_CUDA.pdf
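A minimal inline PTX sketch in the style of that document: %0, %1, ... are the numbered operands, the "r" constraint binds a 32-bit integer register, and %% escapes a literal % for special registers such as %laneid. The names add_ptx, lane_id, demo, and check_inline_ptx are hypothetical.

```cuda
#include <cuda_runtime.h>

// %0 is the output operand, %1 and %2 the inputs; "r" = 32-bit register.
__device__ __forceinline__ int add_ptx(int a, int b) {
    int r;
    asm("add.s32 %0, %1, %2;" : "=r"(r) : "r"(a), "r"(b));
    return r;
}

// "%%" emits a literal '%' so PTX sees the special register %laneid.
// volatile stops the compiler from caching or hoisting the read.
__device__ __forceinline__ unsigned lane_id() {
    unsigned id;
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(id));
    return id;
}

__global__ void demo(int *sums, unsigned *lanes) {
    int t = threadIdx.x;
    sums[t] = add_ptx(t, 100);
    lanes[t] = lane_id();
}

// Launches one 1-D block of 64 threads and checks both helpers on the host.
bool check_inline_ptx() {
    const int n = 64;
    int *d_sums;
    unsigned *d_lanes;
    int h_sums[n];
    unsigned h_lanes[n];
    cudaMalloc(&d_sums, n * sizeof(int));
    cudaMalloc(&d_lanes, n * sizeof(unsigned));
    demo<<<1, n>>>(d_sums, d_lanes);
    cudaMemcpy(h_sums, d_sums, sizeof h_sums, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_lanes, d_lanes, sizeof h_lanes, cudaMemcpyDeviceToHost);
    cudaFree(d_sums);
    cudaFree(d_lanes);
    bool ok = cudaGetLastError() == cudaSuccess;
    for (int i = 0; i < n; ++i)
        ok = ok && h_sums[i] == i + 100 && h_lanes[i] == (unsigned)(i % 32);
    return ok;
}
```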


Below is some code demonstrating use of inline PTX in CUDA 4.0. Note that this code should not be used as a replacement for CUDA's built-in __clz() function, I merely wrote it to explore aspects of the new inline PTX capability.


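// Count leading zeros of x by binary search: test the top 16, 8, 4, 2 and 1
// bits in turn, shifting t1 left and adding to the count whenever they are zero.
// Returns 32 for x == 0, like __clz().
__device__ __forceinline__ int my_clz (unsigned int x)
{
    int res;

    asm ("{\n"                                         // braces scope the .reg names
         "        .reg .pred iszero, gezero;\n"
         "        .reg .u32 t1, t2;\n"
         "        mov.b32         t1, %1;\n"           // t1 = x
         "        shr.u32         %0, t1, 16;\n"
         "        setp.eq.b32     iszero, %0, 0;\n"    // top 16 bits all zero?
         "        mov.b32         %0, 0;\n"
         "@iszero shl.b32         t1, t1, 16;\n"
         "@iszero or.b32          %0, %0, 16;\n"
         "        and.b32         t2, t1, 0xff000000;\n"
         "        setp.eq.b32     iszero, t2, 0;\n"    // top 8 bits all zero?
         "@iszero shl.b32         t1, t1, 8;\n"
         "@iszero or.b32          %0, %0, 8;\n"
         "        and.b32         t2, t1, 0xf0000000;\n"
         "        setp.eq.b32     iszero, t2, 0;\n"    // top 4 bits all zero?
         "@iszero shl.b32         t1, t1, 4;\n"
         "@iszero or.b32          %0, %0, 4;\n"
         "        and.b32         t2, t1, 0xc0000000;\n"
         "        setp.eq.b32     iszero, t2, 0;\n"    // top 2 bits all zero?
         "@iszero shl.b32         t1, t1, 2;\n"
         "@iszero or.b32          %0, %0, 2;\n"
         "        setp.ge.s32     gezero, t1, 0;\n"    // top bit clear: one more zero
         "        setp.eq.b32     iszero, t1, 0;\n"    // x == 0: count becomes 32
         "@gezero or.b32          %0, %0, 1;\n"
         "@iszero add.u32         %0, %0, 1;\n"
         "}"
         : "=r"(res)                                    // %0: result
         : "r"(x));                                     // %1: input
    return res;
}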
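For comparison, the PTX ISA also provides a native clz.b32 instruction, so the same result is available from a one-line inline PTX statement. This sketch, with hypothetical names clz_ptx, clz_both, and check_clz, checks it against the __clz() intrinsic on a few edge cases; my_clz above could be called from the same kernel to test it the same way.

```cuda
#include <cuda_runtime.h>

// One PTX instruction: clz.b32 returns 32 for an input of 0.
__device__ __forceinline__ int clz_ptx(unsigned int x) {
    int res;
    asm("clz.b32 %0, %1;" : "=r"(res) : "r"(x));
    return res;
}

// Each thread computes clz both ways for one input.
__global__ void clz_both(const unsigned int *in, int *ptx_out, int *ref_out, int n) {
    int i = threadIdx.x;
    if (i < n) {
        ptx_out[i] = clz_ptx(in[i]);
        ref_out[i] = __clz((int)in[i]);
    }
}

// Compares both results against hand-computed leading-zero counts.
bool check_clz() {
    const int n = 8;
    const unsigned int h_in[n] = {0u, 1u, 2u, 0xffu, 0x10000u,
                                  0x7fffffffu, 0x80000000u, 0xffffffffu};
    const int expected[n] = {32, 31, 30, 24, 15, 1, 0, 0};
    unsigned int *d_in;
    int *d_ptx, *d_ref;
    int h_ptx[n], h_ref[n];
    cudaMalloc(&d_in, sizeof h_in);
    cudaMalloc(&d_ptx, sizeof h_ptx);
    cudaMalloc(&d_ref, sizeof h_ref);
    cudaMemcpy(d_in, h_in, sizeof h_in, cudaMemcpyHostToDevice);
    clz_both<<<1, n>>>(d_in, d_ptx, d_ref, n);
    cudaMemcpy(h_ptx, d_ptx, sizeof h_ptx, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_ref, d_ref, sizeof h_ref, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_ptx);
    cudaFree(d_ref);
    bool ok = cudaGetLastError() == cudaSuccess;
    for (int i = 0; i < n; ++i)
        ok = ok && h_ptx[i] == expected[i] && h_ref[i] == expected[i];
    return ok;
}
```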
