C++: How to separate CUDA code into multiple files

Note: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not the translator). Original question: http://stackoverflow.com/questions/2090974/

Date: 2020-08-27 22:04:21 · Source: igfitidea

How to separate CUDA code into multiple files

Tags: c++, c, visual-studio-2008, cuda

Asked by Mr Bell

I am trying to separate a CUDA program into two separate .cu files in an effort to edge closer to writing a real app in C++. I have a simple little program that:

- Allocates memory on the host and the device.
- Initializes the host array to a series of numbers.
- Copies the host array to a device array.
- Finds the square of all the elements in the array using a device kernel.
- Copies the device array back to the host array.
- Prints the results.

This works great if I put it all in one .cu file and run it. When I split it into two separate files I start getting linking errors. Like all my recent questions, I know this is something small, but what is it?

KernelSupport.cu

#ifndef _KERNEL_SUPPORT_
#define _KERNEL_SUPPORT_

#include <iostream>
#include <MyKernel.cu>

int main( int argc, char** argv) 
{
    int* hostArray;
    int* deviceArray;
    const int arrayLength = 16;
    const unsigned int memSize = sizeof(int) * arrayLength;

    hostArray = (int*)malloc(memSize);
    cudaMalloc((void**) &deviceArray, memSize);

    std::cout << "Before device\n";
    for(int i=0;i<arrayLength;i++)
    {
        hostArray[i] = i+1;
        std::cout << hostArray[i] << "\n";
    }
    std::cout << "\n";

    cudaMemcpy(deviceArray, hostArray, memSize, cudaMemcpyHostToDevice);
    TestDevice <<< 4, 4 >>> (deviceArray);
    cudaMemcpy(hostArray, deviceArray, memSize, cudaMemcpyDeviceToHost);

    std::cout << "After device\n";
    for(int i=0;i<arrayLength;i++)
    {
        std::cout << hostArray[i] << "\n";
    }

    cudaFree(deviceArray);
    free(hostArray);

    std::cout << "Done\n";
}

#endif

MyKernel.cu

#ifndef _MY_KERNEL_
#define _MY_KERNEL_

__global__ void TestDevice(int *deviceArray)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    deviceArray[idx] = deviceArray[idx]*deviceArray[idx];
}


#endif

Build Log:

1>------ Build started: Project: CUDASandbox, Configuration: Debug x64 ------
1>Compiling with CUDA Build Rule...
1>"C:\CUDA\bin64\nvcc.exe"    -arch sm_10 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin"    -Xcompiler "/EHsc /W3 /nologo /O2 /Zi   /MT  "  -maxrregcount=32  --compile -o "x64\Debug\KernelSupport.cu.obj" "d:\Stuff\Programming\Visual Studio 2008\Projects\CUDASandbox\CUDASandbox\KernelSupport.cu" 
1>KernelSupport.cu
1>tmpxft_000016f4_00000000-3_KernelSupport.cudafe1.gpu
1>tmpxft_000016f4_00000000-8_KernelSupport.cudafe2.gpu
1>tmpxft_000016f4_00000000-3_KernelSupport.cudafe1.cpp
1>tmpxft_000016f4_00000000-12_KernelSupport.ii
1>Linking...
1>KernelSupport.cu.obj : error LNK2005: __device_stub__Z10TestDevicePi already defined in MyKernel.cu.obj
1>KernelSupport.cu.obj : error LNK2005: "void __cdecl TestDevice__entry(int *)" (?TestDevice__entry@@YAXPEAH@Z) already defined in MyKernel.cu.obj
1>D:\Stuff\Programming\Visual Studio 2008\Projects\CUDASandbox\x64\Debug\CUDASandbox.exe : fatal error LNK1169: one or more multiply defined symbols found
1>Build log was saved at "file://d:\Stuff\Programming\Visual Studio 2008\Projects\CUDASandbox\CUDASandbox\x64\Debug\BuildLog.htm"
1>CUDASandbox - 3 error(s), 0 warning(s)
========== Build: 0 succeeded, 1 failed, 0 up-to-date, 0 skipped ==========

I am running Visual Studio 2008 on Windows 7 64bit.



Edit:

I think I need to elaborate on this a little bit. The end result I am looking for here is to have a normal C++ application with something like Main.cpp containing int main() and have things run from there. At certain points in my .cpp code I want to be able to reference CUDA bits. So my thinking (and correct me if there is a more standard convention here) is that I will put the CUDA kernel code into their own .cu files, and then have a supporting .cu file that will take care of talking to the device, calling kernel functions, and so on.

Accepted answer by Scott Wales

You are including MyKernel.cu in KernelSupport.cu, so when you try to link, the compiler sees the contents of MyKernel.cu twice. You'll have to create a header declaring TestDevice and include that instead.

re comment:

Something like this should work

// MyKernel.h
#ifndef mykernel_h
#define mykernel_h
__global__ void TestDevice(int* devicearray);
#endif

and then change the including file to

//KernelSupport.cu
#ifndef _KERNEL_SUPPORT_
#define _KERNEL_SUPPORT_

#include <iostream>
#include <MyKernel.h>
// ...

re your edit

As long as the header you use in C++ code doesn't have any CUDA-specific stuff (__kernel__, __global__, etc.) you should be fine linking C++ and CUDA code.

Answered by tkerwin

If you look at the CUDA SDK code examples, they have extern "C" declarations for functions compiled from .cu files. This way, the .cu files are compiled by nvcc and only linked into the main program, while the .cpp files are compiled normally.

For example, marchingCubes_kernel.cu has the function body:

extern "C" void
launch_classifyVoxel( dim3 grid, dim3 threads, uint* voxelVerts, uint *voxelOccupied, uchar *volume,
                      uint3 gridSize, uint3 gridSizeShift, uint3 gridSizeMask, uint numVoxels,
                      float3 voxelSize, float isoValue)
{
    // calculate number of vertices need per voxel
    classifyVoxel<<<grid, threads>>>(voxelVerts, voxelOccupied, volume, 
                                     gridSize, gridSizeShift, gridSizeMask, 
                                     numVoxels, voxelSize, isoValue);
    cutilCheckMsg("classifyVoxel failed");
}

While marchingCubes.cpp (where main() resides) just has a declaration:

extern "C" void
launch_classifyVoxel( dim3 grid, dim3 threads, uint* voxelVerts, uint *voxelOccupied, uchar *volume,
                      uint3 gridSize, uint3 gridSizeShift, uint3 gridSizeMask, uint numVoxels,
                      float3 voxelSize, float isoValue);

You can put these in a .h file too.

Answered by Tom

Getting the separation is actually quite simple, please check out this answer for how to set it up. Then you simply put your host code in .cpp files and your device code in .cu files, and the build rules tell Visual Studio how to link them together into the final executable.

The immediate problem in your code is that you are defining the __global__ function TestDevice twice: once when you #include MyKernel.cu, and once when you compile MyKernel.cu independently.

You will need to put a wrapper into a .cu file too. At the moment you are calling TestDevice<<<>>> from your main function, but when you move this into a .cpp file it will be compiled with cl.exe, which doesn't understand the <<<>>> syntax. Therefore you would simply call TestDeviceWrapper(griddim, blockdim, params) in the .cpp file and provide this function in your .cu file.
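
A minimal sketch of such a wrapper (TestDeviceWrapper and its signature are this answer's suggested convention, not a CUDA API; error checking omitted):

```cuda
// MyKernel.cu - compiled by nvcc, so the <<< >>> launch syntax is understood here
__global__ void TestDevice(int* deviceArray)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    deviceArray[idx] = deviceArray[idx] * deviceArray[idx];
}

// Host-side wrapper with a plain function signature, callable from .cpp code
void TestDeviceWrapper(dim3 grid, dim3 block, int* deviceArray)
{
    TestDevice<<<grid, block>>>(deviceArray);
}
```

The header seen by the .cpp side then only declares void TestDeviceWrapper(dim3 grid, dim3 block, int* deviceArray); which cl.exe can compile, since dim3 comes from the ordinary CUDA runtime headers rather than from the nvcc-only language extensions.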

If you want an example, the SobolQRNG sample in the SDK achieves nice separation, although it still uses cutil and I would always recommend avoiding cutil.

Answered by thebaldwin

The simple solution is to turn off building of your MyKernel.cu file.

Properties -> General -> Excluded from build

The better solution, imo, is to split your kernel into a .cu and a .cuh file, and include that, for example:

//kernel.cu
#include "kernel.cuh"
#include <cuda_runtime.h>

__global__ void increment_by_one_kernel(int* vals) {
  vals[threadIdx.x] += 1;
}

void increment_by_one(int* a) {
  int* a_d;

  cudaMalloc(&a_d, sizeof(int));
  cudaMemcpy(a_d, a, sizeof(int), cudaMemcpyHostToDevice);
  increment_by_one_kernel<<<1, 1>>>(a_d);
  cudaMemcpy(a, a_d, sizeof(int), cudaMemcpyDeviceToHost);

  cudaFree(a_d);
}

//kernel.cuh
#pragma once

void increment_by_one(int* a);

//main.cpp
#include "kernel.cuh"

int main() {
  int a[] = {1};

  increment_by_one(a);

  return 0;
}