C - How to dynamically allocate arrays inside a kernel?

Note: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/13480213/


How to dynamically allocate arrays inside a kernel?

Tags: c, cuda, gpgpu

Asked by Granada

I need to dynamically allocate some arrays inside the kernel function. How can I do that?

My code is something like this:

__global__ void func(float *grid_d, int n, int nn){  
    int i, j;  
    float x[n], y[nn];  // does not compile: array sizes must be compile-time constants  
    //Do some really cool and heavy computations here that take hours.  
}

But that will not work. If this were host code I could use malloc. cudaMalloc needs a pointer on the host and another on the device, and inside the kernel function I don't have the host pointer.

So, what should I do?

If it takes a few seconds to allocate all the arrays (I need about 4 of size n and 5 of size nn), that won't be a problem, since the kernel will probably run for at least 20 minutes.

Answered by talonmies

Dynamic memory allocation is only supported on compute capability 2.x and newer hardware. You can use either the C++ new keyword or malloc in the kernel, so your example could become:


__global__ void func(float *grid_d, int n, int nn){  
    int i, j;  
    float *x = new float[n], *y = new float[nn];  
    // ... computations ...  
    delete[] x;  // free before the kernel ends if the memory is not needed again  
    delete[] y;  
}

This allocates memory on a local memory runtime heap which has the lifetime of the context, so make sure you free the memory after the kernel finishes running if your intention is not to use the memory again. You should also note that runtime heap memory cannot be accessed directly from the host APIs, so you cannot pass a pointer allocated inside a kernel as an argument to cudaMemcpy, for example.

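A minimal, hedged sketch of the above (the kernel, launch configuration, and sizes here are illustrative, not from the original question; `cudaDeviceSetLimit` with `cudaLimitMallocHeapSize` is the runtime call that enlarges the device heap):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: each thread allocates scratch space on the device
// runtime heap and frees it before returning.
__global__ void scratch_kernel(int n)
{
    float *x = new float[n];   // device-side allocation on the runtime heap
    if (x == nullptr) return;  // device-side new returns NULL on failure, it does not throw
    for (int i = 0; i < n; ++i)
        x[i] = 0.5f * i;
    delete[] x;                // free before the kernel ends
}

int main()
{
    // The default device heap is small (8 MB); enlarge it before the first
    // launch if many threads allocate. 128 MB here is an arbitrary choice.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128u * 1024u * 1024u);
    scratch_kernel<<<64, 256>>>(32);
    cudaDeviceSynchronize();
    return 0;
}
```

Note the NULL check: if the heap limit is hit, device-side new/malloc fails silently by returning a null pointer, so dereferencing without checking can crash the kernel.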

Answered by Roger Dahl

@talonmies answered your question on how to dynamically allocate memory within a kernel. This is intended as a supplemental answer, addressing the performance of __device__ malloc() and an alternative you might want to consider.

Allocating memory dynamically in the kernel can be tempting because it allows GPU code to look more like CPU code. But it can seriously affect performance. I wrote a self contained test and have included it below. The test launches some 2.6 million threads. Each thread populates 16 integers of global memory with some values derived from the thread index, then sums up the values and returns the sum.


The test implements two approaches. The first approach uses __device__ malloc() and the second uses memory that is allocated before the kernel runs.

On my compute capability 2.0 device, the kernel runs in 1500 ms when using __device__ malloc() and in 27 ms when using pre-allocated memory. In other words, the test takes 56x longer to run when memory is allocated dynamically within the kernel. The time includes the outer cudaMalloc()/cudaFree(), which is not part of the kernel. If the same kernel is launched many times with the same number of threads, as is often the case, the cost of the cudaMalloc()/cudaFree() is amortized over all the kernel launches. That brings the difference even higher, to around 60x.

Speculating, I think the performance hit is in part caused by implicit serialization. The GPU probably must serialize all simultaneous calls to __device__ malloc() in order to provide a separate chunk of memory to each caller.

The version that does not use __device__ malloc() allocates all the GPU memory before running the kernel. A pointer to the memory is passed to the kernel. Each thread calculates an index into the previously allocated memory instead of using __device__ malloc().

The potential issue with allocating memory up front is that, if only some threads need to allocate memory, and it is not known which threads those are, it will be necessary to allocate memory for all the threads. If there is not enough memory for that, it might be more efficient to reduce the number of threads per kernel call than to use __device__ malloc(). Other workarounds would probably end up reimplementing what __device__ malloc() is doing in the background, and would see a similar performance hit.

Test the performance of __device__ malloc():


#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

const int N_ITEMS(16);

#define USE_DYNAMIC_MALLOC

__global__ void test_malloc(int* totals)
{
  int tx(blockIdx.x * blockDim.x + threadIdx.x);

  int* s(new int[N_ITEMS]);

  for (int i(0); i < N_ITEMS; ++i) {
    s[i] = tx * i;
  }

  int total(0);
  for (int i(0); i < N_ITEMS; ++i) {
    total += s[i];
  }

  totals[tx] = total;

  delete[] s;
}

__global__ void test_malloc_2(int* items, int* totals)
{
  int tx(blockIdx.x * blockDim.x + threadIdx.x);

  int* s(items + tx * N_ITEMS);

  for (int i(0); i < N_ITEMS; ++i) {
    s[i] = tx * i;
  }

  int total(0);
  for (int i(0); i < N_ITEMS; ++i) {
    total += s[i];
  }

  totals[tx] = total;
}

int main()
{
  cudaError_t cuda_status;

  cudaSetDevice(0);

  int blocks_per_launch(1024 * 10);
  int threads_per_block(256);

  int threads_per_launch(blocks_per_launch * threads_per_block);

  int* totals_d;
  cudaMalloc((void**)&totals_d, threads_per_launch * sizeof(int));

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaDeviceSynchronize();
  cudaEventRecord(start, 0);

#ifdef USE_DYNAMIC_MALLOC
  cudaDeviceSetLimit(cudaLimitMallocHeapSize, threads_per_launch * N_ITEMS * sizeof(int));

  test_malloc<<<blocks_per_launch, threads_per_block>>>(totals_d);
#else
  int* items_d;
  cudaMalloc((void**)&items_d, threads_per_launch * sizeof(int) * N_ITEMS);

  test_malloc_2<<<blocks_per_launch, threads_per_block>>>(items_d, totals_d);

  cudaFree(items_d);
#endif

  cuda_status = cudaDeviceSynchronize();
  if (cuda_status != cudaSuccess) {
    printf("Error: %d\n", cuda_status);
    exit(1);
  }

  cudaEventRecord(stop, 0);
  cudaEventSynchronize(stop);
  float elapsedTime;
  cudaEventElapsedTime(&elapsedTime, start, stop);

  printf("Elapsed: %f\n", elapsedTime);

  int* totals_h(new int[threads_per_launch]);
  cuda_status = cudaMemcpy(totals_h, totals_d, threads_per_launch * sizeof(int), cudaMemcpyDeviceToHost);
  if (cuda_status != cudaSuccess) {
    printf("Error: %d\n", cuda_status);
    exit(1);
  }

  for (int i(0); i < 10; ++i) {
    printf("%d ", totals_h[i]);
  }
  printf("\n");

  cudaFree(totals_d);
  delete[] totals_h;

  return cuda_status;
}

Output:


C:\rd\projects\test_cuda_malloc\Release>test_cuda_malloc.exe
Elapsed: 27.311169
0 120 240 360 480 600 720 840 960 1080

C:\rd\projects\test_cuda_malloc\Release>test_cuda_malloc.exe
Elapsed: 1516.711914
0 120 240 360 480 600 720 840 960 1080

Answered by Hong Zhou

If the values of n and nn are known before the kernel is called, why not cudaMalloc the memory on the host side and pass the device memory pointers to the kernel?
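A minimal sketch of this suggestion, assuming n and nn are known on the host (the names, launch configuration, and sizes are illustrative; each thread indexes its own slice of the scratch buffers instead of calling malloc):

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: scratch arrays are pre-allocated on the host side,
// and each thread carves out its own slice by index.
__global__ void func(float *grid_d, float *x, float *y, int n, int nn)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float *my_x = x + (size_t)tid * n;   // this thread's n floats
    float *my_y = y + (size_t)tid * nn;  // this thread's nn floats
    // ... heavy computations using my_x and my_y ...
}

int main()
{
    const int n = 16, nn = 32;
    const int blocks = 64, threads = 256;
    const size_t total = (size_t)blocks * threads;

    float *grid_d, *x_d, *y_d;
    cudaMalloc(&grid_d, total * sizeof(float));
    cudaMalloc(&x_d, total * n * sizeof(float));   // one n-sized slice per thread
    cudaMalloc(&y_d, total * nn * sizeof(float));  // one nn-sized slice per thread

    func<<<blocks, threads>>>(grid_d, x_d, y_d, n, nn);
    cudaDeviceSynchronize();

    cudaFree(y_d);
    cudaFree(x_d);
    cudaFree(grid_d);
    return 0;
}
```

The design trade-off is the one Roger Dahl's answer describes: memory is reserved for every thread whether it needs it or not, but the kernel avoids the serialization cost of device-side allocation.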

Answered by ragerdl

I ran an experiment based on the concepts in @rogerdahl's post. Assumptions:

  • 4MB of memory allocated in 64B chunks.
  • 1 GPU block and 32 warp threads in that block
  • Run on a P100

The malloc+free calls local to the GPU seemed to be much faster than the cudaMalloc+cudaFree calls. The program's output:

Starting timer for cuda malloc timer
Stopping timer for cuda malloc timer
         timer for cuda malloc timer took 1.169631s
Starting timer for device malloc timer
Stopping timer for device malloc timer
         timer for device malloc timer took 0.029794s

I'm leaving out the code for timer.h and timer.cpp, but here's the code for the test itself:

#include "cuda_runtime.h"
#include <stdio.h>
#include <iostream>  // needed for std::cerr in CheckCudaErrorAux
#include <thrust/system/cuda/error.h>

#include "timer.h"

static void CheckCudaErrorAux (const char *, unsigned, const char *, cudaError_t);
#define CUDA_CHECK_RETURN(value) CheckCudaErrorAux(__FILE__,__LINE__, #value, value)

const int BLOCK_COUNT = 1;
const int THREADS_PER_BLOCK = 32;
const int ITERATIONS = 1 << 12;
const int ITERATIONS_PER_BLOCKTHREAD = ITERATIONS / (BLOCK_COUNT * THREADS_PER_BLOCK);

const int ARRAY_SIZE = 64;


void CheckCudaErrorAux (const char *file, unsigned line, const char *statement, cudaError_t err) {
    if (err == cudaSuccess)
        return;
    std::cerr << statement<<" returned " << cudaGetErrorString(err) << "("<<err<< ") at "<<file<<":"<<line << std::endl;
    exit (1);
}

__global__ void mallocai() {
    for (int i = 0; i < ITERATIONS_PER_BLOCKTHREAD; ++i) {
        int * foo;
        foo = (int *) malloc(sizeof(int) * ARRAY_SIZE);
        free(foo);
    }
}

int main() {

    Timer cuda_malloc_timer("cuda malloc timer");

    for (int i = 0; i < ITERATIONS; ++ i) {
        if (i == 1) cuda_malloc_timer.start(); // let it warm up one cycle
        int * foo;
        cudaMalloc(&foo, sizeof(int) * ARRAY_SIZE);
        cudaFree(foo);
    }
    cuda_malloc_timer.stop_and_report();
    CUDA_CHECK_RETURN(cudaDeviceSynchronize());

    Timer device_malloc_timer("device malloc timer");
    device_malloc_timer.start();
    mallocai<<<BLOCK_COUNT, THREADS_PER_BLOCK>>>();
    CUDA_CHECK_RETURN(cudaDeviceSynchronize());
    device_malloc_timer.stop_and_report();
}

If you find mistakes, please let me know in the comments, and I'll try to fix them.

And I ran them again with larger everything:


const int BLOCK_COUNT = 56;
const int THREADS_PER_BLOCK = 1024;
const int ITERATIONS = 1 << 18;
const int ITERATIONS_PER_BLOCKTHREAD = ITERATIONS / (BLOCK_COUNT * THREADS_PER_BLOCK);

const int ARRAY_SIZE = 1024;

And cudaMalloc was still slower by a lot:


Starting timer for cuda malloc timer
Stopping timer for cuda malloc timer
         timer for cuda malloc timer took 74.878016s
Starting timer for device malloc timer
Stopping timer for device malloc timer
         timer for device malloc timer took 0.167331s

Answered by Tyrandro

Maybe you should test


cudaMalloc(&foo, sizeof(int) * ARRAY_SIZE * ITERATIONS);
cudaFree(foo);

instead of

for (int i = 0; i < ITERATIONS; ++ i) {
    if (i == 1) cuda_malloc_timer.start(); // let it warm up one cycle
    int * foo;
    cudaMalloc(&foo, sizeof(int) * ARRAY_SIZE);
    cudaFree(foo);
}