C++ Can/Should I run this code on a GPU?

Warning: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/13301309/


Can/Should I run this code on a GPU?

c++ c cuda parallel-processing gpu

Asked by Mike

I'm working on a statistical application containing approximately 10 to 30 million floating point values in a single array.

Several methods perform different, but independent, calculations on the array in nested loops, for example:

Dictionary<float, int> noOfNumbers = new Dictionary<float, int>();

for (float x = 0f; x < 100f; x += 0.0001f) {
    int noOfOccurrences = 0;

    foreach (float y in largeFloatingPointArray) {
        if (x == y) {
            noOfOccurrences++;
        }
    }

    noOfNumbers.Add(x, noOfOccurrences);
}

The current application is written in C#, runs on an Intel CPU and needs several hours to complete. I have no knowledge of GPU programming concepts and APIs, so my questions are:

  • Is it possible (and does it make sense) to utilize a GPU to speed up such calculations?
  • If yes: does anyone know of any tutorial or have any sample code (programming language doesn't matter)?

Any help would be highly appreciated.

Answered by dreamcrash

UPDATE: GPU Version

__global__ void hash (float *largeFloatingPointArray,int largeFloatingPointArraySize, int *dictionary, int size, int num_blocks)
{
    int x = (threadIdx.x + blockIdx.x * blockDim.x); // Each thread of each block will
    float y;                                         // compute one (or more) floats
    int noOfOccurrences = 0;
    int a;

    while( x < size )            // While there is work to do, each thread will:
    {
        dictionary[x] = 0;       // Initialize the dictionary position it will work on
        noOfOccurrences = 0;

        for(int j = 0; j < largeFloatingPointArraySize; j++) // Search for floats
        {                                                     // that are equal to
                                                              // the assigned float
           y = largeFloatingPointArray[j];  // Take a candidate from the floats array
           y *= 10000;                      // e.g. if y = 0.0001f,
           a = y + 0.5;                     // then a = 1 (1.0 + 0.5 truncated to int)
           if (a == x) noOfOccurrences++;
        }

        dictionary[x] += noOfOccurrences; // Update the dictionary with the number
                                          // of times that the float appears

        x += blockDim.x * gridDim.x;  // Advance to the next position this thread will work on
    }
}
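
A minimal host-side sketch of how this kernel might be launched; the buffer names, sizes, and launch configuration below are illustrative assumptions, not a tuned setup:

#include <cstdlib>

int main()
{
    const int largeArraySize = 30000000;  // number of input floats (illustrative)
    const int dictSize = 1000000;         // one slot per 0.0001f step in [0, 100)
    const int num_blocks = 256;           // illustrative launch configuration

    float *h_data = (float *)malloc(largeArraySize * sizeof(float));
    int   *h_dict = (int *)malloc(dictSize * sizeof(int));
    // ... fill h_data with the 30 million floats ...

    float *d_data; int *d_dict;
    cudaMalloc(&d_data, largeArraySize * sizeof(float));
    cudaMalloc(&d_dict, dictSize * sizeof(int));
    cudaMemcpy(d_data, h_data, largeArraySize * sizeof(float), cudaMemcpyHostToDevice);

    // The kernel zeroes each dictionary slot itself, so no memset is needed
    hash<<<num_blocks, 256>>>(d_data, largeArraySize, d_dict, dictSize, num_blocks);

    cudaMemcpy(h_dict, d_dict, dictSize * sizeof(int), cudaMemcpyDeviceToHost);
    // h_dict[i] now holds the count of the float i * 0.0001f

    cudaFree(d_data); cudaFree(d_dict);
    free(h_data); free(h_dict);
    return 0;
}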

I have only tested this for smaller inputs, because I am testing on my laptop. Nevertheless, it did work. However, further testing is necessary.

UPDATE: Sequential Version

I just did this naive version that performs your algorithm for 30,000,000 elements in less than 20 seconds (including the function that generates the data).

Basically, it sorts your array of floats. Then it walks over the sorted array, counting the number of times a value appears consecutively, and puts each value in a dictionary along with the number of times it appears.

You can use a sorted map instead of the unordered_map that I used.
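
For instance (a sketch; this would replace the Mymap typedef in the code below and needs the <map> header):

#include <map>

// std::map keeps the keys in sorted order when iterating,
// at the cost of O(log n) operations instead of O(1) on average.
typedef std::map<float, int> Mymap;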

Here's the code:

#include <stdio.h>
#include <stdlib.h>
#include <algorithm>
#include <string>
#include <iostream>
#include <tr1/unordered_map>


typedef std::tr1::unordered_map<float, int> Mymap;


// Fill the array with random floats in the range [LO, HI)
void generator(float *data, long int size)
{
    float LO = 0.0;
    float HI = 100.0;

    for(long int i = 0; i < size; i++)
        data[i] = LO + (float)rand()/((float)RAND_MAX/(HI-LO));
}

void print_array(float *data, long int size)
{
    for(long int i = 0; i < size; i++)
        printf("%f\n", data[i]);
}

// Walk over the sorted array, counting consecutive occurrences of each value
Mymap fill_dict(float *data, long int size)
{
    float previous = data[0];
    int count = 1;
    Mymap dict;

    for(long int i = 1; i < size; i++)
    {
        if(previous == data[i])
            count++;
        else
        {
            dict.insert(Mymap::value_type(previous, count));
            previous = data[i];
            count = 1;
        }
    }
    dict.insert(Mymap::value_type(previous, count)); // add the last value
    return dict;
}

void printMAP(const Mymap &dict)
{
    for(Mymap::const_iterator i = dict.begin(); i != dict.end(); i++)
    {
        std::cout << "key(float): " << i->first << ", value(int): " << i->second << std::endl;
    }
}


int main(int argc, char** argv)
{
    long int size = 1000000;
    if(argc > 1) size = atol(argv[1]);
    printf("Size = %ld\n", size);

    // Allocate on the heap: tens of millions of floats would overflow the stack
    float *data = (float *)malloc(size * sizeof(float));

    Mymap dict;

    generator(data, size);

    std::sort(data, data + size);
    dict = fill_dict(data, size);

    free(data);
    return 0;
}

If you have the Thrust library installed on your machine, you should use this:

#include <thrust/sort.h>
thrust::sort(data, data + size);

instead of this

sort(data, data + size);

For sure it will be faster.
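
If you also want the sort itself to run on the GPU rather than on the host, one option is to move the data into a thrust::device_vector first. A minimal sketch, assuming a working CUDA/Thrust setup and the same data and size as in the code above:

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>

// Copy the data to the device, sort it there, and copy it back
thrust::device_vector<float> d_data(data, data + size);
thrust::sort(d_data.begin(), d_data.end());
thrust::copy(d_data.begin(), d_data.end(), data);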

Original Post

"I'm working on a statistical application which has a large array containin 10 - 30 millions of floating point values".

“我正在开发一个统计应用程序,它有一个包含 10 - 30 百万个浮点值的大数组”。

"Is it possible (and does it make sense) to utilize a GPU to speed up such calculations?"

“是否有可能(并且有意义)利用 GPU 来加速此类计算?”

Yes, it is. A month ago, I put a Molecular Dynamics simulation entirely on the GPU. One of the kernels, which calculates the force between pairs of particles, receives 6 arrays, each with 500,000 doubles, for a total of 3 million doubles (22 MB).

So you are planning to put 30 million floats on the GPU; that is about 114 MB of global memory (30,000,000 × 4 bytes ≈ 114 MB), so this is not a problem. Even my laptop has 250 MB.

Can the number of calculations be an issue in your case? Based on my experience with Molecular Dynamics (MD), I'd say no. The sequential MD version takes about 25 hours to complete, while the GPU version takes 45 minutes. You said your application takes a couple of hours, and based on your code example, it looks lighter than the MD simulation.

Here's the force calculation example:

__global__ void add(double *fx, double *fy, double *fz,
                    double *x, double *y, double *z,...){

     int pos = (threadIdx.x + blockIdx.x * blockDim.x);

     ...

     while(pos < particles)
     {
         for (i = 0; i < particles; i++)
         {
             if( /* inside of the same radius */ )
             {
                 // calculate force
             }
         }
         pos += blockDim.x * gridDim.x;
     }
}

A simple example of CUDA code could be the sum of two arrays:

In C:

for(int i = 0; i < N; i++)
    c[i] = a[i] + b[i]; 

In CUDA:

__global__ void add(int *c, int *a, int *b, int N)
{
    int pos = threadIdx.x + blockIdx.x * blockDim.x;
    for (; pos < N; pos += blockDim.x * gridDim.x)
        c[pos] = a[pos] + b[pos];
}
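
To launch it, you would pick a block and grid size; the numbers below are illustrative, and d_a, d_b, d_c are assumed to be device pointers already allocated with cudaMalloc:

int threadsPerBlock = 256;
int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock; // enough threads to cover N
add<<<numBlocks, threadsPerBlock>>>(d_c, d_a, d_b, N);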

In CUDA, you basically take each iteration of the for loop and divide it among the threads:

1) threadIdx.x + blockIdx.x*blockDim.x;

Each block has an id from 0 to N-1 (N being the maximum number of blocks), and each block has X threads with ids from 0 to X-1.

Expression 1) gives you the iteration that each thread will compute, based on its id and the id of the block the thread is in; blockDim.x is the number of threads per block.

So if you have 2 blocks, each with 10 threads, and N = 40, then:

Thread 0 Block 0 will execute pos 0
Thread 1 Block 0 will execute pos 1
...
Thread 9 Block 0 will execute pos 9
Thread 0 Block 1 will execute pos 10
....
Thread 9 Block 1 will execute pos 19
Thread 0 Block 0 will execute pos 20
...
Thread 0 Block 1 will execute pos 30
Thread 9 Block 1 will execute pos 39

Looking at your code, I made this draft of what it could look like in CUDA:

__global__ void hash (float *largeFloatingPointArray, int largeFloatingPointArraySize, int *dictionary)
{
    // You can turn the dictionary into one array of ints:
    // each position will represent one float.
    // Since x = 0f; x < 100f; x += 0.0001f,
    // you can associate each x with a different position
    // in the dictionary:

    // pos 0 has the same meaning as 0f;
    // pos 1 means float 0.0001f;
    // pos 2 means float 0.0002f, etc.
    // Then you use the int at each position
    // to count how many times that "float" has appeared.

    int x = blockIdx.x;  // Each block will take a different x to work on
    float y;

    while( x < 1000000 ) // x < 100f (with an incremental step of 0.0001f)
    {
        int noOfOccurrences = 0;
        float z = converting_int_to_float(x); // Placeholder: converts the index x to
                                              // the float it represents (x * 0.0001f)

        // Each thread of each block
        // takes ys from the largeFloatingPointArray
        for(int j = threadIdx.x; j < largeFloatingPointArraySize; j += blockDim.x)
        {
            y = largeFloatingPointArray[j];
            if (z == y)
            {
                noOfOccurrences++;
            }
        }
        // Every thread adds its own partial count to the dictionary
        atomicAdd(&dictionary[x], noOfOccurrences);
        __syncthreads();

        x += gridDim.x;  // Advance to the next x this block will work on
    }
}

You have to use atomicAdd because different threads, from the same block and from different blocks, may update the same dictionary entry at the same time, so you have to ensure mutual exclusion.

This is only one approach; you could even give the iterations of the outer loop to the threads instead of the blocks, as sketched below.
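
A sketch of that thread-based variant (essentially what the UPDATE GPU version at the top of this answer does): each thread grid-strides over the x values, so every dictionary slot is owned by exactly one thread and no atomics are needed:

__global__ void hash_by_thread(float *largeFloatingPointArray,
                               int largeFloatingPointArraySize,
                               int *dictionary, int size)
{
    // Each thread owns a distinct set of x values, so it can
    // write dictionary[x] without any synchronization
    for (int x = threadIdx.x + blockIdx.x * blockDim.x;
         x < size;
         x += blockDim.x * gridDim.x)
    {
        int noOfOccurrences = 0;
        for (int j = 0; j < largeFloatingPointArraySize; j++)
        {
            int a = (int)(largeFloatingPointArray[j] * 10000 + 0.5f);
            if (a == x) noOfOccurrences++;
        }
        dictionary[x] = noOfOccurrences;
    }
}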

Tutorials

The Dr. Dobb's Journal series CUDA: Supercomputing for the masses by Rob Farber is excellent and covers just about everything in its fourteen installments. It also starts rather gently and is therefore fairly beginner-friendly.

and others:

Take a look at the last item; you will find many links for learning CUDA.

OpenCL: OpenCL Tutorials | MacResearch

Answered by AlliedEnvy

I don't know much of anything about parallel processing or GPGPU, but for this specific example, you could save a lot of time by making a single pass over the input array rather than looping over it a million times. With large data sets you will usually want to do things in a single pass if possible. Even if you're doing multiple independent computations, if it's over the same data set you might get better speed doing them all in the same pass, as you'll get better locality of reference that way. But it may not be worth it for the increased complexity in your code.

In addition, you really don't want to repeatedly add a small amount to a floating point number like that; the rounding error will add up and you won't get what you intended. I've added an if statement to my sample below to check whether inputs match your pattern of iteration, but omit it if you don't actually need that.
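
As a quick illustration of that drift (a standalone snippet, not part of this answer's original code), repeatedly adding 0.0001f does not land exactly where you would expect:

#include <cstdio>

int main()
{
    float x = 0.0f;
    for (int i = 0; i < 1000000; i++)
        x += 0.0001f;    // mathematically this should reach exactly 100
    printf("%f\n", x);   // prints roughly 100.96 on typical IEEE-754 hardware,
                         // not 100.000000
    return 0;
}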

I don't know any C#, but a single-pass implementation of your sample would look something like this:

Dictionary<float, int> noOfNumbers = new Dictionary<float, int>();

foreach (float x in largeFloatingPointArray)
{
    if (Math.Truncate(x / 0.0001f) * 0.0001f == x)
    {
        if (noOfNumbers.ContainsKey(x))
            noOfNumbers[x]++;        // Add would throw on an existing key
        else
            noOfNumbers.Add(x, 1);
    }
}

Hope this helps.

Answered by Pragmateek

Is it possible (and does it make sense) to utilize a GPU to speed up such calculations?

  • Definitely YES, this kind of algorithm is typically an ideal candidate for massive data-parallelism processing, the thing GPUs are so good at.

If yes: does anyone know of any tutorial or have any sample code (programming language doesn't matter)?

  • When you want to go the GPGPU way, you have two alternatives: CUDA or OpenCL.

    CUDA is mature, with a lot of tools, but is centered on NVidia GPUs.

    OpenCL is a standard that runs on NVidia and AMD GPUs, and on CPUs too. So you should really favour it.

  • For tutorials, you have an excellent series on CodeProject by Rob Farber: http://www.codeproject.com/Articles/Rob-Farber#Articles

  • For your specific use-case, there are a lot of samples of histogram building with OpenCL (note that many are image histograms, but the principles are the same).

  • As you use C#, you can use bindings like OpenCL.Net or Cloo.

  • If your array is too big to be stored in the GPU memory, you can block-partition it and rerun your OpenCL kernel for each part easily.

Answered by Eli Algranti

In addition to the suggestion by the above poster, use the TPL (Task Parallel Library) when appropriate to run in parallel on multiple cores.

The example above could use Parallel.ForEach and ConcurrentDictionary, but a more complex map-reduce setup, where the array is split into chunks that each generate a dictionary which would then be reduced to a single dictionary, would give you better results; see the sketch below.
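
A sketch of that chunk-and-merge idea, written here in C++ with std::thread rather than in C# with the TPL (the function name and chunking scheme are illustrative):

#include <algorithm>
#include <thread>
#include <unordered_map>
#include <vector>

// Each worker builds a private histogram over its chunk ("map"),
// then the partial histograms are merged into one ("reduce").
std::unordered_map<float, int> countOccurrences(const std::vector<float> &data,
                                                unsigned numThreads)
{
    std::vector<std::unordered_map<float, int> > partial(numThreads);
    std::vector<std::thread> workers;
    size_t chunk = data.size() / numThreads + 1;

    for (unsigned t = 0; t < numThreads; t++)
        workers.push_back(std::thread([&, t]() {
            size_t begin = t * chunk;
            size_t end = std::min(begin + chunk, data.size());
            for (size_t i = begin; i < end; i++)
                partial[t][data[i]]++;   // private dictionary, no locking needed
        }));

    for (auto &w : workers) w.join();

    std::unordered_map<float, int> result;
    for (auto &p : partial)              // reduce: merge the partial dictionaries
        for (auto &kv : p)
            result[kv.first] += kv.second;
    return result;
}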

I don't know whether all your computations map correctly to the GPU capabilities, but you'll have to use a map-reduce algorithm anyway to map the calculations to the GPU cores and then reduce the partial results to a single result, so you might as well do that on the CPU before moving on to a less familiar platform.

Answered by Arun Taylor

I am not sure whether using GPUs would be a good match, given that 'largeFloatingPointArray' values need to be retrieved from memory. My understanding is that GPUs are better suited for self-contained calculations.

I think turning this single-process application into a distributed application running on many systems and tweaking the algorithm should speed things up considerably, depending on how many systems are available.

You can use the classic 'divide and conquer' approach. The general approach I would take is as follows.

Use one system to preprocess 'largeFloatingPointArray' into a hash table or a database. This would be done in a single pass. It would use the floating point value as the key, and the number of occurrences in the array as the value. The worst case scenario is that each value only occurs once, but that is unlikely. If largeFloatingPointArray keeps changing each time the application is run, then an in-memory hash table makes sense. If it is static, then the table could be saved in a key-value database such as Berkeley DB. Let's call this a 'lookup' system.

On another system, let's call it 'main', create chunks of work, 'scatter' the work items across the N systems, and 'gather' the results as they become available. E.g. a work item could be as simple as two numbers indicating the range that a system should work on. When a system completes the work, it sends back an array of occurrences and is ready to work on another chunk of work.

The performance is improved because we do not keep iterating over largeFloatingPointArray. If the lookup system becomes a bottleneck, it could be replicated on as many systems as needed.

With a large enough number of systems working in parallel, it should be possible to reduce the processing time down to minutes.

I am working on a compiler for parallel programming in C targeted at many-core based systems, often referred to as microservers, that are or will be built using multiple 'system-on-a-chip' modules within a system. ARM module vendors include Calxeda, AMD, AMCC, etc. Intel will probably also have a similar offering.

I have a version of the compiler working which could be used for such an application. The compiler, based on C function prototypes, generates C networking code that implements inter-process communication (IPC) across systems. One of the IPC mechanisms available is socket/tcp/ip.

If you need help in implementing a distributed solution, I'd be happy to discuss it with you.

Added Nov 16, 2012.

I thought a little bit more about the algorithm and I think this should do it in a single pass. It's written in C and it should be very fast compared with what you have.

/*
 * Convert the X range from 0f to 100f in steps of 0.0001f
 * into a range of integers 0 to (100 * 10000) to use as an
 * index into an array.
 */

#include <stdio.h>
#include <stdlib.h>

#define X_MAX           (1 + (100 * 10000))

/*
 * Number of floats in largeFloatingPointArray needs to be defined
 * below to be whatever your value is.
 */

#define LARGE_ARRAY_MAX (1000)

int main()
{
    int j, y, *noOfOccurances;
    float *largeFloatingPointArray;

    /*
     * Allocate memory for largeFloatingPointArray and populate it.
     */

    largeFloatingPointArray = (float *)malloc(LARGE_ARRAY_MAX * sizeof(float));
    if (largeFloatingPointArray == 0) {
        printf("out of memory\n");
        exit(1);
    }

    /*
     * Allocate memory to hold noOfOccurances. The index/10000 is
     * the floating point number.  The content is the count.
     *
     * E.g. noOfOccurances[12345] = 20 means 1.2345f occurs 20 times
     * in largeFloatingPointArray.
     */

    noOfOccurances = (int *)calloc(X_MAX, sizeof(int));
    if (noOfOccurances == 0) {
        printf("out of memory\n");
        exit(1);
    }

    for (j = 0; j < LARGE_ARRAY_MAX; j++) {
        y = (int)(largeFloatingPointArray[j] * 10000);
        if (y >= 0 && y < X_MAX) {   /* y == X_MAX would be out of bounds */
            noOfOccurances[y]++;
        }
    }

    return 0;
}