High level GPU programming in C++

Note: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/16438099/

Tags: c++, cuda, gpu, gpu-programming

Asked by goocreations

I've been looking into libraries/extensions for C++ that will allow GPU-based processing on a high level. I'm not an expert in GPU programming and I don't want to dig too deep. I have a neural network consisting of classes with virtual functions. I need a library that basically does the GPU allocation for me - on a high level. There is a guy who wrote a thesis on a system called GPU++ which does most of the GPU stuff for you. I can't find the code anywhere, just his thesis.

Does anyone know of a similar library, or does anyone have the code for GPU++? Libraries like CUDA are too low level and can't handle most of my operations (at least not without rewriting all my processes and algorithms - which I don't want to do).

Accepted answer by Ashwin Nanjappa

The Thrust library provides containers, parallel primitives and algorithms. All of this functionality is nicely wrapped up in an STL-like syntax. So, if you are familiar with the STL, you can actually write entire CUDA programs using just Thrust, without having to write a single CUDA kernel. Have a look at the simple examples in the Quick Start Guide to see the kind of high-level programs you can write using Thrust.
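
For example, a minimal illustrative sketch along these lines (assuming CUDA and Thrust are installed; this example is not from the original answer) sorts and sums a vector on the GPU without a single hand-written kernel:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <cstdlib>
#include <iostream>

int main()
{
    // fill a host vector with pseudo-random values
    thrust::host_vector<int> h(1000);
    for (std::size_t i = 0; i < h.size(); ++i) h[i] = std::rand() % 100;

    thrust::device_vector<int> d = h;                // copy to the GPU
    thrust::sort(d.begin(), d.end());                // parallel sort on the device
    int sum = thrust::reduce(d.begin(), d.end(), 0); // parallel reduction

    std::cout << "sum = " << sum << std::endl;
    return 0;
}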

Answer by BenC

There are many high-level libraries dedicated to GPGPU programming. Since they rely on CUDA and/or OpenCL, they have to be chosen wisely (a CUDA-based program will not run on AMD's GPUs, unless it goes through a pre-processing step with projects such as gpuocelot).

CUDA

You can find some examples of CUDA libraries on the NVIDIA website.

  • Thrust: the official description speaks for itself

Thrust is a parallel algorithms library which resembles the C++ Standard Template Library (STL). Thrust's high-level interface greatly enhances programmer productivity while enabling performance portability between GPUs and multicore CPUs. Interoperability with established technologies (such as CUDA, TBB, and OpenMP) facilitates integration with existing software.

As @Ashwin pointed out, the STL-like syntax of Thrust makes it a widely chosen library when developing CUDA programs. A quick look at the examples shows the kind of code you will be writing if you decide to use this library. NVIDIA's website presents the key features of this library. A video presentation (from GTC 2012) is also available.

  • CUB: the official description tells us:

CUB provides state-of-the-art, reusable software components for every layer of the CUDA programming model. It is a flexible library of cooperative threadblock primitives and other utilities for CUDA kernel programming.

It provides device-wide, block-wide and warp-wide parallel primitives such as parallel sort, prefix scan, reduction, and histogram.

It is open-source and available on GitHub. It is not high-level from an implementation point of view (you develop in CUDA kernels), but provides high-level algorithms and routines.
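
To give a flavor of those device-wide primitives, here is a minimal illustrative sketch (not from the original answer) of a device-wide sum, assuming CUDA and CUB are available. CUB's convention is to call each routine twice: first with a null workspace pointer to query the temporary-storage size, then again to run the reduction:

#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const int n = 1000;
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemset(d_in, 0, n * sizeof(int)); // placeholder input data

    // first call: query how much temporary storage the reduction needs
    void*  d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n);
    cudaMalloc(&d_temp, temp_bytes);

    // second call: run the device-wide reduction
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n);

    int h_out = 0;
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    std::printf("sum = %d\n", h_out);

    cudaFree(d_in); cudaFree(d_out); cudaFree(d_temp);
    return 0;
}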

  • mshadow: lightweight CPU/GPU matrix/tensor template library in C++/CUDA.

This library is mostly used for machine learning, and relies on expression templates.

  • Eigen: starting from Eigen 3.3, it is now possible to use Eigen's objects and algorithms within CUDA kernels. However, only a subset of features is supported, to make sure that no dynamic allocation is triggered within a CUDA kernel.
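
As an illustrative sketch (not from the original answer, assuming Eigen >= 3.3 compiled with nvcc): fixed-size Eigen types never allocate dynamically, so they can be used directly inside a kernel:

#include <Eigen/Dense>
#include <cuda_runtime.h>

// fixed-size Eigen vectors need no heap allocation, so Eigen >= 3.3
// allows them inside a __global__ kernel
__global__ void addKernel(const Eigen::Vector3f* a, const Eigen::Vector3f* b,
                          Eigen::Vector3f* c, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main()
{
    const int n = 256;
    Eigen::Vector3f *a, *b, *c;
    cudaMalloc(&a, n * sizeof(Eigen::Vector3f));
    cudaMalloc(&b, n * sizeof(Eigen::Vector3f));
    cudaMalloc(&c, n * sizeof(Eigen::Vector3f));

    addKernel<<<1, n>>>(a, b, c, n); // operates on (uninitialized) demo data
    cudaDeviceSynchronize();

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}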

OpenCL

Note that OpenCL does more than GPGPU computing, since it supports heterogeneous platforms (multi-core CPUs, GPUs etc.).

  • OpenACC: this project provides OpenMP-like support for GPGPU. A large part of the programming is done implicitly by the compiler and the run-time API. You can find sample code on their website.

The OpenACC Application Program Interface describes a collection of compiler directives to specify loops and regions of code in standard C, C++ and Fortran to be offloaded from a host CPU to an attached accelerator, providing portability across operating systems, host CPUs and accelerators.
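
An illustrative sketch of that directive style (not from the original answer; it assumes an OpenACC-capable compiler, e.g. invoked with an -acc flag): the pragma asks the compiler to offload the loop, and the data clauses describe the transfers:

#include <vector>
#include <cstdio>

int main()
{
    const int n = 1 << 20;
    std::vector<float> a(n, 1.0f), b(n, 2.0f);
    float* pa = a.data();
    float* pb = b.data();

    // offload the loop; copyin moves b to the device, copy moves a both ways
    #pragma acc parallel loop copyin(pb[0:n]) copy(pa[0:n])
    for (int i = 0; i < n; ++i)
        pa[i] += 2.0f * pb[i];

    std::printf("a[0] = %f\n", pa[0]); // expect 5.0
    return 0;
}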

  • Bolt: open-source library with an STL-like interface.

Bolt is a C++ template library optimized for heterogeneous computing. Bolt is designed to provide high-performance library implementations for common algorithms such as scan, reduce, transform, and sort. The Bolt interface was modeled on the C++ Standard Template Library (STL). Developers familiar with the STL will recognize many of the Bolt APIs and customization techniques.
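
An illustrative sketch of that STL-like call pattern (not from the original answer; it assumes Bolt and an OpenCL runtime are installed, and the call mirrors std::sort):

#include <bolt/cl/sort.h>
#include <vector>
#include <cstdlib>

int main()
{
    std::vector<int> v(1 << 20);
    for (std::size_t i = 0; i < v.size(); ++i) v[i] = std::rand();

    // Bolt picks the execution target (GPU via OpenCL, or a CPU fallback)
    bolt::cl::sort(v.begin(), v.end());

    return v.front() <= v.back() ? 0 : 1;
}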

  • Boost.Compute: as @Kyle Lutz said, Boost.Compute provides an STL-like interface for OpenCL. Note that this is not an official Boost library (yet).

  • SkelCL "is a library providing high-level abstractions for alleviated programming of modern parallel heterogeneous systems". This library relies on skeleton programming, and you can find more information in their research papers.

CUDA + OpenCL

  • ArrayFire is an open-source (it used to be proprietary) GPGPU programming library. They first targeted CUDA, but now support OpenCL as well. You can check the examples available online. NVIDIA's website provides a good summary of its key features.

Complementary information

Although this is not really within the scope of this question, the same kind of support also exists for other programming languages.

If you need to do linear algebra (for instance) or other specific operations, dedicated math libraries are also available for CUDA and OpenCL (e.g. ViennaCL, CUBLAS, MAGMA, etc.).

Also note that using these libraries does not prevent you from doing some low-level operations if you need to do some very specific computation.

Finally, we can mention the future of the C++ standard library. There has been extensive work to add parallelism support. This is still a technical specification, and GPUs are not explicitly mentioned AFAIK (although NVIDIA's Jared Hoberock, developer of Thrust, is directly involved), but the will to make this a reality is definitely there.

Answer by Kyle Lutz

Take a look at Boost.Compute. It provides a high-level, STL-like interface including containers like vector<T> and algorithms like transform() and sort().

It's built on OpenCL, allowing it to run on most modern GPUs and CPUs, including those by NVIDIA, AMD, and Intel.
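
A minimal sketch of that interface, close to the library's own introductory example (assuming Boost and an OpenCL runtime are installed): copy data to the device, transform it there, and copy it back:

#include <boost/compute/core.hpp>
#include <boost/compute/algorithm/copy.hpp>
#include <boost/compute/algorithm/transform.hpp>
#include <boost/compute/container/vector.hpp>
#include <boost/compute/functional/math.hpp>
#include <vector>

namespace compute = boost::compute;

int main()
{
    // pick the default OpenCL device and set up a queue on it
    compute::device device = compute::system::default_device();
    compute::context ctx(device);
    compute::command_queue queue(ctx, device);

    std::vector<float> host(10000, 3.14f);
    compute::vector<float> dev(host.size(), ctx);

    compute::copy(host.begin(), host.end(), dev.begin(), queue);
    compute::transform(dev.begin(), dev.end(), dev.begin(),
                       compute::sqrt<float>(), queue); // runs on the device
    compute::copy(dev.begin(), dev.end(), host.begin(), queue);

    return 0;
}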

Answer by scottzed

If you're looking for higher-dimensional containers and the ability to pass and manipulate these containers in kernel code, I've spent the last few years developing the ecuda API to assist in my own scientific research projects (so it has been put through its paces). Hopefully it can fill a needed niche. A brief example of how it can be used (C++11 features are used here, but ecuda will work fine with pre-C++11 compilers):

#include <cstdlib>
#include <iomanip>
#include <iostream>
#include <vector>

#include <ecuda/ecuda.hpp>

// kernel function
__global__
void calcColumnSums(
  typename ecuda::matrix<double>::const_kernel_argument mat,
  typename ecuda::vector<double>::kernel_argument vec
)
{
    const std::size_t t = threadIdx.x;
    auto col = mat.get_column(t);
    vec[t] = ecuda::accumulate( col.begin(), col.end(), static_cast<double>(0) );
}

int main( int argc, char* argv[] )
{

    // allocate 1000x1000 hardware-aligned device memory matrix
    ecuda::matrix<double> deviceMatrix( 1000, 1000 );

    // generate random values row-by-row and copy to matrix
    std::vector<double> hostRow( 1000 );
    for( std::size_t i = 0; i < 1000; ++i ) {
        for( double& x : hostRow ) x = static_cast<double>(rand())/static_cast<double>(RAND_MAX);
        ecuda::copy( hostRow.begin(), hostRow.end(), deviceMatrix[i].begin() );
    }

    // allocate device memory for column sums
    ecuda::vector<double> deviceSums( 1000 );

    CUDA_CALL_KERNEL_AND_WAIT(
        calcColumnSums<<<1,1000>>>( deviceMatrix, deviceSums )
    );

    // copy columns sums to host and print
    std::vector<double> hostSums( 1000 );
    ecuda::copy( deviceSums.begin(), deviceSums.end(), hostSums.begin() );

    std::cout << "SUMS =";
    for( const double& x : hostSums ) std::cout << " " << std::fixed << x;
    std::cout << std::endl;

    return 0;

}

I wrote it to be as intuitive as possible (usually as simple as replacing std:: with ecuda::). If you know the STL, then ecuda should do what you'd logically expect a CUDA-based C++ extension to do.

Answer by ddemidov

Another high-level library is VexCL, a vector expression template library for OpenCL. It provides intuitive notation for vector operations and is available under the MIT license.
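
An illustrative sketch of that notation (not from the original answer; it assumes VexCL and an OpenCL runtime are installed): the whole-vector expression below is compiled by VexCL into a single generated kernel:

#include <vexcl/vexcl.hpp>
#include <vector>

int main()
{
    // pick one GPU from the available OpenCL devices
    vex::Context ctx(vex::Filter::GPU && vex::Filter::Count(1));

    const size_t n = 1 << 20;
    std::vector<double> host(n, 1.0);

    vex::vector<double> x(ctx, host); // copy to the device
    vex::vector<double> y(ctx, n);

    y = 2.0 * sin(x) + x; // one fused kernel, generated by VexCL

    return 0;
}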

Answer by Dimitri

The cpp-opencl project provides a way to make programming GPUs easy for the developer. It allows you to implement data parallelism on a GPU directly in C++ instead of using OpenCL.

Please see http://dimitri-christodoulou.blogspot.com/2014/02/implement-data-parallelism-on-gpu.html

And the source code: https://github.com/dimitrs/cpp-opencl

See the example below. The code in the parallel_for_each lambda function is executed on the GPU, and all the rest is executed on the CPU. More specifically, the “square” function is executed both on the CPU (via a call to std::transform) and the GPU (via a call to compute::parallel_for_each).

#include <vector>
#include <algorithm> // for std::transform
#include <stdio.h>
#include "ParallelForEach.h"

template<class T> 
T square(T x)  
{
    return x * x;
}

void func() {
  std::vector<int> In {1,2,3,4,5,6};
  std::vector<int> OutGpu(6);
  std::vector<int> OutCpu(6);

  compute::parallel_for_each(In.begin(), In.end(), OutGpu.begin(), [](int x){
      return square(x);
  });


  std::transform(In.begin(), In.end(), OutCpu.begin(), [](int x) {
    return square(x);
  });

  // Do something with OutCpu and OutGpu ...

}

int main() {
  func();
  return 0;
}

Answer by Pietro

The new OpenMP version 4 now includes accelerator offload support.

AFAIK, GPUs are considered accelerators.
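
An illustrative sketch of the offload syntax (assuming a compiler built with offload support; the target construct and map clauses are standard OpenMP 4.x, and the loop falls back to the host otherwise):

#include <cstdio>

int main()
{
    const int n = 1 << 20;
    static float a[1 << 20], b[1 << 20];
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // map clauses move data to/from the accelerator; the loop body
    // runs on the device
    #pragma omp target teams distribute parallel for map(to: b[0:n]) map(tofrom: a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] += 2.0f * b[i];

    std::printf("a[0] = %f\n", a[0]); // expect 5.0
    return 0;
}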

Answer by isti_spl

C++ AMP is the answer you are looking for.
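
An illustrative sketch of the C++ AMP style (it ships with Visual C++; array_view manages host/device transfers and the restrict(amp) lambda runs on the accelerator):

#include <amp.h>
#include <vector>

int main()
{
    const int n = 1024;
    std::vector<int> a(n, 1), b(n, 2), c(n);

    concurrency::array_view<const int, 1> av(n, a), bv(n, b);
    concurrency::array_view<int, 1> cv(n, c);
    cv.discard_data(); // no need to copy c's initial contents to the device

    concurrency::parallel_for_each(cv.extent,
        [=](concurrency::index<1> i) restrict(amp) {
            cv[i] = av[i] + bv[i]; // runs on the accelerator
        });

    cv.synchronize(); // copy the results back into c
    return 0;
}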