Using Java with Nvidia GPUs (CUDA)

Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. If you use or share this material, you must do so under the same license and attribute it to the original authors (not me) at Stack Overflow: http://stackoverflow.com/questions/22866901/

java, cuda, gpu-programming, multi-gpu

Asked by Hans

I'm working on a business project that is done in Java, and it needs huge computation power to compute business markets. Simple math, but with a huge amount of data.

We ordered some CUDA GPUs to try it with, and since Java is not supported by CUDA, I'm wondering where to start. Should I build a JNI interface? Should I use JCUDA, or are there other ways?

I don't have experience in this field, and I would appreciate it if someone could point me to something so I can start researching and learning.

Accepted answer by Marco13

First of all, you should be aware of the fact that CUDA will not automagically make computations faster. On the one hand, because GPU programming is an art, and it can be very, very challenging to get it right. On the other hand, because GPUs are well-suited only for certain kinds of computations.

This may sound confusing, because you can basically compute anything on the GPU. The key point is, of course, whether you will achieve a good speedup or not. The most important classification here is whether a problem is task parallel or data parallel. The first one refers, roughly speaking, to problems where several threads are working on their own tasks, more or less independently. The second one refers to problems where many threads are all doing the same - but on different parts of the data.
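A minimal host-side Java sketch (plain JDK classes only, no GPU involved) to make the distinction concrete:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.stream.IntStream;

    public class ParallelismKinds {
        public static void main(String[] args) {
            // Task parallelism: a few unrelated tasks, each doing its own work.
            ExecutorService pool = Executors.newFixedThreadPool(2);
            pool.submit(() -> System.out.println("parse the input files"));
            pool.submit(() -> System.out.println("write the report"));
            pool.shutdown();

            // Data parallelism: the SAME operation applied to every element of a
            // large array. This is the pattern that maps well onto a GPU's many cores.
            float[] a = new float[1_000_000];
            float[] b = new float[1_000_000];
            float[] c = new float[1_000_000];
            IntStream.range(0, c.length).parallel().forEach(i -> c[i] = a[i] + b[i]);
        }
    }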

The latter is the kind of problem that GPUs are good at: They have many cores, and all the cores do the same, but operate on different parts of the input data.

You mentioned that you have "simple math but with huge amount of data". Although this may sound like a perfectly data-parallel problem, and thus well-suited for a GPU, there is another aspect to consider: GPUs are ridiculously fast in terms of theoretical computational power (FLOPS, Floating Point Operations Per Second). But they are often throttled down by the memory bandwidth.

This leads to another classification of problems, namely whether a problem is memory bound or compute bound.

The first one refers to problems where the number of instructions that are done for each data element is low. For example, consider a parallel vector addition: You'll have to read two data elements, then perform a single addition, and then write the sum into the result vector. You will not see a speedup when doing this on the GPU, because the single addition does not compensate for the effort of reading/writing the memory.
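In plain (sequential) Java, the per-element work looks like this; the comments count the memory traffic, which is what makes this pattern memory bound:

    // Vector addition: per element we move 12 bytes (two 4-byte reads, one
    // 4-byte write) but perform only a single arithmetic operation. With an
    // arithmetic intensity of roughly 1 FLOP per 12 bytes, the memory bus,
    // not the arithmetic units, dictates the runtime.
    static void vectorAdd(float[] a, float[] b, float[] result) {
        for (int i = 0; i < result.length; i++) {
            result[i] = a[i] + b[i];   // 2 reads + 1 write, 1 addition
        }
    }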

The second term, "compute bound", refers to problems where the number of instructions is high compared to the number of memory reads/writes. For example, consider a matrix multiplication: The number of instructions will be O(n^3) when n is the size of the matrix. In this case, one can expect that the GPU will outperform a CPU beyond a certain matrix size. Another example could be when many complex trigonometric computations (sine, cosine, etc.) are performed on "few" data elements.
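The compute-to-memory ratio is visible even in a naive, purely illustrative Java version of the matrix multiplication:

    // Naive multiplication of two n x n matrices: the data is O(n^2) elements,
    // but the arithmetic work is O(n^3) operations, so each loaded value is
    // reused about n times. This growing compute-to-memory ratio is what
    // eventually lets the GPU pull ahead of the CPU.
    static void matrixMultiply(float[][] a, float[][] b, float[][] c, int n) {
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                float sum = 0.0f;
                for (int k = 0; k < n; k++) {
                    sum += a[i][k] * b[k][j];   // one multiply and one add per k
                }
                c[i][j] = sum;
            }
        }
    }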

As a rule of thumb: You can assume that reading/writing one data element from the "main" GPU memory has a latency of about 500 instructions....

Therefore, another key point for the performance of GPUs is data locality: If you have to read or write data (and in most cases, you will have to ;-)), then you should make sure that the data is kept as close as possible to the GPU cores. GPUs thus have certain memory areas (referred to as "local memory" or "shared memory") that are usually only a few KB in size, but particularly efficient for data that is about to be involved in a computation.

So to emphasize this again: GPU programming is an art that is only remotely related to parallel programming on the CPU. Things like Threads in Java, with all the concurrency infrastructure like ThreadPoolExecutors, ForkJoinPools etc., might give the impression that you just have to split your work somehow and distribute it among several processors. On the GPU, you may encounter challenges on a much lower level: Occupancy, register pressure, shared memory pressure, memory coalescing ... just to name a few.

However, when you have a data-parallel, compute-bound problem to solve, the GPU is the way to go.



A general remark: You specifically asked for CUDA. But I'd strongly recommend you to also have a look at OpenCL. It has several advantages. First of all, it's a vendor-independent, open industry standard, and there are implementations of OpenCL by AMD, Apple, Intel and NVIDIA. Additionally, there is much broader support for OpenCL in the Java world. The only case where I'd rather settle for CUDA is when you want to use the CUDA runtime libraries, like CUFFT for FFT or CUBLAS for BLAS (matrix/vector operations). Although there are approaches for providing similar libraries for OpenCL, they cannot be used directly from the Java side, unless you create your own JNI bindings for these libraries.



You might also find it interesting to hear that in October 2012, the OpenJDK HotSpot group started the project "Sumatra": http://openjdk.java.net/projects/sumatra/. The goal of this project is to provide GPU support directly in the JVM, with support from the JIT. The current status and first results can be seen in their mailing list at http://mail.openjdk.java.net/mailman/listinfo/sumatra-dev



However, a while ago, I collected some resources related to "Java on the GPU" in general. I'll summarize these again here, in no particular order.

(Disclaimer: I'm the author of http://jcuda.org/ and http://jocl.org/)

(Byte)code translation and OpenCL code generation:

https://github.com/aparapi/aparapi: An open-source library that is created and actively maintained by AMD. In a special "Kernel" class, one can override a specific method which should be executed in parallel. The byte code of this method is loaded at runtime using its own bytecode reader. The code is translated into OpenCL code, which is then compiled using the OpenCL compiler. The result can then be executed on the OpenCL device, which may be a GPU or a CPU. If the compilation into OpenCL is not possible (or no OpenCL is available), the code will still be executed in parallel, using a Thread Pool.
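A minimal sketch of this pattern, following Aparapi's documented usage (note: the package is com.aparapi in recent releases and com.amd.aparapi in older ones, so adjust the imports to the version you actually use):

    import com.aparapi.Kernel;
    import com.aparapi.Range;

    public class AparapiVectorAdd {
        public static void main(String[] args) {
            final int size = 1_000_000;
            final float[] a = new float[size];
            final float[] b = new float[size];
            final float[] sum = new float[size];

            // The overridden run() method is what Aparapi translates to OpenCL at runtime.
            Kernel kernel = new Kernel() {
                @Override
                public void run() {
                    int i = getGlobalId();   // index of this work item
                    sum[i] = a[i] + b[i];
                }
            };

            // Runs on the OpenCL device if possible, otherwise falls back to a thread pool.
            kernel.execute(Range.create(size));
            kernel.dispose();
        }
    }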

https://github.com/pcpratts/rootbeer1: An open-source library for converting parts of Java into CUDA programs. It offers dedicated interfaces that may be implemented to indicate that a certain class should be executed on the GPU. In contrast to Aparapi, it tries to automatically serialize the "relevant" data (that is, the complete relevant part of the object graph!) into a representation that is suitable for the GPU.

https://code.google.com/archive/p/java-gpu/: A library for translating annotated Java code (with some limitations) into CUDA code, which is then compiled into a library that executes the code on the GPU. The library was developed in the context of a PhD thesis, which contains profound background information about the translation process.

https://github.com/ochafik/ScalaCL: Scala bindings for OpenCL. Allows special Scala collections to be processed in parallel with OpenCL. The functions that are called on the elements of the collections can be usual Scala functions (with some limitations) which are then translated into OpenCL kernels.

Language extensions

http://www.ateji.com/px/index.html: A language extension for Java that allows parallel constructs (e.g. parallel for loops, OpenMP style) which are then executed on the GPU with OpenCL. Unfortunately, this very promising project is no longer maintained.

http://www.habanero.rice.edu/Publications.html (JCUDA): A library that can translate special Java code (called JCUDA code) into Java and CUDA-C code, which can then be compiled and executed on the GPU. However, the library does not seem to be publicly available.

https://www2.informatik.uni-erlangen.de/EN/research/JavaOpenMP/index.html: Java language extension for OpenMP constructs, with a CUDA backend

Java OpenCL/CUDA binding libraries

https://github.com/ochafik/JavaCL: Java bindings for OpenCL: An object-oriented OpenCL library, based on auto-generated low-level bindings

http://jogamp.org/jocl/www/: Java bindings for OpenCL: An object-oriented OpenCL library, based on auto-generated low-level bindings

http://www.lwjgl.org/: Java bindings for OpenCL: Auto-generated low-level bindings and object-oriented convenience classes

http://jocl.org/: Java bindings for OpenCL: Low-level bindings that are a 1:1 mapping of the original OpenCL API

http://jcuda.org/: Java bindings for CUDA: Low-level bindings that are a 1:1 mapping of the original CUDA API

Miscellaneous

http://sourceforge.net/projects/jopencl/: Java bindings for OpenCL. Seem to be no longer maintained since 2010

http://www.hoopoe-cloud.com/: Java bindings for CUDA. Seem to be no longer maintained



Answered by JohnKlehm

I'd start by using one of the projects out there for Java and CUDA: http://www.jcuda.org/

Answered by David Griffin

From the research I have done, if you are targeting Nvidia GPUs and have decided to use CUDA over OpenCL, I found three ways to use the CUDA API in Java.

  1. JCuda (or alternative) - http://www.jcuda.org/. This seems like the best solution for the problems I am working on. Many libraries, such as CUBLAS, are available in JCuda. Kernels are still written in C, though. (A minimal sketch of what JCuda host code looks like follows after this list.)
  2. JNI - JNI interfaces are not my favorite to write, but they are very powerful and would allow you to do anything CUDA can do.
  3. JavaCPP - This basically lets you make a JNI interface in Java without writing C code directly. There is an example of how to use this with CUDA Thrust here: What is the easiest way to run working CUDA code in Java? To me, this seems like you might as well just write a JNI interface.
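As an illustration of the JCuda option, here is a minimal host-side sketch based on JCuda's runtime-API bindings (memory allocation and copies only; a real application would additionally compile a .cu kernel to PTX and launch it through the driver API, and the exact signatures should be checked against the JCuda version you use):

    import jcuda.Pointer;
    import jcuda.Sizeof;
    import jcuda.runtime.JCuda;
    import jcuda.runtime.cudaMemcpyKind;

    public class JCudaRoundTrip {
        public static void main(String[] args) {
            int n = 1024;
            float[] hostData = new float[n];

            // Allocate device memory and copy the host array to it. Note how the
            // calls mirror cudaMalloc/cudaMemcpy from the C runtime API.
            Pointer deviceData = new Pointer();
            JCuda.cudaMalloc(deviceData, (long) n * Sizeof.FLOAT);
            JCuda.cudaMemcpy(deviceData, Pointer.to(hostData), (long) n * Sizeof.FLOAT,
                    cudaMemcpyKind.cudaMemcpyHostToDevice);

            // ... a kernel (written in CUDA C and compiled separately) would be launched here ...

            // Copy the data back (unchanged in this sketch) and free the device memory.
            JCuda.cudaMemcpy(Pointer.to(hostData), deviceData, (long) n * Sizeof.FLOAT,
                    cudaMemcpyKind.cudaMemcpyDeviceToHost);
            JCuda.cudaFree(deviceData);
        }
    }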

All of these answers are basically just ways of using C/C++ code in Java. You should ask yourself why you need to use Java and whether you could do it in C/C++ instead.

If you like Java and know how to use it, and don't want to work with all the pointer management and what-not that comes with C/C++, then JCuda is probably the answer. On the other hand, the CUDA Thrust library and other libraries like it can be used to do a lot of the pointer management in C/C++, and maybe you should look at that.

If you like C/C++ and don't mind pointer management, but there are other constraints forcing you to use Java, then JNI might be the best approach. Though, if your JNI methods are just going to be wrappers for kernel commands, you might as well just use JCuda.

There are a few alternatives to JCuda, such as Cuda4J and Root Beer, but those do not seem to be maintained, whereas at the time of writing JCuda supports CUDA 10.1, which is the most up-to-date CUDA SDK.

Additionally, there are a few Java libraries that use CUDA, such as deeplearning4j and Hadoop, that may be able to do what you are looking for without requiring you to write kernel code directly. I have not looked into them too much, though.

Answered by Christian Fries

Marco13 already provided an excellent answer.

In case you are searching for a way to use the GPU without implementing CUDA/OpenCL kernels, I would like to add a reference to the finmath-lib-cuda-extensions (finmath-lib-gpu-extensions), http://finmath.net/finmath-lib-cuda-extensions/ (disclaimer: I am the maintainer of this project).

The project provides an implementation of "vector classes", to be precise, an interface called RandomVariable, which provides arithmetic operations and reductions on vectors. There are implementations for the CPU and the GPU, and there are implementations using algorithmic differentiation or plain valuations.
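To give a rough idea of the programming model, here is an illustrative sketch only: the class below is the plain CPU implementation from finmath-lib, while the cuda-extensions project provides GPU-backed implementations of the same RandomVariable interface (typically obtained through a factory), so the exact class and package names should be checked against the project's documentation.

    import net.finmath.montecarlo.RandomVariableFromDoubleArray;
    import net.finmath.stochastic.RandomVariable;

    public class RandomVariableSketch {
        public static void main(String[] args) {
            // Two "vectors" of 100,000 sample values each.
            double[] samplesX = new double[100_000];
            double[] samplesY = new double[100_000];

            RandomVariable x = new RandomVariableFromDoubleArray(0.0, samplesX);
            RandomVariable y = new RandomVariableFromDoubleArray(0.0, samplesY);

            // Element-wise arithmetic on the whole vector ...
            RandomVariable z = x.mult(2.0).add(y).exp();

            // ... and a reduction down to a single number.
            System.out.println(z.getAverage());
        }
    }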

The performance improvements on the GPU are currently small (but for vectors of size 100,000 you may see performance improvements by a factor > 10). This is due to the small kernel sizes. This will improve in a future version.

The GPU implementations use JCuda and JOCL and are available for Nvidia and ATI GPUs.

The library is licensed under Apache 2.0 and available via Maven Central.
