Performance of Java matrix math libraries?

Disclaimer: This page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, note the original source, and attribute it to the original authors (not me): StackOverflow.

Original question: http://stackoverflow.com/questions/529457/
Asked by dfrankow
We are computing something whose runtime is bound by matrix operations. (Some details below if interested.) This experience prompted the following question:
Do folk have experience with the performance of Java libraries for matrix math (e.g., multiply, inverse, etc.)? For example:
I searched and found nothing.
Details of our speed comparison:
We are using Intel FORTRAN (ifort (IFORT) 10.1 20070913). We have reimplemented it in Java (1.6) using Apache commons math 1.2 matrix ops, and it agrees with the Fortran results to all digits of accuracy. (We have reasons for wanting it in Java.) (Java doubles, Fortran real*8). Fortran: 6 minutes, Java: 33 minutes, same machine. jvisualvm profiling shows much time spent in RealMatrixImpl.{getEntry,isValidCoordinate} (which appear to be gone in the unreleased Apache commons math 2.0, but 2.0 is no faster). Fortran is using Atlas BLAS routines (dpotrf, etc.).
Obviously this could depend on our code in each language, but we believe most of the time is in equivalent matrix operations.
In several other computations that do not involve libraries, Java has not been much slower, and sometimes much faster.
Accepted answer by Hamaad Shah
Just to add my 2 cents. I've compared some of these libraries. I attempted to matrix multiply a 3000 by 3000 matrix of doubles with itself. The results are as follows.
Using multithreaded ATLAS with C/C++, Octave, Python and R, the time taken was around 4 seconds.
Using Jama with Java, the time taken was 50 seconds.
Using Colt and Parallel Colt with Java, the time taken was 150 seconds!
Using JBLAS with Java, the time taken was again around 4 seconds as JBLAS uses multithreaded ATLAS.
So for me it was clear that the Java libraries didn't perform too well. However if someone has to code in Java, then the best option is JBLAS. Jama, Colt and Parallel Colt are not fast.
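For reference, the pure-Java baseline such benchmarks compare against is essentially a triple loop like the sketch below. This is not the code used by any of the libraries above; it is only a minimal, self-contained harness for timing a square matrix product (at a smaller size than the 3000×3000 case, which takes far longer with this loop order).

```java
// Naive pure-Java matrix multiplication with a simple timing harness.
// Illustrative baseline only; library implementations differ substantially.
public class NaiveMultiply {
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length, m = b[0].length, k = b.length;
        double[][] c = new double[n][m];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < m; j++) {
                double s = 0.0;
                for (int p = 0; p < k; p++)
                    s += a[i][p] * b[p][j]; // column-wise walk of b: cache-unfriendly
                c[i][j] = s;
            }
        }
        return c;
    }

    public static void main(String[] args) {
        int n = 500; // small enough to finish quickly; scale up to reproduce
        double[][] a = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                a[i][j] = Math.random();
        long t0 = System.nanoTime();
        double[][] c = multiply(a, a);
        long ms = (System.nanoTime() - t0) / 1_000_000;
        System.out.println(n + "x" + n + " multiply took " + ms + " ms");
    }
}
```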
Answer by Zach Scrivena
Have you taken a look at the Intel Math Kernel Library? It claims to outperform even ATLAS. MKL can be used in Java through JNI wrappers.
Answer by Neil Coffey
I can't really comment on specific libraries, but in principle there's little reason for such operations to be slower in Java. Hotspot generally does the kinds of things you'd expect a compiler to do: it compiles basic math operations on Java variables to corresponding machine instructions (it uses SSE instructions, but only one per operation); accesses to elements of an array are compiled to use "raw" MOV instructions as you'd expect; it makes decisions on how to allocate variables to registers when it can; it re-orders instructions to take advantage of processor architecture... A possible exception is that as I mentioned, Hotspot will only perform one operation per SSE instruction; in principle you could have a fantastically optimised matrix library that performed multiple operations per instruction, although I don't know if, say, your particular FORTRAN library does so or if such a library even exists. If it does, there's currently no way for Java (or at least, Hotspot) to compete with that (though you could of course write your own native library with those optimisations to call from Java).
So what does all this mean? Well:
- in principle, it is worth hunting around for a better-performing library, though unfortunately I can't recommend one
- if performance is really critical to you, I would consider just coding your own matrix operations, because you may then be able to perform certain optimisations that a library generally can't, or that a particular library you're using doesn't (if you have a multiprocessor machine, find out if the library is actually multithreaded)
A hindrance to matrix operations is often the data locality issue that arises when you need to traverse both row by row and column by column, e.g. in matrix multiplication, since you have to store the data in an order that optimises one or the other. But if you hand-write the code, you can sometimes combine operations to optimise data locality (e.g. if you're multiplying a matrix by its transpose, you can turn a column traversal into a row traversal if you write a dedicated function instead of combining two library functions). As usual in life, a library will give you non-optimal performance in exchange for faster development; you need to decide just how important performance is to you.
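The dedicated-function trick described above can be sketched in plain Java. This is a hypothetical helper, not taken from any of the libraries discussed: since (A·Aᵀ)[i][j] is the dot product of rows i and j of A, computing the product directly lets both operands be read strictly row by row.

```java
// Sketch of the data-locality point above: C = A * A^T computed without
// materialising the transpose, reading A row-wise only. A generic
// multiply(A, transpose(A)) would walk its second operand column by
// column, which is cache-unfriendly for row-major (array-of-arrays) storage.
public class TransposeMultiply {
    static double[][] multiplyByOwnTranspose(double[][] a) {
        int n = a.length, k = a[0].length;
        double[][] c = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {   // result is symmetric
                double s = 0.0;
                for (int p = 0; p < k; p++)
                    s += a[i][p] * a[j][p]; // rows i and j, both traversed row-wise
                c[i][j] = s;
                c[j][i] = s;
            }
        }
        return c;
    }
}
```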
Answer by Varkhan
Linalg code that relies heavily on Pentium and later processors' vector computing capabilities (starting with the MMX extensions, like LAPACK and now Atlas BLAS) is not "fantastically optimized", but simply industry-standard. To replicate that performance in Java you are going to need native libraries. I have had the same performance problem as you describe (mainly, to be able to compute Cholesky decompositions) and have found nothing really efficient: Jama is pure Java, since it is supposed to be just a template and reference kit for implementers to follow... which never happened. You know Apache math commons... As for COLT, I have still to test it, but it seems to rely heavily on Ninja improvements, most of which were reached by building an ad-hoc Java compiler, so I doubt it's going to help. At that point, I think we "just" need a collective effort to build a native Jama implementation...
Answer by dfrankow
Building on Varkhan's post that Pentium-specific native code would do better:
- jBLAS: An alpha-stage project with JNI wrappers for Atlas: http://www.jblas.org
  - Author's blog post: http://mikiobraun.blogspot.com/2008/10/matrices-jni-directbuffers-and-number.html
- MTJ: Another such project: http://code.google.com/p/matrix-toolkits-java/
Answer by Nick Fortescue
We have used COLT for some pretty large serious financial calculations and have been very happy with it. In our heavily profiled code we have almost never had to replace a COLT implementation with one of our own.
In their own testing (obviously not independent) I think they claim within a factor of 2 of the Intel hand-optimised assembler routines. The trick to using it well is making sure that you understand their design philosophy, and avoid extraneous object allocation.
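The "avoid extraneous object allocation" advice can be illustrated with a toy kernel. This is not Colt's actual API, just a hypothetical sketch of the pattern: in a hot loop, prefer an operation that writes into a caller-supplied destination over one that allocates a fresh result on every call.

```java
// Two versions of a scale-and-add kernel (r = alpha * x + y).
// The in-place form creates no garbage when called repeatedly,
// which matters inside tight numerical loops.
public class InPlaceOps {
    // Allocating form: returns a fresh array on every call.
    static double[] axpy(double alpha, double[] x, double[] y) {
        double[] r = new double[x.length];
        for (int i = 0; i < x.length; i++) r[i] = alpha * x[i] + y[i];
        return r;
    }

    // In-place form: caller supplies the destination, reusable across calls.
    static void axpyInto(double alpha, double[] x, double[] y, double[] dest) {
        for (int i = 0; i < x.length; i++) dest[i] = alpha * x[i] + y[i];
    }
}
```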
Answer by Nick Fortescue
There are many different freely available Java linear algebra libraries: http://www.ujmp.org/java-matrix/benchmark/ Unfortunately, that benchmark only gives you info about matrix multiplication (with transposing, the test does not allow the different libraries to exploit their respective design features).
What you should look at is how these linear algebra libraries perform when asked to compute various matrix decompositions. http://ojalgo.org/matrix_compare.html
Answer by Peter Lawrey
I have found that if you are creating a lot of high dimensional matrices, you can make Jama about 20% faster if you change it to use a single dimensional array instead of a two dimensional array. This is because Java doesn't support multi-dimensional arrays as efficiently; i.e., it creates an array of arrays.
Colt does this already, but I have found it is more complicated and more powerful than Jama, which may explain why simple functions are slower with Colt.
The answer really depends on what you are doing. Jama doesn't support a fraction of the things Colt can do, which may make more of a difference.
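The single-array layout described above can be sketched like this (illustrative only, not Jama's or Colt's actual internals): one row-major `double[rows * cols]` replaces the array-of-arrays, so element access is a single index computation instead of chasing two object references.

```java
// Flat row-major matrix storage: element (i, j) lives at i * cols + j.
// Avoids the per-row heap objects and double indirection of double[][].
public class FlatMatrix {
    final int rows, cols;
    final double[] data;

    FlatMatrix(int rows, int cols) {
        this.rows = rows;
        this.cols = cols;
        this.data = new double[rows * cols]; // one contiguous block
    }

    double get(int i, int j)           { return data[i * cols + j]; }
    void   set(int i, int j, double v) { data[i * cols + j] = v; }
}
```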
Answer by Steve Lianoglou
Matrix Toolkits Java (MTJ) was already mentioned before, but perhaps it's worth mentioning again for anyone else stumbling onto this thread. For those interested, it seems like there's also talk about having MTJ replace the linalg library in Apache commons math 2.0, though I'm not sure how that's progressing lately.
Answer by Mark Reid
You may want to check out the jblas project. It's a relatively new Java library that uses BLAS, LAPACK and ATLAS for high-performance matrix operations.
The developer has posted some benchmarks in which jblas comes off favourably against MTJ and Colt.