我可以在 Java 代码中做什么来优化 CPU 缓存?
声明:本页面是 Stack Overflow 热门问题的中英对照翻译,遵循 CC BY-SA 4.0 协议。如果您需要使用它,必须同样遵循 CC BY-SA 许可,注明原文地址和作者信息,并将其归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/1478280/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me):
StackOverFlow
What can I do in Java code to optimize for CPU caching?
提问by Hanno Fietz
When writing a Java program, do I have influence on how the CPU will utilize its cache to store my data? For example, if I have an array that is accessed a lot, does it help if it's small enough to fit in one cache line (typically 128 byte on a 64-bit machine)? What if I keep a much used object within that limit, can I expect the memory used by its members to be close together and staying in cache?
在编写 Java 程序时,我能否影响 CPU 利用缓存存储数据的方式?例如,如果我有一个经常被访问的数组,它小到能装进一个缓存行(在 64 位机器上通常为 128 字节)是否会有帮助?如果我把一个常用对象控制在这个大小以内,能否期望它的成员所占用的内存彼此靠近并留在缓存中?
Background: I'm building a compressed digital tree, that's heavily inspired by the Judy arrays, which are in C. While I'm mostly after its node compression techniques, Judy has CPU cache optimization as a central design goal and the node types as well as the heuristics for switching between them are heavily influenced by that. I was wondering if I have any chance of getting those benefits, too?
背景:我正在构建一棵压缩数字树,其设计深受 C 语言实现的 Judy 数组的启发。虽然我主要关注的是它的节点压缩技术,但 Judy 把 CPU 缓存优化作为核心设计目标,其节点类型以及在节点类型之间切换的启发式方法都深受这一目标的影响。我想知道我是否也有机会获得这些好处?
Edit: The general advice of the answers so far is, don't try to microoptimize machine-level details when you're so far away from the machine as you're in Java. I totally agree, so felt I had to add some (hopefully) clarifying comments, to better explain why I think the question still makes sense. These are below:
编辑:到目前为止,各个回答的总体建议是:当你像在 Java 中那样远离机器时,不要尝试对机器级别的细节做微优化。我完全同意,因此觉得有必要补充一些(希望能)澄清问题的说明,以便更好地解释为什么我认为这个问题仍然有意义。补充如下:
There are some things that are just generally easier for computers to handle because of the way they are built. I have seen Java code run noticeably faster on compressed data (from memory), even though the decompression had to use additional CPU cycles. If the data were stored on disk, it's obvious why that is so, but of course in RAM it's the same principle.
由于计算机的构建方式,有些事情对计算机来说天生就更容易处理。我见过 Java 代码在(内存中的)压缩数据上运行得明显更快,即使解压需要消耗额外的 CPU 周期。如果数据存放在磁盘上,原因显而易见;但在 RAM 中,原理其实是一样的。
Now, computer science has lots to say about what those things are, for example, locality of reference is great in C and I guess it's still great in Java, maybe even more so, if it helps the optimizing runtime to do more clever things. But how you accomplish it might be very different. In C, I might write code that manages larger chunks of memory itself and uses adjacent pointers for related data.
现在,计算机科学对这些事情有很多论述。例如,引用局部性在 C 中很有用,我猜它在 Java 中同样有用,如果它能帮助优化运行时做更聪明的事情,甚至可能更有用。但实现方式可能非常不同。在 C 中,我可以编写自行管理较大内存块的代码,并让相关数据的指针彼此相邻。
In Java, I can't (and don't want to) know much about how memory is going to be managed by a particular runtime. So I have to take optimizations to a higher level of abstraction, too. My question is basically, how do I do that? For locality of reference, what does "close together" mean at the level of abstraction I'm working on in Java? Same object? Same type? Same array?
在 Java 中,我无法(也不想)深入了解特定运行时将如何管理内存。因此,我必须把优化放到更高的抽象层级上。我的问题基本上是:我该怎么做?对于引用局部性,在我所处的 Java 抽象层级上,“靠近在一起”意味着什么?同一个对象?同一类型?同一个数组?
In general, I don't think that abstraction layers change the "laws of physics", metaphorically speaking. Doubling your array in size every time you run out of space is a good strategy in Java, too, even though you don't call malloc()anymore.
一般来说,打个比方,我不认为抽象层会改变“物理定律”。每次空间用完时把数组扩大一倍,在 Java 中同样是个好策略,即使你不再调用 malloc()。
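The doubling strategy mentioned above can be sketched in Java like this (a minimal illustration; the class name GrowableIntArray and the initial capacity of 8 are assumptions, not from the original question):

上面提到的加倍扩容策略可以用 Java 大致示意如下(这只是一个最小示例;类名 GrowableIntArray 和初始容量 8 都是示例性假设,并非来自原问题):

```java
// Illustrative growable int array: the backing int[] stays contiguous
// in memory, and doubling on overflow gives amortized O(1) appends.
final class GrowableIntArray {
    private int[] data = new int[8]; // assumed starting capacity
    private int size = 0;

    void add(int value) {
        if (size == data.length) {
            // Double the capacity when full, as described above.
            data = java.util.Arrays.copyOf(data, data.length * 2);
        }
        data[size++] = value;
    }

    int get(int index) { return data[index]; }
    int size() { return size; }
}
```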
采纳答案by erickson
The key to good performance with Java is to write idiomatic code, rather than trying to outwit the JIT compiler. If you write your code to try to influence it to do things in a certain way at the native instruction level, you are more likely to shoot yourself in the foot.
用 Java 获得良好性能的关键是编写惯用的代码,而不是试图智胜 JIT 编译器。如果你编写代码试图影响它在本机指令级别以某种特定方式行事,反而更可能弄巧成拙。
That isn't to say that common principles like locality of reference don't matter. They do, but I would consider the use of arrays and such to be performance-aware, idiomatic code, but not "tricky."
这并不是说引用局部性之类的通用原则无关紧要。它们确实重要,但我认为使用数组之类的做法属于有性能意识的惯用代码,而不是“取巧”。
HotSpot and other optimizing runtimes are extremely clever about how they optimize code for specific processors. (For an example, check out this discussion.) If I were an expert machine language programmer, I'd write machine language, not Java. And if I'm not, it would be unwise to think that I could do a better job of optimizing my code than the experts.
HotSpot 和其他优化运行时在如何为特定处理器优化代码方面非常聪明。(例如,查看此讨论。)如果我是专家级机器语言程序员,我会编写机器语言,而不是 Java。如果我不是,那么认为我可以比专家更好地优化我的代码是不明智的。
Also, even if you do know the best way to implement something for a particular CPU, the beauty of Java is write-once-run-anywhere. Clever tricks to "optimize" Java code tend to make optimization opportunities harder for the JIT to recognize. Straight-forward code that adheres to common idioms is easier for an optimizer to recognize. So even when you get the best Java code for your testbed, that code might perform horribly on a different architecture, or at best, fail to take advantages of enhancements in future JITs.
此外,即使你确实知道针对特定 CPU 实现某个功能的最佳方法,Java 的美妙之处也在于一次编写、到处运行。“优化” Java 代码的巧妙技巧往往会让 JIT 更难识别优化机会;遵循常见习惯用法的直白代码更容易被优化器识别。因此,即使你在自己的测试平台上得到了最好的 Java 代码,这些代码在不同的体系结构上也可能表现得非常糟糕,或者至少无法利用未来 JIT 的增强功能。
If you want good performance, keep it simple. Teams of really smart people are working to make it fast.
如果您想要良好的性能,请保持简单。由真正聪明的人组成的团队正在努力使其更快。
回答by Engineer
If the data you're crunching is primarily or wholly made up of primitives (e.g. in numeric problems), I would advise the following.
如果您要处理的数据主要或完全由基本类型组成(例如数值计算问题),我建议如下。
Allocate a flat structure of fixed size arrays-of-primitives at initialisation-time, and make sure the data therein is periodically compacted/defragmented (0->n where n is the smallest max index possible given your element count), to be iterated over using a for-loop. This is the only way to guarantee contiguous allocation in Java, and compaction further serves to improve locality of reference. Compaction is beneficial, as it reduces the need to iterate over unused elements, reducing the number of conditionals: As the for loop iterates, the termination occurs earlier, and less iteration = less movement through the heap = fewer chances for a cache miss. While compaction creates an overhead in and of itself, this may be done only periodically (with respect to your primary areas of processing) if you so choose.
在初始化时分配一个由固定大小的基本类型数组组成的扁平结构,并确保其中的数据被定期压缩/碎片整理(压实到 0->n,其中 n 是在给定元素数量下可能的最小最大索引),然后用 for 循环对其进行迭代。这是在 Java 中保证连续分配的唯一方法,而压缩还能进一步提高引用局部性。压缩是有益的,因为它减少了对未使用元素的迭代,从而减少了条件判断的数量:随着 for 循环的迭代,循环会更早终止,而更少的迭代 = 更少的堆内移动 = 更少的缓存未命中机会。虽然压缩本身会带来开销,但如果你愿意,可以只(相对于你的主要处理工作)定期执行它。
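A hedged sketch of the periodic compaction pass described above (the sentinel Integer.MIN_VALUE marking a dead slot is an assumption; the answer does not fix any particular convention):

上文所述的定期压缩过程可以粗略示意如下(用 Integer.MIN_VALUE 作为“死槽位”哨兵值只是一个假设;原回答并未规定具体约定):

```java
// Compaction/defragmentation over a flat primitive array.
// Assumption: Integer.MIN_VALUE marks an unused ("dead") slot.
final class Compactor {
    static final int DEAD = Integer.MIN_VALUE;

    // Moves all live elements to the front, preserving order, and
    // returns the new element count n so hot loops can run 0..n.
    static int compact(int[] data) {
        int write = 0;
        for (int read = 0; read < data.length; read++) {
            if (data[read] != DEAD) {
                data[write++] = data[read];
            }
        }
        for (int i = write; i < data.length; i++) {
            data[i] = DEAD; // clear the tail so stale values aren't reused
        }
        return write;
    }
}
```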
Even better, you can interleave values in these pre-allocated arrays. For instance, if you are representing spatial transforms for many thousands of entities in 2D space, and are processing the equations of motion for each such, you might have a tight loop like
更好的是,您可以在这些预先分配的数组中交错存放值。例如,如果您要表示 2D 空间中成千上万个实体的空间变换,并为每个实体求解运动方程,您可能会有一个类似下面的紧凑循环
int axIdx, ayIdx, vxIdx, vyIdx, xIdx, yIdx;
//Acceleration, velocity, and displacement for each
//of x and y totals 6 elements per entity.
for (axIdx = 0; axIdx < array.length; axIdx += 6)
{
    ayIdx = axIdx + 1;
    vxIdx = axIdx + 2;
    vyIdx = axIdx + 3;
    xIdx  = axIdx + 4;
    yIdx  = axIdx + 5;
    //velocity1 = velocity0 + acceleration
    array[vxIdx] += array[axIdx];
    array[vyIdx] += array[ayIdx];
    //displacement1 = displacement0 + velocity
    array[xIdx] += array[vxIdx];
    array[yIdx] += array[vyIdx];
}
This example ignores such issues as rendering of those entities using their associated (x,y)... rendering always requires non-primitives (thus, references/pointers). If you do need such object instances, then you can no longer guarantee locality of reference, and will likely be jumping around all over the heap. So if you can split your code into sections where you have primitive-intensive processing as shown above, then this approach will help you a lot. For games at least, AI, dynamic terrain, and physics can be some of the most processor-intensive aspects, and are all numeric, so this approach can be very beneficial.
此示例忽略了诸如使用实体关联的 (x,y) 进行渲染之类的问题……渲染总是需要非基本类型(也就是引用/指针)。如果您确实需要这样的对象实例,那就无法再保证引用局部性,并且很可能会在整个堆中四处跳转。因此,如果您能把代码拆分成如上所示的基本类型密集处理的部分,那么这种方法会对您帮助很大。至少对游戏而言,AI、动态地形和物理往往是对处理器要求最高的几个方面,而且都是数值计算,因此这种方法可能非常有益。
回答by Bill K
If you are down to where an improvement of a few percent makes a difference, use C where you'll get an improvement of 50-100%!
如果您已经到了提升几个百分点都很重要的地步,那就用 C 吧,您将获得 50-100% 的提升!
If you think that the ease of use of Java makes it a better language to use, then don't screw it up with questionable optimizations.
如果您认为 Java 的易用性使其成为一种更好的语言,那么请不要用有问题的优化将其搞砸。
The good news is that Java will do a lot of stuff beneath the covers to improve your code at runtime, but it almost certainly won't do the kind of optimizations you're talking about.
好消息是 Java 会在幕后做很多事情来在运行时改进您的代码,但它几乎肯定不会做您正在谈论的那种优化。
If you decide to go with Java, just write your code as clearly as you can, don't take minor optimizations into account at all. (Major ones like using the right collections for the right job, not allocating/freeing objects inside a loop, etc. are still worthwhile)
如果您决定使用 Java,请尽可能清晰地编写代码,完全不要考虑细微的优化。(主要的优化,比如为合适的工作选择合适的集合、不在循环内分配/释放对象等,仍然值得去做)
回答by juancn
So far the advice is pretty strong, in general it's best not to try and outsmart the JIT. But as you say some knowledge about the details is useful sometimes.
到目前为止,建议非常有力,一般来说,最好不要试图超越 JIT。但正如你所说,一些关于细节的知识有时是有用的。
Regarding memory layout for objects, Sun's JVM (now Oracle's) lays objects into memory by type (i.e. doubles and longs first, then ints and floats, then shorts and chars, after that bytes and booleans, and finally object references). You can get more details here.
关于对象的内存布局,Sun 的 JVM(现在是 Oracle 的)按字段类型把对象排入内存(即先是 double 和 long,然后是 int 和 float,接着是 short 和 char,之后是 byte 和 boolean,最后是对象引用)。您可以在此处获得更多详细信息。
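As an illustration of that point (the class below is hypothetical, and the actual layout is a JVM implementation detail that may differ between versions and VMs):

下面用一个假想的类来说明这一点(实际布局是 JVM 的实现细节,不同版本和不同 VM 之间可能有差异):

```java
// Hypothetical class: regardless of this declaration order, HotSpot
// has traditionally grouped instance fields by width -- longs/doubles
// first, then ints/floats, then shorts/chars, then bytes/booleans,
// and object references last. Declaration order is NOT memory order.
class Node {
    boolean leaf;   // 1 byte    -> likely placed after the wider fields
    Object  child;  // reference -> grouped with other references
    long    key;    // 8 bytes   -> likely placed first
    int     count;  // 4 bytes
}
```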
Local variables are usually kept in the stack (that is, references and primitive types).
局部变量通常保存在栈上(即引用和基本类型)。
As Nick mentions, the best way to ensure the memory layout in Java is by using primitive arrays. That way you can make sure that data is contiguous in memory. Be careful about array sizes though, GCs have trouble with large arrays. It also has the downside that you have to do some memory management yourself.
正如 Nick 所说,在 Java 中确保内存布局的最佳方法是使用基本类型数组。这样您就可以确保数据在内存中是连续的。不过要注意数组大小,GC 在处理大数组时会遇到麻烦。它还有一个缺点:您必须自己做一些内存管理。
On the upside, you can use a Flyweight pattern to get Object-like usability while keeping fast performance.
从好的方面来说,您可以使用享元模式在保持快速性能的同时获得类对象的可用性。
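One way to read that suggestion (a hedged sketch: the class EntityView is illustrative, and the 6-double stride is borrowed from the interleaved-array example in the earlier answer, not from this one):

这个建议的一种理解方式如下(示意性草图:EntityView 类是示例性的,6 个 double 的步长借用自前面回答中的交错数组示例,并非出自本回答):

```java
// Flyweight-style view over a packed primitive array: one reusable
// object provides object-like accessors without per-entity allocation,
// so the hot data stays in the contiguous double[].
final class EntityView {
    private static final int STRIDE = 6; // ax, ay, vx, vy, x, y
    private final double[] data;
    private int base;

    EntityView(double[] data) { this.data = data; }

    // Re-point the same view at another entity instead of allocating.
    EntityView at(int entityIndex) {
        base = entityIndex * STRIDE;
        return this;
    }

    double x()            { return data[base + 4]; }
    double y()            { return data[base + 5]; }
    void   setX(double v) { data[base + 4] = v; }
    void   setY(double v) { data[base + 5] = v; }
}
```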
If you need the extra oomph in performance, generating your own bytecode on the fly helps with some problems, as long as the generated code is executed enough times and your VM's native code cache doesn't get full (which disables the JIT for all practical purposes).
如果您需要进一步榨取性能,只要生成的代码执行次数足够多,并且您的 VM 的本机代码缓存没有被填满(缓存填满实际上就等于禁用了 JIT),动态生成您自己的字节码可以帮助解决某些问题。
回答by Jay
To the best of my knowledge: No. You pretty much have to be writing in machine code to get that level of optimization. With assembly you're a step away because you no longer control where things are stored. With a compiler you're two steps away because you don't even control the details of the generated code. With Java you're three steps away because there's a JVM interpreting your code on the fly.
据我所知:不能。您基本上必须用机器码编写程序才能做那种级别的优化。用汇编时您已经隔了一步,因为您不再控制数据存放的位置。用编译器时您隔了两步,因为您甚至无法控制生成代码的细节。而用 Java 时您隔了三步,因为有一个 JVM 在动态解释执行您的代码。
I don't know of any constructs in Java that let you control things on that level of detail. In theory you could indirectly influence it by how you organize your program and data, but you're so far away that I don't see how you could do it reliably, or even know whether or not it was happening.
我不知道 Java 中有任何结构能让您在那个细节层面上控制这些事情。理论上,您可以通过组织程序和数据的方式间接影响它,但您离机器太远了,我看不出您如何能可靠地做到这一点,甚至无法知道它是否真的发生了。

