Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA terms, link the original, and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/3651737/
Why the odd performance curve differential between ByteBuffer.allocate() and ByteBuffer.allocateDirect()
Asked by Stu Thompson
I'm working on some SocketChannel-to-SocketChannel code which will do best with a direct byte buffer: long lived and large (tens to hundreds of megabytes per connection). While hashing out the exact loop structure with FileChannels, I ran some micro-benchmarks comparing ByteBuffer.allocate() vs. ByteBuffer.allocateDirect() performance.
There was a surprise in the results that I can't really explain. In the graph below, there is a very pronounced cliff at 256KB and 512KB for the ByteBuffer.allocate() transfer implementation: the performance drops by ~50%! There also seems to be a smaller performance cliff for ByteBuffer.allocateDirect(). (The %-gain series helps to visualize these changes.)
[Graph: Buffer Size (bytes) versus Time (ms)]
Why the odd performance curve differential between ByteBuffer.allocate() and ByteBuffer.allocateDirect()? What exactly is going on behind the curtain?
It may very well be hardware and OS dependent, so here are those details:
- MacBook Pro w/ Dual-core Core 2 CPU
- Intel X25M SSD drive
- OSX 10.6.4
Source code, by request:
package ch.dietpizza.bench;

import static java.lang.String.format;
import static java.lang.System.out;
import static java.nio.ByteBuffer.*;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.UnknownHostException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;

public class SocketChannelByteBufferExample {
    private static WritableByteChannel target;
    private static ReadableByteChannel source;
    private static ByteBuffer          buffer;

    public static void main(String[] args) throws IOException, InterruptedException {
        long timeDirect;
        long normal;
        out.println("start");

        for (int i = 512; i <= 1024 * 1024 * 64; i *= 2) {
            buffer = allocateDirect(i);
            timeDirect = copyShortest();

            buffer = allocate(i);
            normal = copyShortest();

            out.println(format("%d, %d, %d", i, normal, timeDirect));
        }

        out.println("stop");
    }

    private static long copyShortest() throws IOException, InterruptedException {
        int result = 0;
        for (int i = 0; i < 100; i++) {
            int single = copyOnce();
            result = (i == 0) ? single : Math.min(result, single);
        }
        return result;
    }

    private static int copyOnce() throws IOException, InterruptedException {
        initialize();

        long start = System.currentTimeMillis();

        while (source.read(buffer) != -1) {
            buffer.flip();
            target.write(buffer);
            buffer.clear(); // pos = 0, limit = capacity
        }

        long time = System.currentTimeMillis() - start;

        rest();

        return (int) time;
    }

    private static void initialize() throws UnknownHostException, IOException {
        InputStream  is = new FileInputStream(new File("/Users/stu/temp/robyn.in")); // 315 MB file
        OutputStream os = new FileOutputStream(new File("/dev/null"));

        target = Channels.newChannel(os);
        source = Channels.newChannel(is);
    }

    private static void rest() throws InterruptedException {
        System.gc();
        Thread.sleep(200);
    }
}
Answered by bestsss
How ByteBuffer works, and why direct (byte) buffers are the only truly useful ones now.
First, I am a bit surprised this is not common knowledge, but bear with me.
Direct byte buffers allocate an address outside the Java heap.
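The distinction is visible from plain Java code. A minimal sketch (on HotSpot, direct buffers report no accessible backing array; exact hasArray() behavior is implementation-dependent):

```java
import java.nio.ByteBuffer;

public class DirectVsHeap {
    public static void main(String[] args) {
        // Heap buffer: backed by a byte[] that lives inside the Java heap.
        ByteBuffer heap = ByteBuffer.allocate(16);
        System.out.println(heap.isDirect());   // false
        System.out.println(heap.hasArray());   // true

        // Direct buffer: the memory lives outside the Java heap, so native
        // I/O can use its address without locking/copying a heap object.
        ByteBuffer direct = ByteBuffer.allocateDirect(16);
        System.out.println(direct.isDirect()); // true
        System.out.println(direct.hasArray()); // false on HotSpot
    }
}
```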
This is of utmost importance: all OS (and native C) functions can use that address without locking the object on the heap and copying the data. A short example on copying: in order to send any data via Socket.getOutputStream().write(byte[]), the native code has to "lock" the byte[], copy it outside the Java heap, and then call the OS function, e.g. send. The copy is performed either on the stack (for smaller byte[]) or via malloc/free for larger ones. DatagramSockets are no different; they also copy, except they are limited to 64KB and allocated on the stack, which can even kill the process if the thread stack is not large enough or is deep in recursion. Note: locking prevents the JVM/GC from moving/reallocating the object around the heap.
So with the introduction of NIO, the idea was to avoid the copy and the multitudes of stream pipelining/indirection. Often there are 3-4 buffered types of streams before the data reaches its destination. (yay, Poland equalizes(!) with a beautiful shot) By introducing direct buffers, Java can communicate straight to C native code with no locking/copying necessary. Hence the send function can take the address of the buffer plus the position, and the performance is much the same as native C. That's the direct buffer.
The main issue with direct buffers: they are expensive to allocate and expensive to deallocate, and quite cumbersome to use, nothing like byte[].
Non-direct buffers do not offer the true essence the direct buffers do (i.e. the direct bridge to the native/OS); instead they are lightweight and share exactly the same API. Even more, they can wrap a byte[], and their backing array is available for direct manipulation. What's not to love? Well, they have to be copied!
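The wrap-a-byte[] convenience can be sketched in a few lines; writes through the buffer are visible in the original array and vice versa:

```java
import java.nio.ByteBuffer;

public class WrapDemo {
    public static void main(String[] args) {
        byte[] data = new byte[8];
        ByteBuffer buf = ByteBuffer.wrap(data); // heap buffer backed by 'data'

        buf.putInt(42);                 // writes 4 big-endian bytes into data[0..3]
        System.out.println(data[3]);    // 42: the write is visible in the array

        data[4] = 7;                    // direct manipulation of the array...
        System.out.println(buf.get(4)); // 7: ...is visible through the buffer
    }
}
```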
So how does Sun/Oracle handle non-direct buffers, since the OS/native code can't use them? Well, naively. When a non-direct buffer is used, a direct counterpart has to be created. The implementation is smart enough to use a ThreadLocal and cache a few direct buffers via SoftReference* to avoid the hefty cost of creation. The naive part comes when copying them: it attempts to copy the entire buffer (remaining()) each time.
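The caching idea looks roughly like the following. This is a simplified, illustrative sketch of the concept only, not the actual JDK code (the real logic lives in internal classes such as sun.nio.ch.Util, which keeps a small per-thread cache of several buffers):

```java
import java.lang.ref.SoftReference;
import java.nio.ByteBuffer;

// Sketch: a per-thread, softly-referenced cache of one temporary direct buffer.
public class TempDirectBufferCache {
    private static final ThreadLocal<SoftReference<ByteBuffer>> CACHE =
            new ThreadLocal<>();

    static ByteBuffer getTemporaryDirectBuffer(int size) {
        SoftReference<ByteBuffer> ref = CACHE.get();
        ByteBuffer cached = (ref != null) ? ref.get() : null;
        if (cached != null && cached.capacity() >= size) {
            cached.clear();        // reuse the cached buffer...
            cached.limit(size);    // ...restricted to the requested size
            return cached;
        }
        // Cache miss (or GC cleared the SoftReference): pay the expensive
        // direct allocation and remember the result softly.
        ByteBuffer fresh = ByteBuffer.allocateDirect(size);
        CACHE.set(new SoftReference<>(fresh));
        return fresh;
    }

    public static void main(String[] args) {
        ByteBuffer a = getTemporaryDirectBuffer(1024);
        ByteBuffer b = getTemporaryDirectBuffer(512);
        System.out.println(a == b); // true: the second call reuses the cached buffer
    }
}
```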
Now imagine a 512 KB non-direct buffer going to a 64 KB socket buffer; the socket buffer won't take more than its size. So the first time, 512 KB is copied from non-direct to thread-local direct, but only 64 KB of it is used. The next time 512-64 KB is copied but only 64 KB used; the third time 512-64*2 KB is copied but only 64 KB used, and so on... and that's optimistically assuming the socket buffer always empties entirely. So you are copying not just n KB in total, but n × n ÷ m KB (n = 512, m = 16, the average space the socket buffer has left).
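The blow-up can be sanity-checked with a tiny simulation (assuming, for simplicity, that each round drains the full 64 KB socket buffer rather than the 16 KB average above):

```java
public class CopyBlowup {
    public static void main(String[] args) {
        int n = 512; // non-direct buffer size, in KB
        int m = 64;  // KB the socket buffer accepts per round (optimistic)

        int totalCopied = 0;
        for (int remaining = n; remaining > 0; remaining -= m) {
            // Each round the whole remaining() region is copied into the
            // thread-local direct buffer, but only m KB of it is consumed.
            totalCopied += remaining;
        }
        // 512 + 448 + ... + 64 = 2304 KB copied to push 512 KB of payload.
        System.out.println(totalCopied); // 2304
    }
}
```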
The copying part is a common/abstract path for all non-direct buffers, so the implementation never knows the target capacity. Copying trashes the caches and whatnot, reduces memory bandwidth, etc.
*A note on SoftReference caching: it depends on the GC implementation and the experience can vary. Sun's GC uses the free heap memory to determine the lifespan of the SoftReferences, which leads to some awkward behavior when they are freed: the application needs to allocate the previously cached objects again, i.e. more allocation. (Direct ByteBuffers take a minor part in the heap, so at least they do not contribute to the extra cache trashing, but are affected by it instead.)
My rule of thumb: a pooled direct buffer, sized to match the socket read/write buffer. The OS never copies more than necessary.
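A minimal sketch of that rule. It uses in-memory streams instead of real sockets so it runs standalone; with a real Socket you would size the buffer from socket.getSendBufferSize(), and the 64 KB here is just an assumed common default:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;

public class PooledDirectCopy {
    // One direct buffer, allocated once and reused; sized to match the
    // socket send/receive buffer so the OS never copies more than needed.
    private static final ByteBuffer POOLED = ByteBuffer.allocateDirect(64 * 1024);

    static long copy(ReadableByteChannel src, WritableByteChannel dst)
            throws IOException {
        long total = 0;
        POOLED.clear();
        while (src.read(POOLED) != -1) {
            POOLED.flip();
            while (POOLED.hasRemaining()) { // drain fully before refilling
                total += dst.write(POOLED);
            }
            POOLED.clear();
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        byte[] payload = new byte[300_000];
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copy(Channels.newChannel(new ByteArrayInputStream(payload)),
                           Channels.newChannel(sink));
        System.out.println(copied);      // 300000
        System.out.println(sink.size()); // 300000
    }
}
```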
This micro-benchmark is mostly a memory throughput test; the OS will have the file entirely in cache, so it mostly tests memcpy. Once the buffers no longer fit in the L2 cache, the drop in performance becomes noticeable. Also, running the benchmark like that imposes increasing, accumulating GC collection costs. (rest() will not collect the soft-referenced ByteBuffers.)
Answered by Bert F
Thread Local Allocation Buffers (TLAB)
I wonder if the thread local allocation buffer (TLAB) during the test is around 256K. Use of TLABs optimizes allocations from the heap so that the non-direct allocations of <=256K are fast.
What is commonly done is to give each thread a buffer that is used exclusively by that thread to do allocations. You have to use some synchronization to allocate the buffer from the heap, but after that the thread can allocate from the buffer without synchronization. In the hotspot JVM we refer to these as thread local allocation buffers (TLAB's). They work well.
Large allocations bypassing the TLAB
If my hypothesis about a 256K TLAB is correct, then information later in the article suggests that perhaps the >256K allocations for the larger non-direct buffers bypass the TLAB. These allocations go straight to the heap, requiring thread synchronization, and thus incurring the performance hits.
An allocation that can not be made from a TLAB does not always mean that the thread has to get a new TLAB. Depending on the size of the allocation and the unused space remaining in the TLAB, the VM could decide to just do the allocation from the heap. That allocation from the heap would require synchronization, but so would getting a new TLAB. If the allocation was considered large (some significant fraction of the current TLAB size), the allocation would always be done out of the heap. This cuts down on wastage and gracefully handles the much-larger-than-average allocation.
Tweaking the TLAB parameters
This hypothesis could be tested using information from a later article indicating how to tweak the TLAB and get diagnostic info:
To experiment with a specific TLAB size, two -XX flags need to be set, one to define the initial size, and one to disable the resizing:
-XX:TLABSize= -XX:-ResizeTLAB
The minimum size of a TLAB is set with -XX:MinTLABSize, which defaults to 2K bytes. The maximum size is the maximum size of an integer Java array, which is used to fill the unallocated portion of a TLAB when a GC scavenge occurs.
Diagnostic Printing Options
-XX:+PrintTLAB
Prints at each scavenge one line for each thread (starts with "TLAB: gc thread: " without the "'s) and one summary line.
Answered by Harv
I suspect that these knees are due to tripping across a CPU cache boundary. The "non-direct" buffer read()/write() implementation "cache misses" earlier due to the additional memory buffer copy compared to the "direct" buffer read()/write() implementation.
Answered by Hardcoded
There are many reasons why this could happen. Without code and/or more details about the data, we can only guess what is happening.
Some Guesses:
- Maybe you hit the maximum number of bytes that can be read at a time, so IO waits get higher or memory consumption goes up without a decrease in loops.
- Maybe you hit a critical memory limit, or the JVM is trying to free memory before a new allocation. Try playing around with the -Xmx and -Xms parameters.
- Maybe HotSpot can't/won't optimize, because the number of calls to some methods is too low.
- Maybe there are OS or hardware conditions that cause this kind of delay.
- Maybe the implementation of the JVM is just buggy ;-)