Java 使用 FileInputStream 时如何确定理想的缓冲区大小?
声明：本页面是 Stack Overflow 热门问题的中英对照翻译，遵循 CC BY-SA 4.0 协议。如果您需要使用它，必须同样遵循 CC BY-SA 许可，注明原文地址和作者信息，同时您必须将它归于原作者（不是我）：Stack Overflow
原文地址: http://stackoverflow.com/questions/236861/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): Stack Overflow
How do you determine the ideal buffer size when using FileInputStream?
提问 by ARKBAN
I have a method that creates a MessageDigest (a hash) from a file, and I need to do this to a lot of files (>= 100,000). How big should I make the buffer used to read from the files to maximize performance?
我有一种从文件创建 MessageDigest(哈希)的方法,我需要对很多文件(> = 100,000)执行此操作。我应该使用多大的缓冲区来读取文件以最大化性能?
Most everyone is familiar with the basic code (which I'll repeat here just in case):
大多数人都熟悉基本代码(我将在这里重复以防万一):
MessageDigest md = MessageDigest.getInstance( "SHA" );
FileInputStream ios = new FileInputStream( "myfile.bmp" );
byte[] buffer = new byte[4 * 1024]; // what should this value be?
int read = 0;
while( ( read = ios.read( buffer ) ) > 0 )
    md.update( buffer, 0, read );
ios.close();
md.digest();
What is the ideal size of the buffer to maximize throughput? I know this is system dependent, and I'm pretty sure it's OS, file system, and HDD dependent, and there may be other hardware/software in the mix.
使吞吐量最大化的理想缓冲区大小是多少?我知道这是依赖于系统的,我很确定它依赖于操作系统、文件系统和硬盘,并且可能还有其他硬件/软件。
(I should point out that I'm somewhat new to Java, so this may just be some Java API call I don't know about.)
(我应该指出,我对 Java 有点陌生,所以这可能只是我不知道的一些 Java API 调用。)
Edit: I do not know ahead of time the kinds of systems this will be used on, so I can't assume a whole lot. (I'm using Java for that reason.)
编辑:我事先不知道这将用于哪些系统,所以我不能假设很多。(出于这个原因,我正在使用 Java。)
Edit: The code above is missing things like try..catch to keep the post shorter
编辑：为了让帖子更简短，上面的代码省略了 try..catch 之类的内容
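A minimal runnable sketch of the same loop, using try-with-resources so the stream is closed even if an exception is thrown (the explicit "SHA-1" algorithm name and the 8192-byte buffer are placeholder choices, not recommendations):

下面是同一循环的一个最小可运行示例，使用 try-with-resources 以确保即使抛出异常流也会被关闭（显式的 "SHA-1" 算法名和 8192 字节的缓冲区只是占位选择，并非推荐值）：

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FileHasher {
    // Hash a file with a fixed-size buffer; 8192 is an assumption, not a tuned value.
    static byte[] sha1(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                md.update(buffer, 0, read);
            }
        }
        return md.digest();
    }

    // Render the digest as lowercase hex for printing/comparison.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("demo", ".bin");
        Files.write(tmp, "abc".getBytes());
        System.out.println(toHex(sha1(tmp)));
        Files.delete(tmp);
    }
}
```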
采纳答案 by Kevin Day
Optimum buffer size is related to a number of things: file system block size, CPU cache size and cache latency.
最佳缓冲区大小与许多因素有关:文件系统块大小、CPU 缓存大小和缓存延迟。
Most file systems are configured to use block sizes of 4096 or 8192. In theory, if you configure your buffer size so you are reading a few bytes more than the disk block, the operations with the file system can be extremely inefficient (i.e. if you configured your buffer to read 4100 bytes at a time, each read would require 2 block reads by the file system). If the blocks are already in cache, then you wind up paying the price of RAM -> L3/L2 cache latency. If you are unlucky and the blocks are not in cache yet, then you pay the price of the disk -> RAM latency as well.
大多数文件系统都配置为使用 4096 或 8192 的块大小。理论上,如果您配置缓冲区大小以便读取的字节数比磁盘块多几个字节,则文件系统的操作可能会非常低效(即,如果您将缓冲区配置为一次读取 4100 个字节,每次读取将需要文件系统读取 2 个块)。如果块已经在缓存中,那么您最终要付出 RAM -> L3/L2 缓存延迟的代价。如果你不走运并且块还没有在缓存中,那么你也要付出磁盘-> RAM 延迟的代价。
This is why you see most buffers sized as a power of 2, and generally larger than (or equal to) the disk block size. This means that one of your stream reads could result in multiple disk block reads - but those reads will always use a full block - no wasted reads.
这就是为什么您看到大多数缓冲区大小为 2 的幂,并且通常大于(或等于)磁盘块大小的原因。这意味着您的一个流读取可能会导致多个磁盘块读取 - 但这些读取将始终使用一个完整的块 - 不会浪费读取。
Now, this is offset quite a bit in a typical streaming scenario because the block that is read from disk is going to still be in memory when you hit the next read (we are doing sequential reads here, after all) - so you wind up paying the RAM -> L3/L2 cache latency price on the next read, but not the disk->RAM latency. In terms of order of magnitude, disk->RAM latency is so slow that it pretty much swamps any other latency you might be dealing with.
现在，在典型的顺序读取场景中，这个代价被抵消了很多，因为当您进行下一次读取时，从磁盘读取的块通常仍在内存中（毕竟我们在这里做的是顺序读取），所以您在下一次读取时付出的是 RAM -> L3/L2 缓存延迟的代价，而不是磁盘 -> RAM 的延迟。就数量级而言，磁盘 -> RAM 的延迟非常慢，几乎淹没了您可能要处理的任何其他延迟。
So, I suspect that if you ran a test with different cache sizes (haven't done this myself), you will probably find a big impact of cache size up to the size of the file system block. Above that, I suspect that things would level out pretty quickly.
因此，我怀疑如果您使用不同的缓冲区大小运行测试（我自己没有这样做过），您可能会发现在缓冲区大小达到文件系统块大小之前，它的影响很大。超过之后，我怀疑情况会很快趋于平稳。
There are a ton of conditions and exceptions here - the complexities of the system are actually quite staggering (just getting a handle on L3 -> L2 cache transfers is mind-bogglingly complex, and it changes with every CPU type).
这里有大量的条件和例外——系统的复杂性实际上是相当惊人的(仅仅处理 L3 -> L2 缓存传输是令人难以置信的复杂,并且它随着每种 CPU 类型而变化)。
This leads to the 'real world' answer: If your app is like 99% of the apps out there, set the buffer size to 8192 and move on (even better, choose encapsulation over performance and use BufferedInputStream to hide the details). If you are in the 1% of apps that are highly dependent on disk throughput, craft your implementation so you can swap out different disk interaction strategies, and provide the knobs and dials to allow your users to test and optimize (or come up with some self-optimizing system).
这导致了“真实世界”的答案：如果您的应用程序和 99% 的应用程序一样，将缓冲区大小设置为 8192 然后继续前进（更好的做法是，选择封装而非性能，使用 BufferedInputStream 隐藏细节）。如果您属于高度依赖磁盘吞吐量的那 1% 的应用程序，请精心设计您的实现，以便可以替换不同的磁盘交互策略，并提供旋钮和刻度盘，让您的用户进行测试和优化（或者构建某种自优化系统）。
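A sketch of the "encapsulation over performance" option: BufferedInputStream supplies its own buffering, and java.security.DigestInputStream folds the hashing into the read loop (the 4096-byte chunk size here is arbitrary):

下面是“选择封装而非性能”这一建议的示例：由 BufferedInputStream 负责缓冲，并用 java.security.DigestInputStream 把哈希计算折叠进读取循环（这里 4096 字节的读取块大小是任意选的）：

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;

public class BufferedHash {
    static byte[] hash(String path) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        // BufferedInputStream uses its own default internal buffer;
        // DigestInputStream updates md as bytes flow through it.
        try (InputStream in = new DigestInputStream(
                new BufferedInputStream(new FileInputStream(path)), md)) {
            byte[] chunk = new byte[4096];
            while (in.read(chunk) != -1) {
                // digest is updated as a side effect of reading
            }
        }
        return md.digest();
    }
}
```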
回答 by Jon Skeet
Yes, it's probably dependent on various things - but I doubt it will make very much difference. I tend to opt for 16K or 32K as a good balance between memory usage and performance.
是的,它可能取决于各种因素 - 但我怀疑它会产生很大的不同。我倾向于选择 16K 或 32K 作为内存使用和性能之间的良好平衡。
Note that you should have a try/finally block in the code to make sure the stream is closed even if an exception is thrown.
请注意,您应该在代码中有一个 try/finally 块,以确保即使抛出异常,流也已关闭。
回答 by Ovidiu Pacurar
In the ideal case we should have enough memory to read the file in one read operation. That would perform best, because we let the system manage the file system, allocation units and HDD at will. In practice, if you are fortunate enough to know the file sizes in advance, just use the average file size rounded up to 4K (the default allocation unit on NTFS). And best of all: create a benchmark to test multiple options.
在理想情况下，我们应该有足够的内存在一次读取操作中读完文件。那样性能最好，因为我们让系统随意管理文件系统、分配单元和硬盘。在实践中，如果您有幸提前知道文件大小，只需使用向上取整到 4K（NTFS 上的默认分配单元）的平均文件大小。最重要的是：创建一个基准来测试多个选项。
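A minimal benchmark harness along the lines this answer suggests; the candidate sizes and the 1 MiB test file are made-up choices (a real test should use your actual files, repeat the runs, and discard warm-up iterations):

下面是按照这个回答思路写的最小基准示例；候选缓冲区大小和 1 MiB 的测试文件都是随意选的（真实测试应使用您的实际文件、多次重复并丢弃预热轮次）：

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class BufferBench {
    // Time one full sequential read of `file` with the given buffer size.
    static long timeRead(Path file, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        long start = System.nanoTime();
        try (FileInputStream in = new FileInputStream(file.toFile())) {
            while (in.read(buffer) != -1) {
                // discard the data; we only measure read time
            }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("bench", ".bin");
        Files.write(tmp, new byte[1 << 20]); // 1 MiB of zeros as sample data
        for (int size : new int[]{1024, 4096, 8192, 16384, 65536}) {
            System.out.printf("%6d bytes: %d ns%n", size, timeRead(tmp, size));
        }
        Files.delete(tmp);
    }
}
```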
回答 by John Gardner
You could use the BufferedStreams/readers and then use their buffer sizes.
您可以使用 BufferedStreams/readers,然后使用它们的缓冲区大小。
I believe the BufferedXStreams are using 8192 as the buffer size, but like Ovidiu said, you should probably run a test on a whole bunch of options. It's really going to depend on the file system and disk configuration as to what the best sizes are.
我相信 BufferedXStreams 使用 8192 作为缓冲区大小，但正如 Ovidiu 所说，您可能应该对一大堆选项进行测试。最佳大小实际上取决于文件系统和磁盘配置。
回答 by Adam Rosenfield
In most cases, it really doesn't matter that much. Just pick a good size such as 4K or 16K and stick with it. If you're positive that this is the bottleneck in your application, then you should start profiling to find the optimal buffer size. If you pick a size that's too small, you'll waste time doing extra I/O operations and extra function calls. If you pick a size that's too big, you'll start seeing a lot of cache misses which will really slow you down. Don't use a buffer bigger than your L2 cache size.
在大多数情况下，这真的没有那么重要。只需选择一个合适的大小，例如 4K 或 16K，并坚持使用。如果您确定这就是应用程序的瓶颈，那么您应该开始进行性能分析，以找到最佳的缓冲区大小。如果您选择的大小太小，您将浪费时间进行额外的 I/O 操作和额外的函数调用。如果您选择的大小太大，您将开始看到大量缓存未命中，这会真正拖慢速度。不要使用大于 L2 缓存大小的缓冲区。
回答 by Maglob
As already mentioned in other answers, use BufferedInputStreams.
正如其他答案中已经提到的,使用 BufferedInputStreams。
After that, I guess the buffer size does not really matter much. Either the program is I/O bound, and growing the buffer size beyond the BufferedInputStream default will not make any big impact on performance.
在那之后，我想缓冲区大小并不那么重要。要么程序受 I/O 限制，此时把缓冲区大小增大到超过 BufferedInputStream 的默认值，并不会对性能产生任何重大影响。
Or the program is CPU bound inside the MessageDigest.update(), and majority of the time is not spent in the application code, so tweaking it will not help.
或者程序在 MessageDigest.update() 中受 CPU 限制,并且大部分时间没有花在应用程序代码上,因此对其进行调整无济于事。
(Hmm... with multiple cores, threads might help.)
(嗯......多核,线程可能会有所帮助。)
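One way the "threads might help" idea could be sketched: hash files concurrently, one task per file. This only pays off if hashing (CPU), not the disk, is the bottleneck, which the answer above deliberately leaves as a "might"; the pool size here is just the core count:

“多核下线程可能有所帮助”这个想法的一种示例写法：并发地对文件做哈希，每个文件一个任务。只有当瓶颈在哈希（CPU）而不是磁盘上时这才划算，上面的回答也特意只说了“可能”；这里的线程池大小只是取了核心数：

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelHasher {
    // Same single-file loop as in the question, with a fixed 8192-byte buffer.
    static byte[] sha1(Path file) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] buf = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        return md.digest();
    }

    // One task per file, run on a pool sized to the number of cores.
    static Map<Path, byte[]> hashAll(List<Path> files) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try {
            Map<Path, Future<byte[]>> futures = new LinkedHashMap<>();
            for (Path f : files) {
                futures.put(f, pool.submit(() -> sha1(f)));
            }
            Map<Path, byte[]> result = new LinkedHashMap<>();
            for (Map.Entry<Path, Future<byte[]>> e : futures.entrySet()) {
                result.put(e.getKey(), e.getValue().get());
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }
}
```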
回答 by Alexander
Reading files using Java NIO's FileChannel and MappedByteBuffer will most likely result in a solution that will be much faster than any solution involving FileInputStream. Basically, memory-map large files, and use direct buffers for small ones.
使用 Java NIO 的 FileChannel 和 MappedByteBuffer 读取文件很可能会产生比任何涉及 FileInputStream 的解决方案快得多的解决方案。基本上,内存映射大文件,小文件使用直接缓冲区。
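A sketch of the memory-mapped variant this answer describes. Note that MappedByteBuffer is int-indexed, so files over 2 GB would have to be mapped in chunks; MessageDigest.update(ByteBuffer) consumes the mapped buffer directly:

下面是这个回答所描述的内存映射方式的示例。注意 MappedByteBuffer 以 int 作为索引，因此超过 2 GB 的文件需要分段映射；MessageDigest.update(ByteBuffer) 可以直接消费映射出的缓冲区：

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.security.MessageDigest;

public class MappedHasher {
    // Memory-map the whole file and feed it to the digest in one update.
    // Only valid for files under 2 GB; larger files need chunked mapping.
    static byte[] sha1(Path file) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            md.update(map); // consumes the entire mapped region
        }
        return md.digest();
    }
}
```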
回答 by Adrian Krebs
1024 is appropriate for a wide variety of circumstances, although in practice you may see better performance with a larger or smaller buffer size.
1024 适用于多种情况,但实际上您可能会看到更大或更小的缓冲区大小具有更好的性能。
This would depend on a number of factors including file system block size and CPU hardware.
这将取决于许多因素,包括文件系统块大小和 CPU 硬件。
It is also common to choose a power of 2 for the buffer size, since most underlying hardware is structured with file block and cache sizes that are a power of 2. The Buffered classes allow you to specify the buffer size in the constructor. If none is provided, they use a default value, which is a power of 2 in most JVMs.
为缓冲区大小选择 2 的幂也是很常见的,因为大多数底层硬件都是由文件块和缓存大小构成的,这些大小是 2 的幂。 Buffered 类允许您在构造函数中指定缓冲区大小。如果未提供,则它们使用默认值,在大多数 JVM 中该值是 2 的幂。
Regardless of which buffer size you choose, the biggest performance increase you will see is moving from nonbuffered to buffered file access. Adjusting the buffer size may improve performance slightly, but unless you are using an extremely small or extremely large buffer size, it is unlikely to have a significant impact.
无论您选择哪种缓冲区大小,您将看到的最大性能提升是从非缓冲文件访问转移到缓冲文件访问。调整缓冲区大小可能会略微提高性能,但除非您使用极小或极大的缓冲区大小,否则不太可能产生显着影响。
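For reference, the buffer size mentioned above goes into the two-argument BufferedInputStream constructor. A small sketch (the helper name and the sizes are made up for illustration; the result is the same for any valid size):

作为参考，上面提到的缓冲区大小是通过 BufferedInputStream 的双参数构造函数传入的。一个小示例（辅助方法名和大小都是为说明而随意取的；对任何合法的大小结果都相同）：

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class BufferedSize {
    // Count bytes in a file through a BufferedInputStream whose internal
    // buffer size is chosen by the caller via the two-argument constructor.
    static long countBytes(String path, int bufferSize) throws IOException {
        try (BufferedInputStream in =
                 new BufferedInputStream(new FileInputStream(path), bufferSize)) {
            long total = 0;
            byte[] chunk = new byte[512];
            int n;
            while ((n = in.read(chunk)) != -1) {
                total += n;
            }
            return total;
        }
    }
}
```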
回答 by GoForce5500
In BufferedInputStream's source you will find: private static int DEFAULT_BUFFER_SIZE = 8192;
So it's okay for you to use that default value.
But if you can figure out some more information, you will get more valuable answers.
For example, your ADSL link may prefer a buffer of 1454 bytes, because of the TCP/IP payload size. For disks, you may use a value that matches your disk's block size.
在 BufferedInputStream 的源代码中,您会发现: private static int DEFAULT_BUFFER_SIZE = 8192;
因此,您可以使用该默认值。
但是如果你能找出更多的信息,你会得到更有价值的答案。
例如，您的 ADSL 链路可能更适合 1454 字节的缓冲区，这是由 TCP/IP 的有效载荷大小决定的。对于磁盘，您可以使用与磁盘块大小匹配的值。
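If you want to match the disk's block size, it can be queried at run time on Java 10+ via FileStore.getBlockSize(). A sketch; some file stores don't support the call, so this version falls back to -1 rather than failing:

如果您想匹配磁盘块大小，在 Java 10+ 上可以通过 FileStore.getBlockSize() 在运行时查询。下面是一个示例；有些文件存储不支持该调用，因此这个版本在不支持时返回 -1 而不是直接失败：

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class BlockSizeProbe {
    // Query the block size of the file store holding `p`.
    // FileStore.getBlockSize() exists since Java 10; some stores throw
    // UnsupportedOperationException, in which case we return -1.
    static long blockSizeOf(Path p) throws IOException {
        FileStore store = Files.getFileStore(p);
        try {
            return store.getBlockSize();
        } catch (UnsupportedOperationException e) {
            return -1;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(blockSizeOf(Paths.get(".")));
    }
}
```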