java get file size efficiently
Note: this page is a mirror of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/116574/
Asked by joshjdevl
While googling, I see that using java.io.File#length() can be slow. FileChannel has a size() method that is available as well.
Is there an efficient way in Java to get the file size?
Accepted answer by GHad
Well, I tried to measure it with the code below:
For runs = 1 and iterations = 1, the URL method is fastest most of the time, followed by channel. I ran this fresh, with some pauses, about 10 times. So for one-time access, using the URL is the fastest way I can think of:
LENGTH sum: 10626, per Iteration: 10626.0
CHANNEL sum: 5535, per Iteration: 5535.0
URL sum: 660, per Iteration: 660.0
For runs = 5 and iterations = 50, the picture looks different:
LENGTH sum: 39496, per Iteration: 157.984
CHANNEL sum: 74261, per Iteration: 297.044
URL sum: 95534, per Iteration: 382.136
File must be caching the calls to the filesystem, while channels and URL have some overhead.
Code:
import java.io.*;
import java.net.*;
import java.util.*;
public enum FileSizeBench {
LENGTH {
@Override
public long getResult() throws Exception {
File me = new File(FileSizeBench.class.getResource(
"FileSizeBench.class").getFile());
return me.length();
}
},
CHANNEL {
@Override
public long getResult() throws Exception {
FileInputStream fis = null;
try {
File me = new File(FileSizeBench.class.getResource(
"FileSizeBench.class").getFile());
fis = new FileInputStream(me);
return fis.getChannel().size();
} finally {
fis.close();
}
}
},
URL {
@Override
public long getResult() throws Exception {
InputStream stream = null;
try {
URL url = FileSizeBench.class
.getResource("FileSizeBench.class");
stream = url.openStream();
return stream.available();
} finally {
stream.close();
}
}
};
public abstract long getResult() throws Exception;
public static void main(String[] args) throws Exception {
int runs = 5;
int iterations = 50;
EnumMap<FileSizeBench, Long> durations = new EnumMap<FileSizeBench, Long>(FileSizeBench.class);
for (int i = 0; i < runs; i++) {
for (FileSizeBench test : values()) {
if (!durations.containsKey(test)) {
                    durations.put(test, 0L);
}
long duration = testNow(test, iterations);
durations.put(test, durations.get(test) + duration);
// System.out.println(test + " took: " + duration + ", per iteration: " + ((double)duration / (double)iterations));
}
}
for (Map.Entry<FileSizeBench, Long> entry : durations.entrySet()) {
System.out.println();
System.out.println(entry.getKey() + " sum: " + entry.getValue() + ", per Iteration: " + ((double)entry.getValue() / (double)(runs * iterations)));
}
}
private static long testNow(FileSizeBench test, int iterations)
throws Exception {
        long expected = -1;
        long before = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            long result = test.getResult();
            // System.out.println(result);
            if (expected == -1) {
                expected = result;
            } else if (result != expected) {
                // fail if the reported size varies between iterations
                throw new Exception("variance detected!");
            }
        }
return (System.nanoTime() - before) / 1000;
}
}
Answered by tgdavies
When I modify your code to use a file accessed by an absolute path instead of a resource, I get a different result (for 1 run, 1 iteration, and a 100,000-byte file; the times for a 10-byte file are identical to those for 100,000 bytes):
LENGTH sum: 33, per Iteration: 33.0
CHANNEL sum: 3626, per Iteration: 3626.0
URL sum: 294, per Iteration: 294.0
Answered by rgrig
The benchmark given by GHad measures lots of other stuff (such as reflection, instantiating objects, etc.) besides getting the length. If we try to get rid of those things, then for one call I get the following times in microseconds:
file    sum___19.0, per Iteration___19.0
raf     sum___16.0, per Iteration___16.0
channel sum__273.0, per Iteration__273.0
For 100 runs and 10000 iterations I get:
file    sum__1767629.0, per Iteration__1.7676290000000001
raf     sum___881284.0, per Iteration__0.8812840000000001
channel sum___414286.0, per Iteration__0.414286
I ran the following modified code, giving the name of a 100 MB file as an argument.
import java.io.*;
import java.nio.channels.*;
import java.net.*;
import java.util.*;
public class FileSizeBench {
private static File file;
private static FileChannel channel;
private static RandomAccessFile raf;
public static void main(String[] args) throws Exception {
int runs = 1;
int iterations = 1;
file = new File(args[0]);
channel = new FileInputStream(args[0]).getChannel();
raf = new RandomAccessFile(args[0], "r");
HashMap<String, Double> times = new HashMap<String, Double>();
times.put("file", 0.0);
times.put("channel", 0.0);
times.put("raf", 0.0);
long start;
for (int i = 0; i < runs; ++i) {
long l = file.length();
start = System.nanoTime();
for (int j = 0; j < iterations; ++j)
if (l != file.length()) throw new Exception();
times.put("file", times.get("file") + System.nanoTime() - start);
start = System.nanoTime();
for (int j = 0; j < iterations; ++j)
if (l != channel.size()) throw new Exception();
times.put("channel", times.get("channel") + System.nanoTime() - start);
start = System.nanoTime();
for (int j = 0; j < iterations; ++j)
if (l != raf.length()) throw new Exception();
times.put("raf", times.get("raf") + System.nanoTime() - start);
}
for (Map.Entry<String, Double> entry : times.entrySet()) {
System.out.println(
entry.getKey() + " sum: " + 1e-3 * entry.getValue() +
", per Iteration: " + (1e-3 * entry.getValue() / runs / iterations));
}
}
}
Answered by Karthikeyan
In response to rgrig's benchmark, the time taken to open/close the FileChannel & RandomAccessFile instances also needs to be taken into account, as these classes will open a stream for reading the file.
After modifying the benchmark, I got these results for 1 iteration on an 85 MB file:
file totalTime: 48000 (48 us)
raf totalTime: 261000 (261 us)
channel totalTime: 7020000 (7 ms)
For 10,000 iterations on the same file:
file totalTime: 80074000 (80 ms)
raf totalTime: 295417000 (295 ms)
channel totalTime: 368239000 (368 ms)
If all you need is the file size, file.length() is the fastest way to do it. If you plan to use the file for other purposes like reading/writing, then RAF seems to be a better bet. Just don't forget to close the file connection :-)
import java.io.File;
import java.io.FileInputStream;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.util.HashMap;
import java.util.Map;
public class FileSizeBench
{
public static void main(String[] args) throws Exception
{
int iterations = 1;
String fileEntry = args[0];
Map<String, Long> times = new HashMap<String, Long>();
times.put("file", 0L);
times.put("channel", 0L);
times.put("raf", 0L);
long fileSize;
long start;
long end;
File f1;
FileChannel channel;
RandomAccessFile raf;
for (int i = 0; i < iterations; i++)
{
// file.length()
start = System.nanoTime();
f1 = new File(fileEntry);
fileSize = f1.length();
end = System.nanoTime();
times.put("file", times.get("file") + end - start);
// channel.size()
start = System.nanoTime();
channel = new FileInputStream(fileEntry).getChannel();
fileSize = channel.size();
channel.close();
end = System.nanoTime();
times.put("channel", times.get("channel") + end - start);
// raf.length()
start = System.nanoTime();
raf = new RandomAccessFile(fileEntry, "r");
fileSize = raf.length();
raf.close();
end = System.nanoTime();
times.put("raf", times.get("raf") + end - start);
}
for (Map.Entry<String, Long> entry : times.entrySet()) {
System.out.println(entry.getKey() + " totalTime: " + entry.getValue() + " (" + getTime(entry.getValue()) + ")");
}
}
public static String getTime(Long timeTaken)
{
if (timeTaken < 1000) {
return timeTaken + " ns";
} else if (timeTaken < (1000*1000)) {
return timeTaken/1000 + " us";
} else {
return timeTaken/(1000*1000) + " ms";
}
}
}
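As a footnote to the "just don't forget to close" advice above: on Java 7 and later, the same measurements can be written with try-with-resources, so the handles are closed even when an exception is thrown. A minimal sketch, not part of the original answer (FileChannel.open is the NIO.2 way to obtain a channel without going through FileInputStream):

import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class CloseSafelySketch
{
    // The channel is closed automatically when the try block exits.
    static long viaChannel(String fileEntry) throws Exception
    {
        try (FileChannel channel = FileChannel.open(Paths.get(fileEntry), StandardOpenOption.READ))
        {
            return channel.size();
        }
    }

    // RandomAccessFile implements AutoCloseable as of Java 7.
    static long viaRaf(String fileEntry) throws Exception
    {
        try (RandomAccessFile raf = new RandomAccessFile(fileEntry, "r"))
        {
            return raf.length();
        }
    }
}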
Answered by Ben Spink
Actually, I think "ls" may be faster. There are definitely some issues in Java around getting file info. Unfortunately there is no equivalent safe method of recursive ls for Windows. (cmd.exe's DIR /S can get confused and generate errors in infinite loops.)
On XP, accessing a server on the LAN, it takes me 5 seconds in Windows to get the count of the files in a folder (33,000), and the total size.
When I iterate recursively through this in Java, it takes me over 5 minutes. I started measuring the time it takes to do file.length(), file.lastModified(), and file.toURI(), and what I found is that 99% of my time is taken by those 3 calls, the 3 calls I actually need to do.
The difference for 1000 files is 15 ms locally versus 1800 ms on the server. Server path scanning in Java is ridiculously slow. If the native OS can scan that same folder quickly, why can't Java?
As a more complete test, I used WinMerge on XP to compare the modified date and size of the files on the server versus the files locally, iterating over the entire directory tree of 33,000 files in each folder. Total time: 7 seconds. Java: over 5 minutes.
So the original statement and question from the OP are true and valid. It's less noticeable when dealing with a local file system. Doing a local compare of the folder with 33,000 items takes 3 seconds in WinMerge and 32 seconds in Java. So again, Java versus native is a 10x slowdown in these rudimentary tests.
Java 1.6.0_22 (latest at the time), Gigabit LAN and network connections, ping less than 1 ms (both machines on the same switch).
Java is slow.
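(A note not in the original answer: Java 7's NIO.2, newer than the Java 6 used above, can return the size and modification time together from a single Files.readAttributes call, one filesystem query instead of one per java.io.File method. A minimal sketch, assuming the path arrives as a command-line argument:)

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.BasicFileAttributes;

public class AttributesSketch {
    public static void main(String[] args) throws IOException {
        Path path = Paths.get(args[0]);
        // One call fetches size and timestamps together.
        BasicFileAttributes attrs = Files.readAttributes(path, BasicFileAttributes.class);
        System.out.println(attrs.size() + " bytes, modified " + attrs.lastModifiedTime());
    }
}

Whether this actually helps on a network share depends on the filesystem provider, so it is worth benchmarking in the same setup.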
Answered by StuartH
All the test cases in this post are flawed, as they access the same file for each method tested, so disk caching kicks in and tests 2 and 3 benefit from it. To prove my point I took the test case provided by GHad and changed the order of enumeration; the results are below.
Looking at the results, I think File.length() really is the winner.
The order of the tests is the order of the output. You can even see that the time taken on my machine varied between executions, but File.length(), when not first and thus not incurring the first disk access, won.
---
LENGTH sum: 1163351, per Iteration: 4653.404
CHANNEL sum: 1094598, per Iteration: 4378.392
URL sum: 739691, per Iteration: 2958.764
---
CHANNEL sum: 845804, per Iteration: 3383.216
URL sum: 531334, per Iteration: 2125.336
LENGTH sum: 318413, per Iteration: 1273.652
---
URL sum: 137368, per Iteration: 549.472
LENGTH sum: 18677, per Iteration: 74.708
CHANNEL sum: 142125, per Iteration: 568.5
Answered by Ben Spink
I ran into this same issue. I needed to get the file size and modified date of 90,000 files on a network share. Using Java, and being as minimalistic as possible, it would take a very long time. (I needed to get the URL from the file, and the path of the object as well, so it varied somewhat, but more than an hour.) I then used a native Win32 executable to do the same task, just dumping the file path, modified date, and size to the console, and executed that from Java. The speed was amazing. The native process, plus my string handling to read the data, could process over 1000 items a second.
So even though people down-voted the above comment, this is a valid solution, and it did solve my issue. In my case I knew ahead of time which folders I needed the sizes of, and I could pass that on the command line to my Win32 app. Processing a directory went from hours to minutes.
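A rough sketch of that pattern follows; the helper binary name (listfiles.exe) and its tab-separated "path, modified, size" output format are made up for illustration:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class NativeListingSketch {
    public static void main(String[] args) throws IOException {
        // Hypothetical native binary that prints one "path<TAB>modified<TAB>size" line per file.
        ProcessBuilder pb = new ProcessBuilder("listfiles.exe", args[0]);
        pb.redirectErrorStream(true);
        Process process = pb.start();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t");
                if (parts.length < 3) {
                    continue; // skip malformed lines
                }
                System.out.println(parts[0] + " -> " + parts[2] + " bytes");
            }
        }
    }
}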
The issue also seemed to be Windows-specific. OS X did not have the same issue and could access network file info as fast as the OS could.
Java file handling on Windows is terrible. Local disk access for files is fine, though; it was just network shares that caused the terrible performance. Windows could get info on the network share and calculate the total size in under a minute, too.
--Ben
Answered by Gob00st
From GHad's benchmark, there are a few issues people have mentioned:
1> As BalusC mentioned, stream.available() is flawed in this case, because available() returns an estimate of the number of bytes that can be read (or skipped over) from this input stream without blocking by the next invocation of a method for this input stream.
So first, remove the URL approach.
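(For completeness: if the size really must come through a URL, the declared content length is the reliable value, not available(). A minimal sketch; getContentLengthLong() assumes Java 7+, and on older JDKs getContentLength() returns an int instead:)

import java.net.URL;
import java.net.URLConnection;

public class UrlSizeSketch {
    // Returns the declared content length, or -1 if it is unknown.
    static long sizeViaUrl(URL url) throws Exception {
        URLConnection connection = url.openConnection();
        return connection.getContentLengthLong();
    }
}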
2> As StuartH mentioned, the order in which the tests run also makes a caching difference, so take that out by running each test separately.
Now run the tests:
When the CHANNEL one runs alone:
CHANNEL sum: 59691, per Iteration: 238.764
When the LENGTH one runs alone:
LENGTH sum: 48268, per Iteration: 193.072
So it looks like the LENGTH one is the winner here:
@Override
public long getResult() throws Exception {
File me = new File(FileSizeBench.class.getResource(
"FileSizeBench.class").getFile());
return me.length();
}
Answered by Scg
If you want the file size of multiple files in a directory, use Files.walkFileTree. You can obtain the size from the BasicFileAttributes that you'll receive.
This is much faster than calling .length() on the result of File.listFiles() or using Files.size() on the result of Files.newDirectoryStream(). In my test cases it was about 100 times faster.
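A minimal sketch of that approach (the class name and command-line argument are illustrative, not from the original answer). The visitor receives each file's BasicFileAttributes directly, so no extra per-file call is needed to read the size:

import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

public class DirectorySizeSketch {
    public static void main(String[] args) throws IOException {
        final long[] total = {0};
        Files.walkFileTree(Paths.get(args[0]), new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                total[0] += attrs.size(); // size comes from the attributes already in hand
                return FileVisitResult.CONTINUE;
            }
        });
        System.out.println("Total: " + total[0] + " bytes");
    }
}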