How to list a 2 million files directory in java without having an "out of memory" exception

Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/3139073/
Asked by Fgblanch
I have to deal with a directory of about 2 million XML files that need to be processed.
I've already solved the processing itself by distributing the work between machines and threads using queues, and everything works fine.
But now the big problem is the bottleneck of reading the directory with the 2 million files in order to fill the queues incrementally.
I've tried using the File.listFiles() method, but it gives me a java out of memory: heap space exception. Any ideas?
Accepted answer by aioobe
First of all, do you have any possibility to use Java 7? There you have a FileVisitor and Files.walkFileTree, which should probably work within your memory constraints.
Otherwise, the only way I can think of is to use File.listFiles(FileFilter filter) with a filter that always returns false (ensuring that the full array of files is never kept in memory), but that catches the files to be processed along the way, and perhaps puts them in a producer/consumer queue or writes the file-names to disk for later traversal.
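A minimal sketch of that idea, assuming a bounded producer/consumer queue (and note the update below: the underlying list() call may still exhaust the heap on its own):

import java.io.File;
import java.io.FileFilter;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class StreamingLister {
    public static void main(String[] args) {
        // Bounded queue: the producer blocks instead of filling the heap.
        final BlockingQueue<File> queue = new LinkedBlockingQueue<File>(10000);

        // Consumer threads would take() from the queue (omitted here).
        File dir = new File("/path/to/xml/dir"); // hypothetical path

        // The filter enqueues each file as a side effect; returning false
        // means listFiles() never accumulates File objects in its result.
        dir.listFiles(new FileFilter() {
            public boolean accept(File file) {
                try {
                    queue.put(file); // blocks while the queue is full
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                return false;
            }
        });
    }
}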
Alternatively, if you control the names of the files, or if they are named in some nice way, you could process the files in chunks using a filter that accepts filenames of the form file0000000 - file0001000, then file0001000 - file0002000, and so on.
If the names are not named in a nice way like this, you could try filtering them based on the hash-code of the file-name, which is supposed to be fairly evenly distributed over the set of integers.
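A hedged sketch of that hash-based partitioning; the chunk count and the ownership test are illustrative assumptions:

import java.io.File;
import java.io.FilenameFilter;

public class HashChunkFilter implements FilenameFilter {
    private final int chunks;  // number of partitions
    private final int mine;    // the partition this pass handles

    public HashChunkFilter(int chunks, int mine) {
        this.chunks = chunks;
        this.mine = mine;
    }

    public boolean accept(File dir, String name) {
        // Mask the sign bit rather than using Math.abs, which
        // overflows for Integer.MIN_VALUE.
        return (name.hashCode() & 0x7fffffff) % chunks == mine;
    }
}

Usage would be something like dir.listFiles(new HashChunkFilter(100, i)) for i from 0 to 99, so each pass only materializes about 1% of the directory.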
Update: Sigh. Probably won't work. Just had a look at the listFiles implementation:
public File[] listFiles(FilenameFilter filter) {
    String ss[] = list();
    if (ss == null) return null;
    ArrayList v = new ArrayList();
    for (int i = 0 ; i < ss.length ; i++) {
        if ((filter == null) || filter.accept(this, ss[i])) {
            v.add(new File(ss[i], this));
        }
    }
    return (File[])(v.toArray(new File[v.size()]));
}
so it will probably fail at the first line anyway... Sort of disappointing. I believe your best option is to put the files in different directories.
Btw, could you give an example of a file name? Are they "guessable"? Like
for (int i = 0; i < 100000; i++)
    tryToOpen(String.format("file%05d", i))
Answered by Jörn Horstmann
If Java 7 is not an option, this hack will work (for UNIX):
Process process = Runtime.getRuntime().exec(new String[]{"ls", "-f", "/path"});
BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()));
String line;
while (null != (line = reader.readLine())) {
    if (line.startsWith("."))
        continue;
    System.out.println(line);
}
The -f parameter will speed it up (from man ls):
-f do not sort, enable -aU, disable -lst
Answered by Jaime Hablutzel
In case you can use Java 7, this can be done in the following way and you won't have those out of memory problems.
Path path = FileSystems.getDefault().getPath("C:\\path\\with\\lots\\of\\files");
Files.walkFileTree(path, new FileVisitor<Path>() {
    @Override
    public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) throws IOException {
        return FileVisitResult.CONTINUE;
    }

    @Override
    public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
        // here you have the files to process
        System.out.println(file);
        return FileVisitResult.CONTINUE;
    }

    @Override
    public FileVisitResult visitFileFailed(Path file, IOException exc) throws IOException {
        return FileVisitResult.TERMINATE;
    }

    @Override
    public FileVisitResult postVisitDirectory(Path dir, IOException exc) throws IOException {
        return FileVisitResult.CONTINUE;
    }
});
Answered by Michael Borgwardt
Use File.list() instead of File.listFiles() - the String objects it returns consume less memory than the File objects, and (more importantly, depending on the location of the directory) they don't contain the full path name.
Then, construct File objects as needed when processing the result.
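A minimal sketch of that approach; the directory path and the process method are assumptions:

File dir = new File("/path/to/xml/dir");
String[] names = dir.list(); // plain names only, no File objects yet
if (names != null) {
    for (String name : names) {
        File f = new File(dir, name); // constructed on demand, one at a time
        process(f);                   // hypothetical processing method
    }
}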
However, this will not work for arbitrarily large directories either. It's an overall better idea to organize your files in a hierarchy of directories so that no single directory has more than a few thousand entries.
Answered by M4nux
You can do that with the FileUtils class from the Apache Commons IO library. No memory problem. I did check with visualvm.
Iterator<File> it = FileUtils.iterateFiles(folder, null, true);
while (it.hasNext()) {
    File fileEntry = it.next(); // no cast needed with Iterator<File>
    // process fileEntry here
}
Hope that helps. Bye.
Answered by Ross Judson
Since you're on Windows, it seems like you should have simply used ProcessBuilder to start something like "cmd /k dir /b target_directory", capture the output of that, and route it into a file. You can then process that file a line at a time, reading the file names out and dealing with them.
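A hedged sketch of that idea, streaming the command's output instead of buffering it all; the target directory is an assumption, and cmd /c is used instead of /k so the shell exits when the listing is done:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class WindowsDirListing {
    public static void main(String[] args) throws IOException {
        // "dir /b" prints bare file names, one per line.
        ProcessBuilder pb = new ProcessBuilder("cmd", "/c", "dir", "/b", "C:\\target_directory");
        Process process = pb.start();
        BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()));
        String name;
        while ((name = reader.readLine()) != null) {
            // Process each name here, or append it to a file for later,
            // so the 2 million entries never sit in the heap at once.
            System.out.println(name);
        }
        reader.close();
    }
}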
Better late than never? ;)
Answered by kbolino
This also requires Java 7, but it's simpler than the Files.walkFileTree answer if you just want to list the contents of a directory and not walk the whole tree:
Path dir = Paths.get("/some/directory");
try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
    for (Path path : stream) {
        handleFile(path.toFile());
    }
} catch (IOException e) {
    handleException(e);
}
The implementation of DirectoryStream is platform-specific and never calls File.list or anything like it, instead using the Unix or Windows system calls that iterate over a directory one entry at a time.
Answered by Péter Török
Why do you store 2 million files in the same directory anyway? I can imagine it slows down access terribly on the OS level already.
I would definitely want to have them divided into subdirectories (e.g. by date/time of creation) before processing. But if that is not possible for some reason, could it be done during processing? E.g. move 1000 files queued for Process1 into Directory1, another 1000 files for Process2 into Directory2, etc. Then each process/thread sees only the (limited number of) files portioned out for it.
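A minimal sketch of that on-the-fly partitioning; the paths and the batch size of 1000 are illustrative assumptions:

import java.io.File;

public class Partitioner {
    // Moves files into per-worker subdirectories, 1000 per directory.
    public static void partition(File source) {
        String[] names = source.list(); // one String[] listing up front,
                                        // much smaller than a File[] array
        if (names == null) return;
        for (int i = 0; i < names.length; i++) {
            File workerDir = new File(source, "worker" + (i / 1000));
            workerDir.mkdir(); // no-op if it already exists
            new File(source, names[i]).renameTo(new File(workerDir, names[i]));
        }
    }
}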
Answered by Thorbjørn Ravn Andersen
Please post the full stack trace of the OOM exception to identify where the bottleneck is, as well as a short, complete Java program showing the behaviour you see.
It is most likely because you collect all of the two million entries in memory, and they don't fit. Can you increase heap space?
Answered by InsertNickHere
At first you could try to increase the memory of your JVM by passing -Xmx1024m, e.g.:
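For example, on the command line (the main class name is a placeholder):

java -Xmx1024m com.example.XmlProcessor

As other answers note, this only raises the ceiling; the streaming approaches above avoid keeping all 2 million entries in memory at once.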

