How to retrieve a list of directories QUICKLY in Java?

Disclaimer: this page is a translated copy of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must license it under the same terms and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/1034977/

Date: 2020-08-11 22:31:26  Source: igfitidea

How to retrieve a list of directories QUICKLY in Java?

java, performance, file-io, filesystems

Asked by erotsppa

Suppose a very simple program that lists out all the subdirectories of a given directory. Sounds simple enough? Except that the only way to list all subdirectories in Java is to use a FilenameFilter combined with File.list().


This works for the trivial case, but when the folder has, say, 150,000 files and 2 sub-folders, it's silly waiting there for 45 seconds iterating through all the files and testing file.isDirectory(). Is there a better way to list sub-directories?

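For reference, the slow pattern being complained about looks roughly like this (the class and method names here are just for illustration); every entry pays a per-file isDirectory() probe:

```java
import java.io.File;
import java.io.FileFilter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SlowDirList {
    // Lists subdirectories by scanning every entry and calling isDirectory()
    // on each one -- the O(n) stat-per-entry approach that becomes painful
    // with 150,000 files in a single directory.
    public static List<File> listSubdirectories(File parent) {
        File[] entries = parent.listFiles(new FileFilter() {
            public boolean accept(File f) {
                return f.isDirectory(); // one file-system probe per entry
            }
        });
        if (entries == null) {
            return new ArrayList<File>(); // not a directory, or I/O error
        }
        return new ArrayList<File>(Arrays.asList(entries));
    }
}
```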



PS. Sorry, please save the lectures on having too many files in the same directory. Our live environment has this as part of the requirement.


Answered by Hardwareguy

You could hack it if the 150k files all (or a significant number of them) had a similar naming convention like:


*.jpg
*Out.txt

and only actually create file objects for the ones you are unsure about being a folder.


Answered by Nick

Maybe you could write a directory-searching program in C#/C/C++ and use JNI to get it to Java. I don't know if this would improve performance or not.


Answered by akarnokd

In that case you might try some JNA solution - a platform-dependent directory traverser (FindFirst, FindNext on Windows) with the possibility of some iteration pattern. Also, Java 7 will have much better file system support; it's worth checking out the specs (I don't remember any specifics).


Edit: An idea: one option is to hide the slowness of the directory listing from the user's eyes. In a client-side app, you could use some animation while the listing is working to distract the user. It actually depends on what else your application does besides the listing.


Answered by kdgregory

There's actually a reason why you got the lectures: it's the correct answer to your problem. Here's the background, so that perhaps you can make some changes in your live environment.


First: directories are stored on the filesystem; think of them as files, because that's exactly what they are. When you iterate through the directory, you have to read those blocks from the disk. Each directory entry will require enough space to hold the filename, and permissions, and information on where that file is found on-disk.


Second: directories aren't stored with any internal ordering (at least, not in the filesystems where I've worked with directory files). If you have 150,000 entries and 2 sub-directories, those 2 sub-directory references could be anywhere within the 150,000. You have to iterate to find them, there's no way around that.


So, let's say that you can't avoid the big directory. Your only real option is to try to keep the blocks comprising the directory file in the in-memory cache, so that you're not hitting the disk every time you access them. You can achieve this by regularly iterating over the directory in a background thread -- but this is going to cause undue load on your disks, and interfere with other processes. Alternatively, you can scan once and keep track of the results.


The alternative is to create a tiered directory structure. If you look at commercial websites, you'll see URLs like /1/150/15023.html -- this is meant to keep the number of files per directory small. Think of it as a BTree index in a database.


Of course, you can hide that structure: you can create a filesystem abstraction layer that takes filenames and automatically generates the directory tree where those filenames can be found.

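A minimal sketch of such an abstraction layer, assuming a hash-based two-level split (256 buckets per level is an arbitrary choice for illustration, as is the class name):

```java
public class TieredPath {
    // Maps a flat filename onto a two-level directory tree using a hash of
    // the name, so no single directory grows huge. The depth and fan-out
    // (256 buckets per level) are arbitrary choices for this sketch.
    public static String toTieredPath(String fileName) {
        int h = fileName.hashCode();
        int level1 = (h >>> 8) & 0xFF; // first tier:  0..255
        int level2 = h & 0xFF;         // second tier: 0..255
        return String.format("%02x/%02x/%s", level1, level2, fileName);
    }
}
```

Because the mapping is deterministic, callers never need to list the big directory to find a file; they recompute its tiered path from the name alone.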

Answered by Yoni Roit

Well, either JNI, or, if you say your deployment is constant, just run "dir" on Windows or "ls" on *nixes, with appropriate flags to list only directories (Runtime.exec())


Answered by DVK

Do you know the finite list of possible subdirectory names? If so, use a loop over all possible names and check for directory's existence.

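With a known candidate list, this is one existence probe per name instead of a full directory scan; a sketch (the class name and candidate array are hypothetical):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class KnownNames {
    // Probes each candidate name directly instead of listing the whole
    // directory: one stat() per candidate, independent of how many files
    // the directory actually contains.
    public static List<File> findExistingDirs(File parent, String[] candidates) {
        List<File> found = new ArrayList<File>();
        for (String name : candidates) {
            File f = new File(parent, name);
            if (f.isDirectory()) {
                found.add(f);
            }
        }
        return found;
    }
}
```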

Otherwise, you cannot get ONLY directory names in most underlying OSs (e.g. in Unix, a directory listing is merely reading the contents of the "directory" file, so there's no way to find "just directories" quickly without listing all the files).


However, in NIO.2 in Java7 (see http://java.sun.com/developer/technicalArticles/javase/nio/#3), there's a way to have a streaming directory list so you don't get a full array of file elements cluttering your memory/network.

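With NIO.2, the streaming listing looks roughly like this (a sketch; the filter is applied as entries are read, so no full array is built up front, and for simplicity an unreadable directory is treated as empty):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class Nio2DirList {
    // Streams directory entries via DirectoryStream instead of materializing
    // a full File[] array; entries are filtered as they are read.
    public static List<Path> listSubdirectories(String dir) {
        List<Path> dirs = new ArrayList<Path>();
        DirectoryStream.Filter<Path> onlyDirs = new DirectoryStream.Filter<Path>() {
            public boolean accept(Path entry) throws IOException {
                return Files.isDirectory(entry);
            }
        };
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(Paths.get(dir), onlyDirs)) {
            for (Path p : stream) {
                dirs.add(p);
            }
        } catch (IOException e) {
            // For this sketch, treat a missing or unreadable directory as empty.
        }
        return dirs;
    }
}
```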

Answered by lavinio

I don't know if the overhead of shelling out to cmd.exe would eat it up, but one possibility would be something like this:


...
Runtime r = Runtime.getRuntime();
// Note: /c (run the command and exit) rather than /k -- /k keeps cmd.exe
// alive, so the stream never reaches end-of-file and the loop below would
// block forever. The backslash in the path must also be escaped in Java.
Process p = r.exec("cmd.exe /c dir /s /b /ad C:\\folder");
BufferedReader br = new BufferedReader(new InputStreamReader(p.getInputStream()));
String d;
while ((d = br.readLine()) != null) {
    System.out.println(d);
}
br.close();
...
  • /s means search subdirectories
  • /ad means only return directories
  • /b means return the full pathname from the root

Answered by dfa

if your OS is 'stable', give JNA a try:


These are all "streaming" APIs: they don't force you to allocate a 150k list/array before you start searching. IMHO this is a great advantage in your scenario.


Answered by Emil H

As has already been mentioned, this is basically a hardware problem. Disk access is always slow, and most file systems aren't really designed to handle directories with that many files.


If, for some reason, you have to store all the files in the same directory, I think you'll have to maintain your own cache. This could be done using a local database such as SQLite, HeidiSQL or HSQL. If you want extreme performance, use a Java TreeSet and cache it in memory. This means, at the very least, that you'll have to read the directory less often, and it could possibly be done in the background. You could reduce the need to refresh the list even further by using your system's native file-update notification API (inotify on Linux) to subscribe to changes to the directory.

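A minimal sketch of hooking into that native notification facility via Java 7's WatchService (backed by inotify on Linux); the class name is illustrative, and error handling is collapsed to returning null:

```java
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchService;

public class DirWatcher {
    // Registers a directory with the platform's native change-notification
    // mechanism so the cached listing only needs refreshing when an entry
    // is actually created or deleted. Returns null if watching fails.
    public static WatchService watch(String dir) {
        try {
            WatchService ws = FileSystems.getDefault().newWatchService();
            Paths.get(dir).register(ws,
                    StandardWatchEventKinds.ENTRY_CREATE,
                    StandardWatchEventKinds.ENTRY_DELETE);
            return ws;
        } catch (IOException e) {
            return null; // directory missing, or watching unsupported
        }
    }
}
```

A background thread would then call ws.take() and update the in-memory cache as events arrive, instead of re-scanning the directory on a timer.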

This doesn't seem to be possible for you, but I once solved a similar problem by "hashing" the files into subdirectories. In my case, the challenge was to store a couple of million images with numeric ids. I constructed the directory structure as follows:


images/[id - (id % 1000000)]/[id - (id % 1000)]/[id].jpg

This has worked well for us, and it's the solution that I would recommend. You could do something similar with alpha-numeric filenames by simply taking the first two letters of the filename, and then the next two. I've done this as well once, and it did the job too.

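The id-to-path formula above can be written directly as a small helper (the `images` root and `.jpg` suffix are taken from the answer; the class name is illustrative):

```java
public class ImagePath {
    // Buckets numeric ids into a million-level and a thousand-level
    // directory, exactly as in the formula above:
    //   images/[id - (id % 1000000)]/[id - (id % 1000)]/[id].jpg
    public static String pathFor(long id) {
        long millionBucket = id - (id % 1000000);  // e.g. 2345678 -> 2000000
        long thousandBucket = id - (id % 1000);    // e.g. 2345678 -> 2345000
        return "images/" + millionBucket + "/" + thousandBucket + "/" + id + ".jpg";
    }
}
```

Each leaf directory then holds at most 1000 images, so no single listing ever gets large.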

Answered by dfa

There is also a recursive parallel scan described at http://blogs.oracle.com/adventures/entry/fast_directory_scanning. Essentially, siblings are processed in parallel. There are also encouraging performance tests.
