Original URL: http://stackoverflow.com/questions/24647992/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Wildcard in Hadoop's FileSystem listing API calls
Asked by snooze92
tl;dr: To be able to use wildcards (globs) in the listed paths, one simply has to use globStatus(...) instead of listStatus(...).
Context
Files on my HDFS cluster are organized in partitions, with the date being the "root" partition. A simplified example of the file structure would look like this:
/schemas_folder
├── date=20140101
│   ├── A-schema.avsc
│   └── B-schema.avsc
├── date=20140102
│   ├── A-schema.avsc
│   ├── B-schema.avsc
│   └── C-schema.avsc
└── date=20140103
    ├── B-schema.avsc
    └── C-schema.avsc
In my case, the directory stores Avro schemas for different types of data (A, B and C in this example) at different dates. A schema might start existing, evolve, and stop existing... as time passes.
Goal
I need to be able to get all the schemas that exist for a given type, as quickly as possible. In the example where I would like to get all the schemas that exist for type A, I would like to do the following:
hdfs dfs -ls /schemas_folder/date=*/A-schema.avsc
That would give me
Found 1 items
-rw-r--r-- 3 user group 1234 2014-01-01 12:34 /schemas_folder/date=20140101/A-schema.avsc
Found 1 items
-rw-r--r-- 3 user group 2345 2014-01-02 23:45 /schemas_folder/date=20140102/A-schema.avsc
Problem
I don't want to be using the shell command, and cannot seem to find the equivalent of that command in the Java APIs. When I try to implement the looping myself, I get terrible performance. I want at least the performance of the command line (around 3 seconds in my case)...
What I found so far
One can notice that it prints Found 1 items twice, once before each result, rather than printing Found 2 items once at the beginning. That probably hints that wildcards are not implemented on the FileSystem side but are somehow handled by the client. I can't seem to find the right source code to look at to see how that is implemented.
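The twice-printed Found 1 items is indeed consistent with client-side expansion: globStatus(...) expands the pattern on the client (in recent Hadoop versions the logic lives in org.apache.hadoop.fs.Globber) and issues one plain listing per matched directory. As a rough illustration only (a simplified sketch, not Hadoop's actual code), client-side glob expansion boils down to translating each glob segment into a regex and filtering directory listings:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class GlobSketch {
    // Translate a simple glob (only '*' and '?' here) into an anchored regex.
    static Pattern globToRegex(String glob) {
        StringBuilder sb = new StringBuilder("^");
        for (char c : glob.toCharArray()) {
            switch (c) {
                case '*': sb.append("[^/]*"); break; // '*' does not cross path separators
                case '?': sb.append("[^/]");  break;
                default:  sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.append('$').toString());
    }

    // Expand one glob segment against a directory listing, the way a client
    // could after a plain (non-glob) listing RPC.
    static List<String> expand(List<String> children, String globSegment) {
        Pattern p = globToRegex(globSegment);
        return children.stream()
                .filter(name -> p.matcher(name).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> children = Arrays.asList("date=20140101", "date=20140102", "_SUCCESS");
        System.out.println(expand(children, "date=*")); // prints [date=20140101, date=20140102]
    }
}
```

A real implementation would apply this segment by segment down the path, listing only the directories that survived the previous segment's filter.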
Below are my first shots, probably a bit too naïve...
Using listFiles(...)
Code:
RemoteIterator<LocatedFileStatus> files =
        filesystem.listFiles(new Path("/schemas_folder"), true);
Pattern pattern = Pattern.compile("^.*/date=[0-9]{8}/A-schema\\.avsc$");
while (files.hasNext()) {
    Path path = files.next().getPath();
    if (pattern.matcher(path.toString()).matches())
    {
        System.out.println(path);
    }
}
Result:
This prints exactly what I would expect, but since it first lists everything recursively and then filters, the performance is really poor. With my current dataset, it takes almost 25 seconds...
Using listStatus(...)
Code:
FileStatus[] statuses = filesystem.listStatus(new Path("/schemas_folder"), new PathFilter()
{
    private final Pattern pattern = Pattern.compile("^date=[0-9]{8}$");

    @Override
    public boolean accept(Path path)
    {
        return pattern.matcher(path.getName()).matches();
    }
});

Path[] paths = new Path[statuses.length];
for (int i = 0; i < statuses.length; i++)
{
    paths[i] = statuses[i].getPath();
}

statuses = filesystem.listStatus(paths, new PathFilter()
{
    @Override
    public boolean accept(Path path)
    {
        return "A-schema.avsc".equals(path.getName());
    }
});

for (FileStatus status : statuses)
{
    System.out.println(status.getPath());
}
Result:
Thanks to the PathFilters and the use of arrays, it seems to perform faster (around 12 seconds). The code is more complex, though, and more difficult to adapt to different situations. Most importantly, the performance is still 3 to 4 times slower than the command-line version!
Question
What am I missing here? What is the fastest way to get the results I want?
Updates
2014.07.09 - 13:38
The proposed answer of Mukesh S is apparently the best possible API approach. In the example I gave above, the code ends up looking like this:
FileStatus[] statuses = filesystem.globStatus(new Path("/schemas_folder/date=*/A-schema.avsc"));
for (FileStatus status : statuses)
{
    System.out.println(status.getPath());
}
This is the best looking and best performing code I could come up with so far, but is still not performing as well as the shell version.
Accepted answer by Mukesh S
Instead of listStatus you can try Hadoop's globStatus. Hadoop provides two FileSystem methods for processing globs:
public FileStatus[] globStatus(Path pathPattern) throws IOException
public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException
An optional PathFilter can be specified to restrict the matches further.
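For instance, the two arguments can be combined; the sketch below is not from the original post and the extra restriction is purely illustrative, but it shows the shape of the PathFilter overload (it assumes an already initialized FileSystem instance named filesystem, as in the snippets above):

```java
FileStatus[] statuses = filesystem.globStatus(
        new Path("/schemas_folder/date=*/A-schema.avsc"),
        new PathFilter()
        {
            @Override
            public boolean accept(Path path)
            {
                // Illustrative restriction: skip hidden/temporary entries
                // such as _SUCCESS markers or .staging files.
                return !path.getName().startsWith("_")
                        && !path.getName().startsWith(".");
            }
        });
```

The glob does the coarse matching server-listing by server-listing, and the filter then gets a chance to veto each candidate path on the client.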
For more details, you can check Hadoop: The Definitive Guide.
Hope it helps!