Original URL: http://stackoverflow.com/questions/24647992/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Wildcard in Hadoop's FileSystem listing API calls
Asked by snooze92
tl;dr: To be able to use wildcards (globs) in the listed paths, one simply has to use globStatus(...) instead of listStatus(...).
Context
Files on my HDFS cluster are organized in partitions, with the date being the "root" partition. A simplified example of the file structure would look like this:
/schemas_folder
├── date=20140101
│   ├── A-schema.avsc
│   └── B-schema.avsc
├── date=20140102
│   ├── A-schema.avsc
│   ├── B-schema.avsc
│   └── C-schema.avsc
└── date=20140103
    ├── B-schema.avsc
    └── C-schema.avsc
In my case, the directory stores Avro schemas for different types of data (A, B and C in this example) at different dates. A schema might start existing, evolve, and stop existing... as time passes.
Goal
I need to be able to get all the schemas that exist for a given type, as quickly as possible. In the example where I would like to get all the schemas that exist for type A, I would like to do the following:
hdfs dfs -ls /schemas_folder/date=*/A-schema.avsc
That would give me
Found 1 items
-rw-r--r-- 3 user group 1234 2014-01-01 12:34 /schemas_folder/date=20140101/A-schema.avsc
Found 1 items
-rw-r--r-- 3 user group 2345 2014-01-02 23:45 /schemas_folder/date=20140102/A-schema.avsc
Problem
I don't want to be using the shell command, and cannot seem to find the equivalent of that command in the Java APIs. When I try to implement the looping myself, I get terrible performance. I want at least the performance of the command line (around 3 seconds in my case)...
What I found so far
One can notice that it prints Found 1 items twice, once before each result, rather than printing Found 2 items once at the beginning. That probably hints that wildcards are not implemented on the FileSystem side but are somehow handled by the client. I can't seem to find the right source code to look at to see how that is implemented.
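The twice-printed Found 1 items is indeed consistent with client-side expansion: globStatus(...) expands the pattern on the client (in recent Hadoop versions the logic lives in org.apache.hadoop.fs.Globber) and issues one plain listing per matched directory. As a rough illustration only (a simplified sketch, not Hadoop's actual code), client-side glob expansion boils down to translating each glob segment into a regex and filtering directory listings:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class GlobSketch {
    // Translate a simple glob (only '*' and '?' here) into an anchored regex.
    static Pattern globToRegex(String glob) {
        StringBuilder sb = new StringBuilder("^");
        for (char c : glob.toCharArray()) {
            switch (c) {
                case '*': sb.append("[^/]*"); break; // '*' does not cross path separators
                case '?': sb.append("[^/]");  break;
                default:  sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.append('$').toString());
    }

    // Expand one glob segment against a directory listing, the way a client
    // could after a plain (non-glob) listing RPC.
    static List<String> expand(List<String> children, String globSegment) {
        Pattern p = globToRegex(globSegment);
        return children.stream()
                .filter(name -> p.matcher(name).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> children = Arrays.asList("date=20140101", "date=20140102", "_SUCCESS");
        System.out.println(expand(children, "date=*")); // prints [date=20140101, date=20140102]
    }
}
```

A real implementation would apply this segment by segment down the path, listing only the directories that survived the previous segment's filter.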
Below are my first shots, probably a bit too naïve...
Using listFiles(...)
Code:
RemoteIterator<LocatedFileStatus> files =
        filesystem.listFiles(new Path("/schemas_folder"), true);
Pattern pattern = Pattern.compile("^.*/date=[0-9]{8}/A-schema\\.avsc$");
while (files.hasNext()) {
    Path path = files.next().getPath();
    if (pattern.matcher(path.toString()).matches())
    {
        System.out.println(path);
    }
}
Result:
This prints exactly what I would expect, but since it first lists everything recursively and then filters, the performance is really poor. With my current dataset, it takes almost 25 seconds...
Using listStatus(...)
Code:
FileStatus[] statuses = filesystem.listStatus(new Path("/schemas_folder"), new PathFilter()
{
    private final Pattern pattern = Pattern.compile("^date=[0-9]{8}$");

    @Override
    public boolean accept(Path path)
    {
        return pattern.matcher(path.getName()).matches();
    }
});

Path[] paths = new Path[statuses.length];
for (int i = 0; i < statuses.length; i++)
{
    paths[i] = statuses[i].getPath();
}

statuses = filesystem.listStatus(paths, new PathFilter()
{
    @Override
    public boolean accept(Path path)
    {
        return "A-schema.avsc".equals(path.getName());
    }
});

for (FileStatus status : statuses)
{
    System.out.println(status.getPath());
}
Result:
Thanks to the PathFilters and the use of arrays, it seems to perform faster (around 12 seconds). The code is more complex, though, and more difficult to adapt to different situations. Most importantly, the performance is still 3 to 4 times slower than the command-line version!
Question
What am I missing here? What is the fastest way to get the results I want?
Updates
2014.07.09 - 13:38
The proposed answer of Mukesh S is apparently the best possible API approach. In the example I gave above, the code ends up looking like this:
FileStatus[] statuses = filesystem.globStatus(new Path("/schemas_folder/date=*/A-schema.avsc"));
for (FileStatus status : statuses)
{
    System.out.println(status.getPath());
}
This is the best looking and best performing code I could come up with so far, but is still not performing as well as the shell version.
Accepted answer by Mukesh S
Instead of listStatus you can try Hadoop's globStatus. Hadoop provides two FileSystem methods for processing globs:
public FileStatus[] globStatus(Path pathPattern) throws IOException
public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException
An optional PathFilter can be specified to restrict the matches further.
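For instance, the two arguments can be combined; the sketch below is not from the original post and the extra restriction is purely illustrative, but it shows the shape of the PathFilter overload (it assumes an already initialized FileSystem instance named filesystem, as in the snippets above):

```java
FileStatus[] statuses = filesystem.globStatus(
        new Path("/schemas_folder/date=*/A-schema.avsc"),
        new PathFilter()
        {
            @Override
            public boolean accept(Path path)
            {
                // Illustrative restriction: skip hidden/temporary entries
                // such as _SUCCESS markers or .staging files.
                return !path.getName().startsWith("_")
                        && !path.getName().startsWith(".");
            }
        });
```

The glob does the coarse matching server-listing by server-listing, and the filter then gets a chance to veto each candidate path on the client.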
For more details, you can check Hadoop: The Definitive Guide.
Hope it helps!