Java Hadoop: Provide a directory as input to a MapReduce job
Disclaimer: this page is a Chinese–English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must do so under the same CC BY-SA license, link to the original question, and attribute it to the original authors (not me): StackOverflow
原文地址: http://stackoverflow.com/questions/20094366/
Asked by sgokhales

I'm using Cloudera Hadoop. I'm able to run a simple MapReduce program where I provide a file as input to the MapReduce program.
This file lists all the other files to be processed by the mapper function.
But I'm stuck at one point.
/folder1
- file1.txt
- file2.txt
- file3.txt
How can I specify the input path to the MapReduce program as "/folder1", so that it starts processing each file inside that directory?
Any ideas?
EDIT:
1) Initially, I provided inputFile.txt as input to the MapReduce program. It was working perfectly.
>inputFile.txt
file1.txt
file2.txt
file3.txt
2) But now, instead of giving an input file, I want to provide an input directory as arg[0] on the command line.
hadoop jar ABC.jar /folder1 /output
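For reference, FileInputFormat accepts a directory as an input path and will use every file at the top level of that directory as input, so a driver along these lines should work. This is a minimal sketch: the class and job names are illustrative, and the mapper/reducer setup from the existing job is assumed and omitted.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FolderInputDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "folder-input");
        job.setJarByClass(FolderInputDriver.class);
        // Set your Mapper/Reducer and output key/value classes here,
        // exactly as in the single-file version of the job.

        // args[0] may be a directory such as /folder1; every file at its
        // top level becomes part of the job's input.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Run it as in the question: `hadoop jar ABC.jar /folder1 /output`.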
Answered by zhutoulala
You could use FileSystem.listStatus to get the file list from the given directory; the code could be as below:
// get the FileSystem; you will need to initialize it properly
FileSystem fs = FileSystem.get(conf);
// get the FileStatus list from the given dir
FileStatus[] status_list = fs.listStatus(new Path(args[0]));
if (status_list != null) {
    for (FileStatus status : status_list) {
        // add each file to the list of inputs for the map-reduce job
        FileInputFormat.addInputPath(conf, status.getPath());
    }
}
Answered by shashaDenovo
The problem is that FileInputFormat doesn't read files recursively from the input path directory.
Solution: use the following code.
Add the line

FileInputFormat.setInputDirRecursive(job, true);

before the following line in your MapReduce code:

FileInputFormat.addInputPath(job, new Path(args[0]));
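Put together, a driver using the recursive option might look like this. This is a sketch assuming the newer org.apache.hadoop.mapreduce API; class and job names are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RecursiveInputDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "recursive-input");
        job.setJarByClass(RecursiveInputDriver.class);
        // Set your Mapper/Reducer classes here as usual.

        // Descend into subdirectories of the input path instead of
        // treating nested directories as an error.
        FileInputFormat.setInputDirRecursive(job, true);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```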
You can check here for the version in which it was fixed.
Answered by Dmitry
Answered by Ravindra babu
Use the MultipleInputs class.
MultipleInputs.addInputPath(Job job, Path path,
        Class<? extends InputFormat> inputFormatClass,
        Class<? extends Mapper> mapperClass)
Have a look at the working code.
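Since the linked working code is not reproduced here, the following is a hedged sketch of how the call might be wired up in a driver, using TextInputFormat and the identity Mapper for illustration; each input path (a file or a directory) can get its own InputFormat and Mapper.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultipleInputsDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "multiple-inputs");
        job.setJarByClass(MultipleInputsDriver.class);
        // Bind a (possibly different) InputFormat and Mapper to each path;
        // the identity Mapper is used here purely for illustration.
        MultipleInputs.addInputPath(job, new Path("/folder1/file1.txt"),
                TextInputFormat.class, Mapper.class);
        MultipleInputs.addInputPath(job, new Path("/folder1/file2.txt"),
                TextInputFormat.class, Mapper.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```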