Java Hadoop: Provide a directory as input to a MapReduce job
Disclaimer: this page is a Chinese–English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must do so under the same CC BY-SA license, link to the original question, and attribute it to the original authors (not me): StackOverflow
原文地址: http://stackoverflow.com/questions/20094366/
Asked by sgokhales

I'm using Cloudera Hadoop. I'm able to run a simple MapReduce program where I provide a file as input to the MapReduce program.
This file lists all the other files to be processed by the mapper function.
But I'm stuck at one point.
/folder1
- file1.txt
- file2.txt
- file3.txt
How can I specify the input path to the MapReduce program as "/folder1", so that it starts processing each file inside that directory?
Any ideas?
EDIT:
1) Initially, I provided inputFile.txt as input to the MapReduce program. It was working perfectly.
>inputFile.txt
file1.txt
file2.txt
file3.txt
2) But now, instead of giving an input file, I want to provide an input directory as arg[0] on the command line.
hadoop jar ABC.jar /folder1 /output
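For reference, FileInputFormat accepts a directory as an input path and will use every file at the top level of that directory as input, so a driver along these lines should work. This is a minimal sketch: the class and job names are illustrative, and the mapper/reducer setup from the existing job is assumed and omitted.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FolderInputDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "folder-input");
        job.setJarByClass(FolderInputDriver.class);
        // Set your Mapper/Reducer and output key/value classes here,
        // exactly as in the single-file version of the job.

        // args[0] may be a directory such as /folder1; every file at its
        // top level becomes part of the job's input.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Run it as in the question: `hadoop jar ABC.jar /folder1 /output`.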
Answered by zhutoulala
You could use FileSystem.listStatus to get the file list from the given directory; the code could be as below:
// get the FileSystem; you will need to initialize it properly
FileSystem fs = FileSystem.get(conf);
// get the FileStatus list from the given dir
FileStatus[] status_list = fs.listStatus(new Path(args[0]));
if (status_list != null) {
    for (FileStatus status : status_list) {
        // add each file to the list of inputs for the map-reduce job
        FileInputFormat.addInputPath(conf, status.getPath());
    }
}
Answered by shashaDenovo
The problem is that FileInputFormat doesn't read files recursively from the input path directory.
Solution: use the following code.
Add the line

FileInputFormat.setInputDirRecursive(job, true);

before the following line in your MapReduce code:

FileInputFormat.addInputPath(job, new Path(args[0]));
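Put together, a driver using the recursive option might look like this. This is a sketch assuming the newer org.apache.hadoop.mapreduce API; class and job names are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RecursiveInputDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "recursive-input");
        job.setJarByClass(RecursiveInputDriver.class);
        // Set your Mapper/Reducer classes here as usual.

        // Descend into subdirectories of the input path instead of
        // treating nested directories as an error.
        FileInputFormat.setInputDirRecursive(job, true);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```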
You can check here for the version in which it was fixed.
Answered by Dmitry
Answered by Ravindra babu
Use the MultipleInputs class.
MultipleInputs.addInputPath(Job job, Path path,
        Class<? extends InputFormat> inputFormatClass,
        Class<? extends Mapper> mapperClass)
Have a look at the working code.
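Since the linked working code is not reproduced here, the following is a hedged sketch of how the call might be wired up in a driver, using TextInputFormat and the identity Mapper for illustration; each input path (a file or a directory) can get its own InputFormat and Mapper.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultipleInputsDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "multiple-inputs");
        job.setJarByClass(MultipleInputsDriver.class);
        // Bind a (possibly different) InputFormat and Mapper to each path;
        // the identity Mapper is used here purely for illustration.
        MultipleInputs.addInputPath(job, new Path("/folder1/file1.txt"),
                TextInputFormat.class, Mapper.class);
        MultipleInputs.addInputPath(job, new Path("/folder1/file2.txt"),
                TextInputFormat.class, Mapper.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```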