
Disclaimer: this page is a Chinese/English translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must distribute it under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/12911798/

Date: 2020-10-31 10:45:17  Source: igfitidea

Hadoop: How can i merge reducer outputs to a single file?

java, hadoop, merge, mapreduce, hdfs

Asked by thomaslee

I know that the "getmerge" command in the shell can do this work.

But what should I do if I want to merge these outputs after the job via the HDFS API for Java?

What I actually want is a single merged file on HDFS.

The only thing I can think of is to start an additional job after that.

Thanks!

Accepted answer by VoiceOfUnreason

> But what should I do if I want to merge these outputs after the job via the HDFS API for Java?

Guessing, because I haven't tried this myself, but I think the method you are looking for is FileUtil.copyMerge, which is the method that FsShell invokes when you run the -getmerge command. FileUtil.copyMerge takes two FileSystem objects as arguments - FsShell uses FileSystem.getLocal to retrieve the destination FileSystem, but I don't see any reason you couldn't instead use Path.getFileSystem on the destination to obtain an OutputStream.

That said, I don't think it wins you very much -- the merge still happens in the local JVM, so you aren't really saving much over -getmerge followed by -put.
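A minimal sketch of that approach might look like the following. This assumes Hadoop 2.x, where FileUtil.copyMerge still exists (it was removed in Hadoop 3); the paths are placeholders for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeOutputs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Directory containing the part-r-* files, and the desired merged file.
        // Both paths here are hypothetical examples.
        Path srcDir = new Path("/user/hadoop/job-output");
        Path dstFile = new Path("/user/hadoop/merged.txt");

        // Resolve each Path against its own FileSystem, so both source and
        // destination can live on HDFS (unlike -getmerge, whose destination
        // is always the local filesystem).
        FileSystem srcFs = srcDir.getFileSystem(conf);
        FileSystem dstFs = dstFile.getFileSystem(conf);

        // copyMerge concatenates every file under srcDir into dstFile.
        // deleteSource=false keeps the originals; addString=null inserts
        // no separator between the concatenated files.
        FileUtil.copyMerge(srcFs, srcDir, dstFs, dstFile, false, conf, null);
    }
}
```

Note that the bytes still stream through the client JVM, as the answer points out, so this is a convenience rather than a distributed merge.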

Answered by saurabh shashank

You can get a single output file by setting a single reducer in your code:

job.setNumReduceTasks(1);

This will work for your requirement, but it is costly.
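In context, that call belongs in the job driver. A sketch, where the class names and input/output paths are illustrative and the mapper/reducer classes are assumed to come from your own job:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SingleReducerDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "single-output-job");
        job.setJarByClass(SingleReducerDriver.class);

        // Your own mapper/reducer classes would be set here, e.g.:
        // job.setMapperClass(MyMapper.class);
        // job.setReducerClass(MyReducer.class);

        // One reduce task => exactly one output file (part-r-00000).
        job.setNumReduceTasks(1);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

This is costly because all reduce work funnels through a single task, losing the parallelism that multiple reducers would give you.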



OR




org.apache.hadoop.util.Shell.execCommand(String[])

Static method to execute a shell command.
Covers most of the simple cases without requiring the user to implement the Shell interface.

Parameters:
    env - the map of environment key=value
    cmd - shell command to execute.
Returns:
    the output of the executed command.
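Using that utility, you could shell out to getmerge from Java. A sketch, assuming the hadoop binary is on the PATH; the paths are illustrative:

```java
import org.apache.hadoop.util.Shell;

public class GetMergeViaShell {
    public static void main(String[] args) throws Exception {
        // Equivalent to running:
        //   hadoop fs -getmerge /user/hadoop/job-output /tmp/merged.txt
        String output = Shell.execCommand(
                "hadoop", "fs", "-getmerge",
                "/user/hadoop/job-output", "/tmp/merged.txt");
        System.out.println(output);
    }
}
```

Keep in mind that -getmerge writes the merged file to the local filesystem, so you would still need a -put afterwards if the merged result must end up on HDFS.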