java Hadoop: How can I merge reducer outputs to a single file?
Disclaimer: this page is an English-Chinese translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must follow the same CC BY-SA license, link to the original, and attribute it to the original authors (not me): StackOverflow
原文地址: http://stackoverflow.com/questions/12911798/
Hadoop: How can I merge reducer outputs to a single file?
Asked by thomaslee
I know that the "getmerge" command in the shell can do this work.
But what should I do if I want to merge these outputs after the job via the HDFS API for Java?
What I actually want is a single merged file on HDFS.
The only thing I can think of is to start an additional job after that.
Thanks!
Accepted answer by VoiceOfUnreason
But what should I do if I want to merge these outputs after the job by HDFS API for java?
Guessing, because I haven't tried this myself, but I think the method you are looking for is FileUtil.copyMerge, which is the method that FsShell invokes when you run the -getmerge command. FileUtil.copyMerge takes two FileSystem objects as arguments - FsShell uses FileSystem.getLocal to retrieve the destination FileSystem, but I don't see any reason you couldn't instead use Path.getFileSystem on the destination to obtain an OutputStream.
That said, I don't think it wins you very much -- the merge is still happening in the local JVM; so you aren't really saving very much over -getmerge followed by -put.
Answered by saurabh shashank
You get a single output file by setting a single reducer in your code.
job.setNumReduceTasks(1);
This will meet your requirement, but it is costly: all map output is funneled through one reduce task.
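For context, a hypothetical driver that forces a single reducer might look like the following (the class name and argument handling are made up for illustration):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SingleReducerDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "single-reducer-job");
        job.setJarByClass(SingleReducerDriver.class);

        // One reducer means exactly one part-r-00000 output file,
        // at the cost of serializing the whole reduce phase.
        job.setNumReduceTasks(1);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```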
OR
org.apache.hadoop.util.Shell.execCommand(String[]) - static method to execute a shell command. Covers most of the simple cases without requiring the user to implement the Shell interface.
Parameters:
  env - the map of environment key=value
  cmd - shell command to execute.
Returns:
  the output of the executed command.
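As a sketch of this second option (the paths are illustrative, and it assumes the hadoop binary is on the PATH of the machine running the code), getmerge could be invoked from Java like this:

```java
import org.apache.hadoop.util.Shell;

public class GetMergeViaShell {
    public static void main(String[] args) throws Exception {
        // Run "hadoop fs -getmerge <hdfs-dir> <local-file>" as an
        // external process via Hadoop's Shell utility.
        String output = Shell.execCommand(
                "hadoop", "fs", "-getmerge",
                "/user/hadoop/job-output", "/tmp/merged-output.txt");
        System.out.println(output);
    }
}
```

Note that getmerge writes the merged file to the local filesystem, so getting it back onto HDFS still requires a follow-up -put, as the accepted answer points out.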