Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/14555313/
Renaming Part Files in Hadoop Map Reduce
Asked by Arun A K
I have tried to use the MultipleOutputs class as per the example on the page http://hadoop.apache.org/docs/mapreduce/r0.21.0/api/index.html?org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html
Driver Code
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

Configuration conf = new Configuration();
Job job = new Job(conf, "Wordcount");
job.setJarByClass(WordCount.class);
job.setInputFormatClass(TextInputFormat.class);
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class,
        Text.class, IntWritable.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
Reducer Code
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class WordCountReducer extends
        Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();
    private MultipleOutputs<Text, IntWritable> mos;

    @Override
    public void setup(Context context) {
        mos = new MultipleOutputs<Text, IntWritable>(context);
    }

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        // context.write(key, result);
        mos.write("text", key, result);
    }

    @Override
    public void cleanup(Context context) {
        try {
            mos.close();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}
The output of the reducer does get renamed to text-r-00000.
But the issue here is that I am also getting an empty part-r-00000 file. Is this how MultipleOutputs is expected to behave, or is there some problem with my code? Please advise.
Another alternative I have tried is to iterate through my output folder using the FileSystem class and manually rename all files beginning with part.
What is the best way?
FileSystem hdfs = FileSystem.get(configuration);
FileStatus[] fs = hdfs.listStatus(new Path(outputPath));
for (FileStatus aFile : fs) {
    if (aFile.isDir()) {
        // delete all directories and sub-directories (if any) in the output directory
        hdfs.delete(aFile.getPath(), true);
    } else {
        if (aFile.getPath().getName().contains("_")) {
            // delete all log files and the _SUCCESS file in the output directory
            hdfs.delete(aFile.getPath(), true);
        } else {
            // myCustomName: the desired file name, defined elsewhere
            hdfs.rename(aFile.getPath(), new Path(myCustomName));
        }
    }
}
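The same cleanup-and-rename logic can be prototyped against a local directory with plain java.nio.file, which makes it easy to test without a cluster. This is a sketch, not Hadoop code: the class and method names here are made up, and FileSystem/FileStatus are replaced by their NIO counterparts.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class RenameSketch {

    // Same logic as the HDFS loop above, applied to a local directory:
    // delete subdirectories, delete files whose names contain "_"
    // (log files, _SUCCESS), and rename whatever remains.
    public static void cleanAndRename(Path outputDir, String customName) throws IOException {
        // Collect the listing first so deletes/moves don't disturb the iteration.
        List<Path> entries;
        try (Stream<Path> s = Files.list(outputDir)) {
            entries = s.collect(Collectors.toList());
        }
        for (Path entry : entries) {
            if (Files.isDirectory(entry)) {
                deleteRecursively(entry);               // directories and sub-directories
            } else if (entry.getFileName().toString().contains("_")) {
                Files.delete(entry);                    // log files and _SUCCESS
            } else {
                Files.move(entry, outputDir.resolve(customName),
                        StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }

    // NIO has no recursive delete, so walk the tree bottom-up.
    private static void deleteRecursively(Path dir) throws IOException {
        try (Stream<Path> walk = Files.walk(dir)) {
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate a job output directory.
        Path out = Files.createTempDirectory("output");
        Files.createFile(out.resolve("part-r-00000"));
        Files.createFile(out.resolve("_SUCCESS"));
        Files.createDirectory(out.resolve("_logs"));
        cleanAndRename(out, "wordcount.txt");
        System.out.println(Files.exists(out.resolve("wordcount.txt"))); // true
    }
}
```

The listing is materialized into a list before mutating the directory, since deleting entries while a directory stream is open is filesystem-dependent.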
Accepted answer by Charles Menguy
Even if you are using MultipleOutputs, the default OutputFormat (I believe it is TextOutputFormat) is still being used, and so it will initialize and create the part-r-xxxxx files that you are seeing.
The fact that they are empty is because you are not doing any context.write, since you are using MultipleOutputs. But that doesn't prevent them from being created during initialization.
To get rid of them, you need to define your OutputFormat to say that you are not expecting any output. You can do it this way:
job.setOutputFormatClass(NullOutputFormat.class);
With that set, this should ensure that your part files are never initialized at all, but you still get your output in the MultipleOutputs.
You could also probably use LazyOutputFormat, which ensures that your output files are only created when there is actually some data, instead of initializing empty files. You could do it this way:
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
Note that in your Reducer you are using the prototype MultipleOutputs.write(String namedOutput, K key, V value), which just uses a default output path generated from your namedOutput, something like: {namedOutput}-(m|r)-{part-number}. If you want more control over your output filenames, you should use the prototype MultipleOutputs.write(String namedOutput, K key, V value, String baseOutputPath), which allows you to generate filenames at runtime based on your keys/values.
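The default naming pattern {namedOutput}-(m|r)-{part-number} can be illustrated with plain string formatting. This is a standalone sketch of the convention only; the helper below is made up and is not part of the Hadoop API.

```java
public class PartNaming {

    // Builds a file name following the default MultipleOutputs pattern
    // {namedOutput}-(m|r)-{part-number}, where the phase is 'm' for a
    // map-side output and 'r' for a reduce-side output, and the part
    // number is zero-padded to five digits.
    public static String defaultName(String namedOutput, char phase, int partNumber) {
        return String.format("%s-%c-%05d", namedOutput, phase, partNumber);
    }

    public static void main(String[] args) {
        // The named output "text" from reducer partition 0:
        System.out.println(defaultName("text", 'r', 0)); // text-r-00000
    }
}
```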
Answered by RHolland
This is all you need to do in the Driver class to change the basename of the output file:
job.getConfiguration().set("mapreduce.output.basename", "text");
So this will result in your files being called "text-r-00000".