Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/14555313/
Renaming Part Files in Hadoop Map Reduce
Asked by Arun A K
I have tried to use the MultipleOutputs class as per the example on the page http://hadoop.apache.org/docs/mapreduce/r0.21.0/api/index.html?org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html
Driver Code
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

Configuration conf = new Configuration();
Job job = new Job(conf, "Wordcount");
job.setJarByClass(WordCount.class);
job.setInputFormatClass(TextInputFormat.class);
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class,
        Text.class, IntWritable.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
Reducer Code
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class WordCountReducer extends
        Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();
    private MultipleOutputs<Text, IntWritable> mos;

    @Override
    public void setup(Context context) {
        mos = new MultipleOutputs<Text, IntWritable>(context);
    }

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        // context.write(key, result);
        mos.write("text", key, result);
    }

    @Override
    public void cleanup(Context context) {
        try {
            mos.close();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}
The output of the reducer does get renamed to text-r-00000.
But the issue here is that I am also getting an empty part-r-00000 file. Is this how MultipleOutputs is expected to behave, or is there some problem with my code? Please advise.
Another alternative I have tried is to iterate through my output folder using the FileSystem class and manually rename all files beginning with part.
What is the best way?
FileSystem hdfs = FileSystem.get(configuration);
FileStatus[] fs = hdfs.listStatus(new Path(outputPath));
for (FileStatus aFile : fs) {
    if (aFile.isDir()) {
        // delete all directories and sub-directories (if any) in the output directory
        hdfs.delete(aFile.getPath(), true);
    } else {
        if (aFile.getPath().getName().contains("_")) {
            // delete all log files and the _SUCCESS file in the output directory
            hdfs.delete(aFile.getPath(), true);
        } else {
            // myCustomName: the desired file name, defined elsewhere
            hdfs.rename(aFile.getPath(), new Path(myCustomName));
        }
    }
}
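The same cleanup-and-rename logic can be prototyped against a local directory with plain java.nio.file, which makes it easy to test without a cluster. This is a sketch, not Hadoop code: the class and method names here are made up, and FileSystem/FileStatus are replaced by their NIO counterparts.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class RenameSketch {

    // Same logic as the HDFS loop above, applied to a local directory:
    // delete subdirectories, delete files whose names contain "_"
    // (log files, _SUCCESS), and rename whatever remains.
    public static void cleanAndRename(Path outputDir, String customName) throws IOException {
        // Collect the listing first so deletes/moves don't disturb the iteration.
        List<Path> entries;
        try (Stream<Path> s = Files.list(outputDir)) {
            entries = s.collect(Collectors.toList());
        }
        for (Path entry : entries) {
            if (Files.isDirectory(entry)) {
                deleteRecursively(entry);               // directories and sub-directories
            } else if (entry.getFileName().toString().contains("_")) {
                Files.delete(entry);                    // log files and _SUCCESS
            } else {
                Files.move(entry, outputDir.resolve(customName),
                        StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }

    // NIO has no recursive delete, so walk the tree bottom-up.
    private static void deleteRecursively(Path dir) throws IOException {
        try (Stream<Path> walk = Files.walk(dir)) {
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate a job output directory.
        Path out = Files.createTempDirectory("output");
        Files.createFile(out.resolve("part-r-00000"));
        Files.createFile(out.resolve("_SUCCESS"));
        Files.createDirectory(out.resolve("_logs"));
        cleanAndRename(out, "wordcount.txt");
        System.out.println(Files.exists(out.resolve("wordcount.txt"))); // true
    }
}
```

The listing is materialized into a list before mutating the directory, since deleting entries while a directory stream is open is filesystem-dependent.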
Accepted answer by Charles Menguy
Even if you are using MultipleOutputs, the default OutputFormat (I believe it is TextOutputFormat) is still being used, and so it will initialize and create the part-r-xxxxx files that you are seeing.
The fact that they are empty is because you are not doing any context.write, since you are using MultipleOutputs. But that doesn't prevent them from being created during initialization.
To get rid of them, you need to define your OutputFormat to say that you are not expecting any output. You can do it this way:
job.setOutputFormatClass(NullOutputFormat.class);
With that set, this should ensure that your part files are never initialized at all, but you still get your output in the MultipleOutputs.
You could also probably use LazyOutputFormat, which ensures that your output files are only created when there is actually some data, instead of initializing empty files. You could do it this way:
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
Note that in your Reducer you are using the prototype MultipleOutputs.write(String namedOutput, K key, V value), which just uses a default output path generated from your namedOutput, something like: {namedOutput}-(m|r)-{part-number}. If you want more control over your output filenames, you should use the prototype MultipleOutputs.write(String namedOutput, K key, V value, String baseOutputPath), which allows you to generate filenames at runtime based on your keys/values.
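The default naming pattern {namedOutput}-(m|r)-{part-number} can be illustrated with plain string formatting. This is a standalone sketch of the convention only; the helper below is made up and is not part of the Hadoop API.

```java
public class PartNaming {

    // Builds a file name following the default MultipleOutputs pattern
    // {namedOutput}-(m|r)-{part-number}, where the phase is 'm' for a
    // map-side output and 'r' for a reduce-side output, and the part
    // number is zero-padded to five digits.
    public static String defaultName(String namedOutput, char phase, int partNumber) {
        return String.format("%s-%c-%05d", namedOutput, phase, partNumber);
    }

    public static void main(String[] args) {
        // The named output "text" from reducer partition 0:
        System.out.println(defaultName("text", 'r', 0)); // text-r-00000
    }
}
```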
Answered by RHolland
This is all you need to do in the Driver class to change the basename of the output file:
job.getConfiguration().set("mapreduce.output.basename", "text");
So this will result in your files being called "text-r-00000".