Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same license, link the original, and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/25432598/
What is the Mapper of Reducer setup() used for?
Asked by Don E
What exactly are the setup and cleanup methods used for? I have tried to find out what they mean, but no one has yet described exactly what they do. For instance, how does the setup method use the data from the input split? Does it take it as a whole, or line by line?
Answered by Jane Wayne
As already mentioned, setup() and cleanup() are methods you can override, if you choose, and they are there for you to initialize and clean up your map/reduce tasks. You actually don't have access to any data from the input split directly during these phases. The lifecycle of a map/reduce task is (from a programmer's point of view):
setup -> map -> cleanup

setup -> reduce -> cleanup
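The framework's run loop enforces this ordering for each task. As a Hadoop-free toy sketch (class and method names here are mine, not the real Hadoop API), each task conceptually does the following:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of a task's lifecycle: setup -> map (per record) -> cleanup.
// This mimics the shape of the framework's run() loop; it is not the real API.
public class Lifecycle {
    static final List<String> calls = new ArrayList<>();

    static void setup()            { calls.add("setup"); }
    static void map(String record) { calls.add("map:" + record); }
    static void cleanup()          { calls.add("cleanup"); }

    static void run(List<String> split) {
        setup();                       // once, before the first record
        for (String record : split) {  // once per record in the input split
            map(record);
        }
        cleanup();                     // once, after the last record
    }

    public static void main(String[] args) {
        run(Arrays.asList("a", "b"));
        System.out.println(calls);     // [setup, map:a, map:b, cleanup]
    }
}
```

Note that setup() and cleanup() bracket the whole task, not each record: map() runs once per input record, but the two hooks fire exactly once per task.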
What typically happens during setup() is that you may read parameters from the configuration object to customize your processing logic.
What typically happens during cleanup() is that you clean up any resources you may have allocated. There is another common use too: flushing out any accumulated aggregate results.
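That flushing pattern is often called in-mapper combining: accumulate partial counts in state initialized by setup(), emit nothing during map(), and write everything out once in cleanup(). A Hadoop-free sketch of the idea (the `emitted` map below stands in for `context.write`; all names are mine):

```java
import java.util.HashMap;
import java.util.Map;

// Toy in-mapper combining: buffer partial counts, flush them once in cleanup().
public class InMapperCombiner {
    private Map<String, Integer> counts;                          // per-task aggregate state
    private final Map<String, Integer> emitted = new HashMap<>(); // stands in for context.write

    void setup()          { counts = new HashMap<>(); }            // allocate state once per task
    void map(String word) { counts.merge(word, 1, Integer::sum); } // accumulate, emit nothing yet
    void cleanup()        { emitted.putAll(counts); }              // flush aggregates exactly once

    Map<String, Integer> runTask(String... words) {
        setup();
        for (String w : words) {
            map(w);
        }
        cleanup();
        return emitted;
    }

    public static void main(String[] args) {
        // three input records collapse into two emitted pairs, e.g. {a=2, b=1}
        System.out.println(new InMapperCombiner().runTask("a", "b", "a"));
    }
}
```

The benefit in a real job is fewer key-value pairs written out of the map task, and therefore less data shuffled to the reducers.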
The setup() and cleanup() methods are simply "hooks" for you, the developer/programmer, to have a chance to do something before and after your map/reduce tasks.
For example, in the canonical word count example, let's say you want to exclude certain words from being counted (e.g. stop words such as "the", "a", "be", etc.). When you configure your MapReduce Job, you can pass a comma-delimited list of these words as a parameter (key-value pair) into the configuration object. Then in your map code, during setup(), you can acquire the stop words, store them in a variable global to the map task, and skip counting those words in your map logic. Here is a modified example from http://wiki.apache.org/hadoop/WordCount.
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    private Set<String> stopWords;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
      Configuration conf = context.getConfiguration();
      stopWords = new HashSet<String>();
      // trim() so that entries like " a" (note the space after each comma) still match tokens
      for (String stopWord : conf.get("stop.words").split(",")) {
        stopWords.add(stopWord.trim());
      }
    }

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        String token = tokenizer.nextToken();
        if (stopWords.contains(token)) {
          continue;
        }
        // reuse the token already read; calling nextToken() again here would skip every other word
        word.set(token);
        context.write(word, one);
      }
    }
  }
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("stop.words", "the, a, an, be, but, can");

    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
Answered by pasha701
setup: Called once at the beginning of the task. You can put custom initialization here.

cleanup: Called once at the end of the task. You can put resource releasing here.
Answered by Y.Prithvi
setup and cleanup are called once for each task.

For example, if you have 5 mappers running and you want to initialize some values for each mapper, you can use setup; your setup method is then called 5 times.

So, for each map/reduce task, the setup() method is called first, then the map()/reduce() method, and finally the cleanup() method is called before the task exits.
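The once-per-task behavior can be illustrated with a small Hadoop-free toy (all names below are mine): five independent "tasks" each run their own setup/cleanup pair, so each hook fires five times in total.

```java
// Toy: each "task" runs its own setup/cleanup pair, so 5 tasks => 5 calls of each hook.
public class SetupCount {
    static int setupCalls = 0;
    static int cleanupCalls = 0;

    static void runTask() {
        setupCalls++;      // setup: once per task, before any map() call
        // ... map() would run here, once per input record ...
        cleanupCalls++;    // cleanup: once per task, after the last map() call
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            runTask();     // one "mapper task"
        }
        System.out.println(setupCalls + " " + cleanupCalls); // 5 5
    }
}
```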