Original source: http://stackoverflow.com/questions/26208454/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Hadoop Word count: receive the total number of words that start with the letter "c"
Asked by King11
Here is the Hadoop word count Java map and reduce source code:
In the map function, I can already output every word that starts with the letter "c" along with the number of times that word appears. What I actually want is to output only the total number of words starting with the letter "c", and I'm stuck on how to get that total. Any help would be greatly appreciated. Thank you.
Example
The output I'm getting:
could 2
can 3
cat 5
What I'm trying to get:
c-total 10
// Old API: MapReduceBase, Mapper, Reducer, OutputCollector and Reporter come from org.apache.hadoop.mapred
public static class MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            // emit (word, 1) for every token that starts with "c"
            if (word.toString().startsWith("c")) {
                output.collect(word, one);
            }
        }
    }
}

public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get(); // add up the counts emitted for this key
        }
        output.collect(key, new IntWritable(sum)); // output the key and its total
    }
}
Answered by Chris Gerken
Instead of
    output.collect(word, one);
in your mapper, try:
    output.collect(new Text("c-total"), one);
Answered by Unmesha SreeVeni
Chris Gerken's answer is right.
If you output the word itself as the key, the reducer only gives you a count for each distinct word starting with "c", not the overall total of words starting with "c".
To get the total, emit one constant key from the mapper for every matching word, so that all the counts are summed under that single key:
    while (itr.hasMoreTokens()) {
        String token = itr.nextToken();
        if (token.startsWith("c")) {
            word.set("C_Count");
            output.collect(word, one);
        }
    }
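To see why the choice of key matters, here is a minimal plain-Java sketch that imitates the map, group-by-key, and sum steps without using any Hadoop classes. The class name CTotalDemo, both method names, and the sample tokens are made up for illustration; only the "C_Count" key and the startsWith("c") test come from the snippet above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: a TreeMap stands in for Hadoop's group-by-key and reduce steps.
class CTotalDemo {

    // Word used as the key: counts end up grouped per distinct word.
    static Map<String, Integer> countPerWord(List<String> tokens) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : tokens) {
            if (token.startsWith("c")) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // One constant key for every match: all counts collapse into a single total.
    static Map<String, Integer> countUnderOneKey(List<String> tokens, String key) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : tokens) {
            if (token.startsWith("c")) {
                counts.merge(key, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical input: "could" twice, "can" three times, "cat" five times.
        List<String> tokens = Arrays.asList(
            "could", "can", "cat", "dog", "can", "cat",
            "could", "cat", "can", "cat", "the", "cat");
        System.out.println(countPerWord(tokens));                // {can=3, cat=5, could=2}
        System.out.println(countUnderOneKey(tokens, "C_Count")); // {C_Count=10}
    }
}
```

The first map mirrors the output from the question (one line per word); the second mirrors the desired single total line.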
Here is an example using the new API (org.apache.hadoop.mapreduce):
Driver class
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hadoop 2+ factory method; older releases use new Job(conf, "wordcount")
        Job job = Job.getInstance(conf, "wordcount");
        FileSystem fs = FileSystem.get(conf);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // args[0] is the input path, args[1] is the output path;
        // remove an existing output directory so the job can write to it
        if (fs.exists(new Path(args[1]))) {
            fs.delete(new Path(args[1]), true);
        }
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setJarByClass(WordCount.class);
        job.waitForCompletion(true);
    }
}
Mapper class
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            String token = itr.nextToken();
            // every matching token is written under the same constant key
            if (token.startsWith("c")) {
                word.set("C_Count");
                context.write(word, one);
            }
        }
    }
}
Reducer class
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get(); // add up the 1s emitted under this key
        }
        context.write(key, new IntWritable(sum));
    }
}
Answered by Dravidian
Simpler code for mapper:
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> op, Reporter r) throws IOException {
        String s = value.toString();
        // split on runs of non-word characters; the backslash must be escaped in a Java string
        for (String w : s.split("\\W+")) {
            if (w.length() > 0) {
                // lowercase "c", as asked in the question
                if (w.startsWith("c")) {
                    op.collect(new Text("C-Count"), new IntWritable(1));
                }
            }
        }
    }
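Note that this mapper tokenizes with split("\\W+") while the earlier answers use StringTokenizer, and the two can give different totals when the text contains punctuation. The sketch below is plain Java with no Hadoop dependency; the class name TokenizerDemo, the method names, and the sample line are invented for illustration.

```java
import java.util.StringTokenizer;

// Illustration only: compares the two tokenizing approaches used in the answers.
class TokenizerDemo {

    // StringTokenizer with default delimiters splits on whitespace only,
    // so punctuation stays attached to the token.
    static int countWithStringTokenizer(String line) {
        int count = 0;
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            if (itr.nextToken().startsWith("c")) {
                count++;
            }
        }
        return count;
    }

    // split("\\W+") splits on runs of non-word characters, so punctuation is dropped.
    // A line that begins with punctuation yields an empty first element, which is skipped.
    static int countWithSplit(String line) {
        int count = 0;
        for (String w : line.split("\\W+")) {
            if (w.length() > 0 && w.startsWith("c")) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String line = "(can) cat, could"; // hypothetical input line
        System.out.println(countWithStringTokenizer(line)); // 2: "(can)" starts with "("
        System.out.println(countWithSplit(line));           // 3: can, cat, could
    }
}
```

Which behavior is correct depends on whether tokens such as "(can)" should count as words starting with "c"; the thread does not say, so treat the choice as an assumption about the input.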