How to Run a Hadoop MapReduce Program on Ubuntu 16.04
In this tutorial, I will show you how to run a MapReduce program. MapReduce is one of the core components of Apache Hadoop: it is Hadoop's processing layer. Before showing you how to run a MapReduce program, let me briefly explain what MapReduce is.
MapReduce is a system for processing large data sets in parallel. It reduces the data down to results, producing a summary of the input. A MapReduce program has two parts: a mapper and a reducer. The reducer starts only after the mapper has finished its work.
Mapper: maps input key/value pairs to a set of intermediate key/value pairs.
Reducer: reduces the set of intermediate values that share a key to a smaller set of values.
In the WordCount MapReduce program, we provide any text file as input. When the MapReduce program runs, it goes through the following phases:
Splitting: each line of the input file is split into words.
Mapping: key/value pairs are formed, where a word is the key and 1 is the value assigned to that key.
Shuffling: key/value pairs with the same key are grouped together.
Reducing: the values of similar keys are added together.
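To make the four phases concrete, here is a small plain-Java sketch (no Hadoop dependencies; purely illustrative, with the class name `WordCountPhases` chosen for this example) that runs the same split → map → shuffle → reduce pipeline in memory on the sample input used later in this tutorial:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountPhases {
    public static Map<String, Integer> wordCount(List<String> lines) {
        // Splitting + Mapping: each line is split into words,
        // and every word becomes a (word, 1) pair.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(new SimpleEntry<>(word, 1));
                }
            }
        }
        // Shuffling: pairs sharing the same key are grouped together.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>())
                   .add(p.getValue());
        }
        // Reducing: the values for each key are summed.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = List.of(
            "This is my first mapreduce test",
            "This is wordcount program");
        wordCount(input).forEach((w, c) -> System.out.println(w + "\t" + c));
    }
}
```

In a real Hadoop job the splitting, shuffling, and sorting are handled by the framework; only the map and reduce logic is written by the programmer.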
Running the MapReduce Program
MapReduce programs are written in Java, and most developers use the Eclipse IDE to develop them. So in this tutorial, I will show you how to export a MapReduce program from the Eclipse IDE into a JAR file and run it on a Hadoop cluster.
My MapReduce program is in my Eclipse IDE. To run it on the Hadoop cluster, we will export the project as a JAR file.
In the Eclipse IDE, select the File menu and click Export.
Under the Java options, select JAR file and click Next.
Select the WordCount project and provide a path and name for the JAR file; I kept it as wordcount.jar. Click Next twice.
Now click Browse, select the main class, and finally click Finish to create the JAR file.
If you get any of the following warnings, just click OK.
Check that the Hadoop cluster is up and running.
Command: jps
hadoop@hadoop-VirtualBox:~$ jps
3008 NodeManager
3924 Jps
2885 ResourceManager
2505 DataNode
3082 JobHistoryServer
2716 SecondaryNameNode
2383 NameNode
hadoop@hadoop-VirtualBox:~$
We will put the input file for the WordCount program into HDFS.
hadoop@hadoop-VirtualBox:~$ hdfs dfs -put input /
hadoop@hadoop-VirtualBox:~$ hdfs dfs -cat /input
This is my first mapreduce test
This is wordcount program
hadoop@hadoop-VirtualBox:~$
Now run the wordcount.jar file using the following command.
Note: since we selected the main class when exporting wordcount.jar, there is no need to mention the main class in the command.
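The reason the main class can be left out of the command is that Eclipse records the selected class in the JAR's manifest (META-INF/MANIFEST.MF), and `hadoop jar` reads it from there. The entry looks roughly like this (the class name `WordCount` is an assumption based on the project name):

```
Manifest-Version: 1.0
Main-Class: WordCount
```

If the manifest had no Main-Class entry, we would instead have to name the class on the command line, after the JAR file.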
Command: hadoop jar wordcount.jar /input /output
hadoop@hadoop-VirtualBox:~$ hadoop jar wordcount.jar /input /output
16/11/27 22:52:20 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/11/27 22:52:22 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/11/27 22:52:27 INFO input.FileInputFormat: Total input paths to process : 1
16/11/27 22:52:28 INFO mapreduce.JobSubmitter: number of splits:1
16/11/27 22:52:29 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1480267251741_0001
16/11/27 22:52:32 INFO impl.YarnClientImpl: Submitted application application_1480267251741_0001
16/11/27 22:52:33 INFO mapreduce.Job: The url to track the job: http://hadoop-VirtualBox:8088/proxy/application_1480267251741_0001/
16/11/27 22:52:33 INFO mapreduce.Job: Running job: job_1480267251741_0001
16/11/27 22:53:20 INFO mapreduce.Job: Job job_1480267251741_0001 running in uber mode : false
16/11/27 22:53:20 INFO mapreduce.Job:  map 0% reduce 0%
16/11/27 22:53:45 INFO mapreduce.Job:  map 100% reduce 0%
16/11/27 22:54:13 INFO mapreduce.Job:  map 100% reduce 100%
16/11/27 22:54:15 INFO mapreduce.Job: Job job_1480267251741_0001 completed successfully
16/11/27 22:54:16 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=124
		FILE: Number of bytes written=237911
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=150
		HDFS: Number of bytes written=66
		HDFS: Number of read operations=6
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters
		Launched map tasks=1
		Launched reduce tasks=1
		Data-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=21062
		Total time spent by all reduces in occupied slots (ms)=25271
		Total time spent by all map tasks (ms)=21062
		Total time spent by all reduce tasks (ms)=25271
		Total vcore-milliseconds taken by all map tasks=21062
		Total vcore-milliseconds taken by all reduce tasks=25271
		Total megabyte-milliseconds taken by all map tasks=21567488
		Total megabyte-milliseconds taken by all reduce tasks=25877504
	Map-Reduce Framework
		Map input records=2
		Map output records=10
		Map output bytes=98
		Map output materialized bytes=124
		Input split bytes=92
		Combine input records=0
		Combine output records=0
		Reduce input groups=8
		Reduce shuffle bytes=124
		Reduce input records=10
		Reduce output records=8
		Spilled Records=20
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=564
		CPU time spent (ms)=4300
		Physical memory (bytes) snapshot=330784768
		Virtual memory (bytes) snapshot=3804205056
		Total committed heap usage (bytes)=211812352
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=58
	File Output Format Counters
		Bytes Written=66
hadoop@hadoop-VirtualBox:~$
Once the program has run successfully, go to HDFS and check the part file in the output directory. Below is the output of the WordCount program.
hadoop@hadoop-VirtualBox:~$ hdfs dfs -cat /output/part-r-00000
This	2
first	1
is	2
mapreduce	1
my	1
program	1
test	1
wordcount	1
hadoop@hadoop-VirtualBox:~$
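Each line of the part file is a word and its count separated by a tab, which is the default separator used by Hadoop's text output. If you want to consume this output from another program, a few lines of plain Java can parse the format (a sketch with no Hadoop dependencies; the sample lines are hard-coded here rather than read from HDFS, and the class name `PartFileReader` is chosen for this example):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PartFileReader {
    // Parse "word<TAB>count" lines, the format shown in part-r-00000.
    public static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            counts.put(fields[0], Integer.parseInt(fields[1]));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample lines copied from the WordCount output above.
        List<String> partFile = List.of("This\t2", "first\t1", "is\t2");
        parse(partFile).forEach((w, c) -> System.out.println(w + " -> " + c));
    }
}
```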