How to Run a Hadoop MapReduce Program on Ubuntu 16.04

Date: 2020-03-05 15:29:21  Source: igfitidea

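This tutorial explains how to run a MapReduce program.
MapReduce is one of the core components of Apache Hadoop; it is Hadoop's processing layer.
So before showing how to run a MapReduce program, let me briefly explain what MapReduce is.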

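MapReduce is a system for processing large data sets in parallel.
It reduces the data to a result and creates a summary of the data.
A MapReduce program has two parts: a Mapper and a Reducer.
The Reducer starts its work only after the Mapper has finished.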

Mapper: maps input key/value pairs to a set of intermediate key/value pairs.

Reducer: reduces a set of intermediate values that share a key to a smaller set of values.

In the WordCount MapReduce program, any text file can be supplied as the input.
When the MapReduce program starts, it goes through the following phases:

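Splitting: each line of the input file is split into words.

Mapping: a key/value pair is formed for every word, where the word is the key and 1 is the value assigned to it.

Shuffling: key/value pairs that have the same key are grouped together.

Reducing: the values of each identical key are added together.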
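
The four phases can be imitated in plain Java without a cluster. The following is only a minimal sketch of the idea, not the Hadoop API: the class name WordCountPhases and its method names are made up for illustration, and the whole "job" runs in a single JVM.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** Plain-Java walk-through of the WordCount phases (hypothetical names, no Hadoop required). */
class WordCountPhases {

    /** Splitting + mapping: break a line into words and emit a (word, 1) pair for each word. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<String, Integer>(tokens.nextToken(), 1));
        }
        return pairs;
    }

    /** Shuffling: collect the values of identical keys; the TreeMap keeps the keys sorted. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return groups;
    }

    /** Reducing: add up all values that were grouped under one key. */
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    /** Runs all phases over the given lines and returns word -> count, sorted by word. */
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            counts.put(group.getKey(), reduce(group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Two short sample lines used as illustrative input.
        List<String> lines = Arrays.asList(
                "This is my first mapreduce test",
                "This is wordcount program");
        wordCount(lines).forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

In a real job the framework performs the shuffling between the map and reduce tasks; only the map and reduce logic is written by the programmer.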

Running the MapReduce program

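MapReduce programs are usually written in Java.
Most developers use the Eclipse IDE to write them.
So this tutorial shows how to export a MapReduce program from the Eclipse IDE to a JAR file and run it on a Hadoop cluster.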

My MapReduce program, a project named WordCount, is already open in the Eclipse IDE.

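To run this MapReduce program on the Hadoop cluster, we export the project as a JAR file.
In the Eclipse IDE, open the "File" menu and click "Export".
Under the Java option, select "JAR file" and click "Next".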

Select the WordCount project and give the JAR file a path and a name; I named it wordcount.jar. Then click "Next" twice.

Now click "Browse" and select the main class, then click "Finish" to create the JAR file.
If a warning dialog appears, just click "OK".

Check that the Hadoop cluster is up and running.

Command: jps

hadoop@hadoop-VirtualBox:~$ jps
3008 NodeManager
3924 Jps
2885 ResourceManager
2505 DataNode
3082 JobHistoryServer
2716 SecondaryNameNode
2383 NameNode
hadoop@hadoop-VirtualBox:~$

Next, we copy the input file for the WordCount program into HDFS.

hadoop@hadoop-VirtualBox:~$ hdfs dfs -put input /
hadoop@hadoop-VirtualBox:~$ hdfs dfs -cat /input
This is my first mapreduce test
This is wordcount program
hadoop@hadoop-VirtualBox:~$

Now run the wordcount.jar file with the following command.

Note: Because we selected the main class when exporting wordcount.jar, there is no need to specify the main class in the command.
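
The reason is that the main class chosen during the export is recorded as the Main-Class attribute of the JAR manifest, and the hadoop jar launcher looks there first; only if the attribute is missing does it expect the class name as the first argument after the JAR. The sketch below imitates that decision in plain Java. It is a simplified illustration, not Hadoop's actual code, and the names MainClassResolver and WordCount (as the main class) are assumptions made for the example.

```java
import java.util.jar.Attributes;
import java.util.jar.Manifest;

/** Simplified illustration of how a launcher can choose the class to run from a JAR. */
class MainClassResolver {

    /**
     * Returns the Main-Class recorded in the manifest. If there is none,
     * the first argument after the JAR name is treated as the class name.
     */
    static String resolve(Manifest manifest, String[] argsAfterJar) {
        String mainClass = (manifest == null)
                ? null
                : manifest.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
        if (mainClass != null) {
            return mainClass;       // form: hadoop jar wordcount.jar /input /output
        }
        if (argsAfterJar.length == 0) {
            throw new IllegalArgumentException("no Main-Class in manifest and no class name given");
        }
        return argsAfterJar[0];     // form: hadoop jar wordcount.jar WordCount /input /output
    }

    public static void main(String[] args) {
        // Manifest as Eclipse would write it when a main class is selected (class name assumed).
        Manifest withMain = new Manifest();
        withMain.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");
        withMain.getMainAttributes().put(Attributes.Name.MAIN_CLASS, "WordCount");
        System.out.println(resolve(withMain, new String[] {"/input", "/output"}));

        // Manifest without Main-Class: the class name has to be passed on the command line.
        System.out.println(resolve(new Manifest(), new String[] {"WordCount", "/input", "/output"}));
    }
}
```

If the JAR had been exported without a main class, the command would need the class name between the JAR name and the paths.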

Command: hadoop jar wordcount.jar /input /output

hadoop@hadoop-VirtualBox:~$ hadoop jar wordcount.jar /input /output
16/11/27 22:52:20 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/11/27 22:52:22 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/11/27 22:52:27 INFO input.FileInputFormat: Total input paths to process : 1
16/11/27 22:52:28 INFO mapreduce.JobSubmitter: number of splits:1
16/11/27 22:52:29 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1480267251741_0001
16/11/27 22:52:32 INFO impl.YarnClientImpl: Submitted application application_1480267251741_0001
16/11/27 22:52:33 INFO mapreduce.Job: The url to track the job: http://hadoop-VirtualBox:8088/proxy/application_1480267251741_0001/
16/11/27 22:52:33 INFO mapreduce.Job: Running job: job_1480267251741_0001
16/11/27 22:53:20 INFO mapreduce.Job: Job job_1480267251741_0001 running in uber mode : false
16/11/27 22:53:20 INFO mapreduce.Job:  map 0% reduce 0%
16/11/27 22:53:45 INFO mapreduce.Job:  map 100% reduce 0%
16/11/27 22:54:13 INFO mapreduce.Job:  map 100% reduce 100%
16/11/27 22:54:15 INFO mapreduce.Job: Job job_1480267251741_0001 completed successfully
16/11/27 22:54:16 INFO mapreduce.Job: Counters: 49
          File System Counters
                    FILE: Number of bytes read=124
                    FILE: Number of bytes written=237911
                    FILE: Number of read operations=0
                    FILE: Number of large read operations=0
                    FILE: Number of write operations=0
                    HDFS: Number of bytes read=150
                    HDFS: Number of bytes written=66
                    HDFS: Number of read operations=6
                    HDFS: Number of large read operations=0
                    HDFS: Number of write operations=2
          Job Counters
                    Launched map tasks=1
                    Launched reduce tasks=1
                    Data-local map tasks=1
                    Total time spent by all maps in occupied slots (ms)=21062
                    Total time spent by all reduces in occupied slots (ms)=25271
                    Total time spent by all map tasks (ms)=21062
                    Total time spent by all reduce tasks (ms)=25271
                    Total vcore-milliseconds taken by all map tasks=21062
                    Total vcore-milliseconds taken by all reduce tasks=25271
                    Total megabyte-milliseconds taken by all map tasks=21567488
                    Total megabyte-milliseconds taken by all reduce tasks=25877504
          Map-Reduce Framework
                    Map input records=2
                    Map output records=10
                    Map output bytes=98
                    Map output materialized bytes=124
                    Input split bytes=92
                    Combine input records=0
                    Combine output records=0
                    Reduce input groups=8
                    Reduce shuffle bytes=124
                    Reduce input records=10
                    Reduce output records=8
                    Spilled Records=20
                    Shuffled Maps =1
                    Failed Shuffles=0
                    Merged Map outputs=1
                    GC time elapsed (ms)=564
                    CPU time spent (ms)=4300
                    Physical memory (bytes) snapshot=330784768
                    Virtual memory (bytes) snapshot=3804205056
                    Total committed heap usage (bytes)=211812352
          Shuffle Errors
                    BAD_ID=0
                    CONNECTION=0
                    IO_ERROR=0
                    WRONG_LENGTH=0
                    WRONG_MAP=0
                    WRONG_REDUCE=0
          File Input Format Counters
                    Bytes Read=58
          File Output Format Counters
                    Bytes Written=66
hadoop@hadoop-VirtualBox:~$
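
Several of the counters above can be recomputed by hand from the two input lines, which is a quick way to convince yourself that the job processed what you expected. The sketch below is plain Java and does not use Hadoop. It assumes that each input line ends with a single newline, that the mapper emits one pair per whitespace-separated word, and that every output line has the form word, tab, count, newline. The class name CounterCheck is made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Recomputes a few WordCount job counters by hand (hypothetical helper, no Hadoop). */
class CounterCheck {

    static Map<String, Long> expectedCounters(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();   // one entry per distinct word
        long bytesRead = 0;
        long mapOutputRecords = 0;
        for (String line : lines) {
            // Assumption: every non-empty line is UTF-8 and ends with a single '\n'.
            bytesRead += line.getBytes(StandardCharsets.UTF_8).length + 1;
            for (String word : line.trim().split("\\s+")) {
                counts.merge(word, 1, Integer::sum);     // the mapper emits (word, 1)
                mapOutputRecords++;
            }
        }
        long bytesWritten = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            // Assumption: each output line is "word<TAB>count\n".
            String outLine = entry.getKey() + "\t" + entry.getValue() + "\n";
            bytesWritten += outLine.getBytes(StandardCharsets.UTF_8).length;
        }
        Map<String, Long> counters = new LinkedHashMap<>();
        counters.put("Map input records", (long) lines.size());
        counters.put("Map output records", mapOutputRecords);
        counters.put("Reduce input groups", (long) counts.size());
        counters.put("Reduce output records", (long) counts.size());
        counters.put("Bytes Read", bytesRead);
        counters.put("Bytes Written", bytesWritten);
        return counters;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "This is my first mapreduce test",
                "This is wordcount program");
        expectedCounters(lines).forEach((name, value) -> System.out.println(name + "=" + value));
    }
}
```

For the two input lines this gives 2 map input records, 10 map output records, 8 reduce input groups, 8 reduce output records, 58 bytes read and 66 bytes written, which agrees with the counters in the log above. The zero values for the Combine counters indicate that no combiner ran in this job.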

After the program has run successfully, go to HDFS and check the part file in the output directory.

Below is the output of the WordCount program.

hadoop@hadoop-VirtualBox:~$ hdfs dfs -cat /output/part-r-00000
This       2
first      1
is         2
mapreduce  1
my         1
program    1
test       1
wordcount  1
hadoop@hadoop-VirtualBox:~$