Launch a mapreduce job from Eclipse

Disclaimer: this content is taken from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must likewise follow CC BY-SA and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/11236305/

Launch a mapreduce job from eclipse

Tags: eclipse, hadoop, mapreduce

Asked by Tucker

I've written a mapreduce program in Java, which I can submit to a remote cluster running in distributed mode. Currently, I submit the job using the following steps:

  1. export the mapreduce job as a jar (e.g. myMRjob.jar)
  2. submit the job to the remote cluster using the following shell command: hadoop jar myMRjob.jar

I would like to submit the job directly from Eclipse when I try to run the program. How can I do this?

I am currently using CDH3, and an abridged version of my conf is:

conf.set("hbase.zookeeper.quorum", getZookeeperServers());
conf.set("fs.default.name","hdfs://namenode/");
conf.set("mapred.job.tracker", "jobtracker:jtPort");
Job job = new Job(conf, "COUNT ROWS");
job.setJarByClass(CountRows.class);

// Set up Mapper
TableMapReduceUtil.initTableMapperJob(inputTable, scan, 
    CountRows.MyMapper.class, ImmutableBytesWritable.class,  
    ImmutableBytesWritable.class, job);  

// Set up Reducer
job.setReducerClass(CountRows.MyReducer.class);
job.setNumReduceTasks(16);

// Setup Overall Output
job.setOutputFormatClass(MultiTableOutputFormat.class);

job.submit();

When I run this directly from Eclipse, the job is launched but Hadoop cannot find the mappers/reducers. I get the following errors:

12/06/27 23:23:29 INFO mapred.JobClient:  map 0% reduce 0%  
12/06/27 23:23:37 INFO mapred.JobClient: Task Id :   attempt_201206152147_0645_m_000000_0, Status : FAILED  
java.lang.RuntimeException: java.lang.ClassNotFoundException:   com.mypkg.mapreduce.CountRows$MyMapper  
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:996)  
    at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:212)  
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:602)  
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)   
    at org.apache.hadoop.mapred.Child.run(Child.java:270)  
    at java.security.AccessController.doPrivileged(Native Method)  
    at javax.security.auth.Subject.doAs(Subject.java:396)  
    at   org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)  
    at org.apache.hadoop.mapred.Child.main(Child.java:264)  
...

Does anyone know how to get past these errors? If I can fix this, I can integrate more MR jobs into my scripts which would be awesome!

Accepted answer by Chris White

If you're submitting the Hadoop job from within the Eclipse project that defines the classes for the job, then you most probably have a classpath problem.

The job.setJarByClass(CountRows.class) call is finding the class file on the build classpath, and not in CountRows.jar (which may or may not have been built yet, or even be on the classpath).

You should be able to confirm this by printing out the result of job.getJar() after you call job.setJarByClass(..); if it doesn't display a jar file path, then it has found the build class rather than the jar'd class.
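
A minimal sketch of that check, continuing the driver snippet from the question (the printed messages are illustrative only):

Job job = new Job(conf, "COUNT ROWS");
job.setJarByClass(CountRows.class);

// If no jar path comes back, the class was resolved from the build classpath,
// so there is no jar to ship to the cluster and the remote tasks will fail
// to load the mapper/reducer classes.
String jobJar = job.getJar();
if (jobJar == null) {
    System.err.println("No job jar resolved - expect ClassNotFoundException on the cluster");
} else {
    System.out.println("Submitting with job jar: " + jobJar);
}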

Answered by Peeter Kokk

What worked for me was exporting a runnable JAR (the difference from a plain JAR is that a runnable JAR defines the class that has the main method) and selecting the "packaging required libraries into JAR" option. Choosing the "extracting..." option leads to duplicate errors, and it also has to extract the class files from the jars, which in my case ultimately did not resolve the class-not-found exception.

After that, you can just set the jar, as was suggested by Chris White. For Windows it would look like this: job.setJar("C:\\MyJar.jar");
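
For context, a minimal sketch of where that call sits in the driver from the question; the jar path is just an example and should point at wherever the runnable JAR was exported:

// Point the job at the exported runnable JAR explicitly, instead of relying on
// setJarByClass resolving a jar from the build classpath.
Job job = new Job(conf, "COUNT ROWS");
job.setJar("C:\\MyJar.jar");   // path to the runnable JAR exported from Eclipse

// ... mapper/reducer/output setup as in the question ...

job.submit();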

If it helps anybody, I made a presentation on what I learned from creating a MapReduce project and running it on Hadoop 2.2.0 in Windows 7 (in Eclipse Luna).

Answered by Kshirsagar Naidu

I have used the method from the following website to configure a Map/Reduce project of mine so that it runs from Eclipse (without exporting the project as a JAR): Configuring Eclipse to run Hadoop Map/Reduce project

Note: If you decide to debug your program, your Mapper class and Reducer class won't be debuggable.

Hope it helps. :)
