从简单的 Java 程序调用 mapreduce 作业
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/9849776/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverFlow
Calling a mapreduce job from a simple java program
提问by Ravi Trivedi
I have been trying to call a MapReduce job from a simple Java program in the same package. I tried to reference the MapReduce jar file in my Java program and call it using the runJar(String args[]) method, also passing the input and output paths for the MapReduce job, but the program didn't work.
我一直在尝试从同一个包中的一个简单的 Java 程序调用一个 MapReduce 作业。我试图在我的 Java 程序中引用 MapReduce jar 文件，并通过同时传递 MapReduce 作业的输入和输出路径，使用 runJar(String args[]) 方法调用它，但是程序不能工作。
How do I run such a program where I just pass the input, output and jar paths to its main method? Is it possible to run a MapReduce job (jar) this way? I want to do this because I want to run several MapReduce jobs one after another, where my Java program will call each such job by referring to its jar file. If this is possible, I might as well just use a simple servlet to do the calling and refer to its output files for graphing purposes.
我如何运行这样一个程序：只把输入、输出和 jar 路径传给它的 main 方法？是否可以通过这种方式运行一个 MapReduce 作业（jar）？我想这样做是因为我想一个接一个地运行几个 MapReduce 作业，我的 Java 程序通过引用各自的 jar 文件来调用每个这样的作业。如果这可行，我也可以只用一个简单的 servlet 来做这种调用，并把它的输出文件用于绘图。
/*
* To change this template, choose Tools | Templates
* and open the template in the editor.
*/
/**
*
* @author root
*/
import org.apache.hadoop.util.RunJar;
import java.util.*;

public class callOther {

    public static void main(String args[]) throws Throwable {
        ArrayList arg = new ArrayList();
        String output = "/root/Desktp/output";

        arg.add("/root/NetBeansProjects/wordTool/dist/wordTool.jar");
        arg.add("/root/Desktop/input");
        arg.add(output);

        RunJar.main((String[]) arg.toArray(new String[0]));
    }
}
采纳答案by Thomas Jungblut
Oh please don't do it with runJar, the Java API is very good.
哦，请不要用 runJar 来做，Java API 非常好。
See how you can start a job from normal code:
来看看如何从普通代码启动一个作业：
// create a configuration
Configuration conf = new Configuration();
// create a new job based on the configuration
Job job = new Job(conf);
// here you have to put your mapper class
job.setMapperClass(Mapper.class);
// here you have to put your reducer class
job.setReducerClass(Reducer.class);
// here you have to set the jar which is containing your
// map/reduce class, so you can use the mapper class
job.setJarByClass(Mapper.class);
// key/value of your reducer output
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
// this is setting the format of your input, can be TextInputFormat
job.setInputFormatClass(SequenceFileInputFormat.class);
// same with output
job.setOutputFormatClass(TextOutputFormat.class);
// here you can set the path of your input
SequenceFileInputFormat.addInputPath(job, new Path("files/toMap/"));
// this deletes possible output paths to prevent job failures
FileSystem fs = FileSystem.get(conf);
Path out = new Path("files/out/processed/");
fs.delete(out, true);
// finally set the empty out path
TextOutputFormat.setOutputPath(job, out);
// this waits until the job completes and prints debug out to STDOUT or whatever
// has been configured in your log4j properties.
job.waitForCompletion(true);
If you are using an external cluster, you have to put the following information into your configuration:
如果您使用的是外部集群，则必须将以下信息添加到您的配置中：
// this should be like defined in your mapred-site.xml
conf.set("mapred.job.tracker", "jobtracker.com:50001");
// like defined in hdfs-site.xml
conf.set("fs.default.name", "hdfs://namenode.com:9000");
This should be no problem when the hadoop-core.jar is in your application container's classpath. But I think you should put some kind of progress indicator on your web page, because it may take minutes to hours to complete a Hadoop job ;)
当 hadoop-core.jar 位于您的应用程序容器的类路径中时，这应该没有问题。但我认为你应该在你的网页上放置某种进度指示器，因为完成一个 Hadoop 作业可能需要几分钟到几小时 ;)
For YARN (> Hadoop 2)
对于 YARN (> Hadoop 2)
For YARN, the following configurations need to be set.
对于YARN,需要设置如下配置。
// this should be like defined in your yarn-site.xml
conf.set("yarn.resourcemanager.address", "yarn-manager.com:50001");
// framework is now "yarn", should be defined like this in mapred-site.xm
conf.set("mapreduce.framework.name", "yarn");
// like defined in hdfs-site.xml
conf.set("fs.default.name", "hdfs://namenode.com:9000");
回答by Chris White
I can't think of many ways you can do this without involving the hadoop-core library (or indeed like @ThomasJungblut said, why you would want to).
我想不出很多方法可以在不涉及 hadoop-core 库的情况下做到这一点(或者确实像@ThomasJungblut 所说的那样,你为什么想要这样做)。
But if you absolutely must, you could set up an Oozie server with a workflow for your job, and then use the Oozie webservice interface to submit the workflow to Hadoop.
但是,如果您绝对必须,您可以为您的工作设置一个带有工作流的 Oozie 服务器,然后使用 Oozie Web 服务接口将工作流提交到 Hadoop。
- http://yahoo.github.com/oozie/
- http://yahoo.github.com/oozie/releases/2.3.0/WorkflowFunctionalSpec.html#a11.3.1_Job_Submission
Again, this seems like a lot of work for something that could just be resolved using Thomas's answer (include the hadoop-core jar and use his code snippet).
同样,对于可以使用 Thomas 的答案(包括 hadoop-core jar 并使用他的代码片段)来解决的问题,这似乎需要做很多工作
回答by faridasabry
Another way, for jobs already implemented in the Hadoop examples (this also requires the Hadoop jars to be imported): just call the static main function of the desired job class with the appropriate String[] of arguments.
另一种方法适用于已经在 Hadoop 示例中实现的作业（同样需要导入 Hadoop 的 jar 包）：只需使用合适的 String[] 参数调用所需作业类的静态 main 函数。
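For instance, assuming the classic org.apache.hadoop.examples.WordCount from the Hadoop examples jar is on the classpath, its driver could be invoked like this (the HDFS paths are placeholders):
import org.apache.hadoop.examples.WordCount;

public class RunExampleJob {
    public static void main(String[] args) throws Exception {
        // Paths are placeholders. Note that many example drivers end with
        // System.exit(), which also terminates the JVM that called them.
        WordCount.main(new String[] { "/user/hduser/input", "/user/hduser/output" });
    }
}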
回答by RS Software -Competency Team
Calling MapReduce job from java web application (Servlet)
从 Java Web 应用程序 (Servlet) 调用 MapReduce 作业
You can call a MapReduce job from a web application using the Java API. Here is a small example of calling a MapReduce job from a servlet. The steps are given below:
您可以使用 Java API 从 Web 应用程序调用 MapReduce 作业。这是一个从 servlet 调用 MapReduce 作业的小示例。步骤如下:
Step 1: First create a MapReduce driver servlet class. Also develop your map and reduce classes. Here goes a sample code snippet:
步骤 1：首先创建一个 MapReduce 驱动 servlet 类，同时开发你的 map 和 reduce 类。这是一个示例代码片段：
CallJobFromServlet.java
public class CallJobFromServlet extends HttpServlet {

    protected void doPost(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {

        Configuration conf = new Configuration();
        // Replace CallJobFromServlet with your own servlet/driver class
        Job job = new Job(conf, "CallJobFromServlet");
        job.setJarByClass(CallJobFromServlet.class);
        job.setJobName("Job Name");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(Map.class);       // replace Map.class with your Mapper class
        job.setNumReduceTasks(30);
        job.setReducerClass(Reducer.class);  // replace Reducer.class with your Reducer class
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Job input path
        FileInputFormat.addInputPath(job, new Path("hdfs://localhost:54310/user/hduser/input/"));
        // Job output path
        FileOutputFormat.setOutputPath(job, new Path("hdfs://localhost:54310/user/hduser/output"));

        try {
            job.waitForCompletion(true);
        } catch (InterruptedException | ClassNotFoundException e) {
            throw new ServletException(e);
        }
    }
}
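A caveat not in the original answer: job.waitForCompletion(true) blocks the HTTP request until the job finishes, which can take a long time. A hedged alternative is to submit the job asynchronously and poll its status from a later request, for example:
import org.apache.hadoop.mapreduce.Job;

public class AsyncSubmitHelper {

    // Hands the job to the cluster without waiting for it and returns its id,
    // so a later request can call job.isComplete() / job.isSuccessful().
    public static String submitWithoutBlocking(Job job) throws Exception {
        job.submit();                        // returns as soon as the job is submitted
        return job.getJobID().toString();    // store this, e.g. in the HttpSession
    }
}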
Step 2: Place all the related jar files (Hadoop and application-specific jars) inside the lib folder of the web server (e.g. Tomcat). This is mandatory for accessing the Hadoop configuration (the Hadoop 'conf' folder has configuration xml files, i.e. core-site.xml, hdfs-site.xml etc.). Just copy the jars from the Hadoop lib folder to the web server (Tomcat) lib directory. The list of jar names is as follows:
第 2 步：将所有相关的 jar 文件（Hadoop 以及特定于应用程序的 jar）放在 Web 服务器（例如 Tomcat）的 lib 文件夹中。这是访问 Hadoop 配置所必需的（Hadoop 的 'conf' 文件夹里有配置 xml 文件，即 core-site.xml、hdfs-site.xml 等）。只需将这些 jar 从 Hadoop 的 lib 文件夹复制到 Web 服务器（Tomcat）的 lib 目录。jar 名称列表如下：
1. commons-beanutils-1.7.0.jar
2. commons-beanutils-core-1.8.0.jar
3. commons-cli-1.2.jar
4. commons-collections-3.2.1.jar
5. commons-configuration-1.6.jar
6. commons-httpclient-3.0.1.jar
7. commons-io-2.1.jar
8. commons-lang-2.4.jar
9. commons-logging-1.1.1.jar
10. hadoop-client-1.0.4.jar
11. hadoop-core-1.0.4.jar
12. jackson-core-asl-1.8.8.jar
13. jackson-mapper-asl-1.8.8.jar
14. jersey-core-1.8.jar
Step 3: Deploy your web application into web server (in 'webapps' folder for Tomcat).
第 3 步:将您的 Web 应用程序部署到 Web 服务器(在 Tomcat 的“webapps”文件夹中)。
Step 4: Create a jsp file and link the servlet class (CallJobFromServlet.java) in the form's action attribute. Here goes a sample code snippet:
步骤4:创建一个jsp文件并在form action属性中链接servlet类(CallJobFromServlet.java)。这是一个示例代码片段:
Index.jsp
<form id="trigger_hadoop" name="trigger_hadoop" action="./CallJobFromServlet ">
<span class="back">Trigger Hadoop Job from Web Page </span>
<input type="submit" name="submit" value="Trigger Job" />
</form>
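Note that the form's action (./CallJobFromServlet) only works if the servlet is actually mapped to that URL. Assuming a Servlet 3.0+ container (e.g. Tomcat 7 or later), a @WebServlet annotation on the class from Step 1 is enough; on older containers the equivalent <servlet-mapping> would go into web.xml instead:
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;

// Maps the servlet from Step 1 to the URL the form posts to.
@WebServlet("/CallJobFromServlet")
public class CallJobFromServlet extends HttpServlet {
    // ... doPost(...) as shown in Step 1 ...
}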
回答by techlearner
You can do it this way:
你可以这样做：
public class Test {

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new YourJob(), args);
        System.exit(res);
    }
}
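For this to compile, YourJob has to implement the Tool interface. A minimal sketch (the identity Mapper/Reducer and the positional input/output arguments are placeholders) might look like this:
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;

public class YourJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() returns the Configuration that ToolRunner.run() passed in,
        // already populated with any -D options from the command line
        Job job = new Job(getConf(), "your job");
        job.setJarByClass(YourJob.class);
        job.setMapperClass(Mapper.class);       // identity mapper, replace with your own
        job.setReducerClass(Reducer.class);     // identity reducer, replace with your own
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }
}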
回答by Jiang Libo
Because map and reduce run on different machines, all your referenced classes and jars must move from machine to machine.
因为 map 和 reduce 在不同的机器上运行,所以所有引用的类和 jar 必须在机器之间移动。
If you have packaged a jar and run it on your desktop, @ThomasJungblut's answer is OK. But if you run it in Eclipse (right click your class and run), it doesn't work.
如果您有 jar 包并在桌面上运行,@ThomasJungblut 的回答是可以的。但是如果你在 Eclipse 中运行,右键单击你的类并运行,它不起作用。
Instead of:
代替:
job.setJarByClass(Mapper.class);
Use:
用:
job.setJar("build/libs/hdfs-javac-1.0.jar");
At the same time, your jar's manifest must include the Main-Class property, which is your main class.
同时,您的 jar 清单必须包含 Main-Class 属性,这是您的主类。
For Gradle users, you can put these lines in build.gradle:
对于 gradle 用户,可以将这些行放在 build.gradle 中:
jar {
    manifest {
        attributes("Main-Class": mainClassName)
    }
}