How to run a Spark Java program

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original source: http://stackoverflow.com/questions/22298192/


How to run a Spark Java program

Tags: java, apache-spark

Asked by Pooja3101

I have written a Java program for Spark. But how do I compile and run it from the Unix command line? Do I have to include any jar when compiling in order to run it?


Answered by Viacheslav Rodionov

Combining steps from the official Quick Start Guide and Launching Spark on YARN, we get:


We'll create a very simple Spark application, SimpleApp.java:


/*** SimpleApp.java ***/
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;

public class SimpleApp {
  public static void main(String[] args) {
    String logFile = "$YOUR_SPARK_HOME/README.md"; // Should be some file on your system
    JavaSparkContext sc = new JavaSparkContext("local", "Simple App",
      "$YOUR_SPARK_HOME", new String[]{"target/simple-project-1.0.jar"});
    JavaRDD<String> logData = sc.textFile(logFile).cache();

    long numAs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("a"); }
    }).count();

    long numBs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("b"); }
    }).count();

    System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
  }
}

This program just counts the number of lines containing ‘a' and the number containing ‘b' in a text file. Note that you'll need to replace $YOUR_SPARK_HOME with the location where Spark is installed. As with the Scala example, we initialize a SparkContext, though we use the special JavaSparkContext class to get a Java-friendly one. We also create RDDs (represented by JavaRDD) and run transformations on them. Finally, we pass functions to Spark by creating classes that extend spark.api.java.function.Function. The Java programming guide describes these differences in more detail.

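The example above uses anonymous inner classes because it targets Spark 0.9 on Java 7. As a hedged aside (not part of the original guide), the same two filters can be written with Java 8 lambdas on any Spark version whose API accepts them, since Function is a single-abstract-method interface; the class name, file path and app name below are placeholders:

/*** SimpleAppLambda.java — a minimal sketch, assuming Spark 1.x+ and Java 8 ***/
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SimpleAppLambda {
  public static void main(String[] args) {
    String logFile = "README.md"; // any text file on your system
    SparkConf conf = new SparkConf().setAppName("Simple App").setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<String> logData = sc.textFile(logFile).cache();

    // Function is a SAM interface, so a lambda can replace the anonymous class
    long numAs = logData.filter(s -> s.contains("a")).count();
    long numBs = logData.filter(s -> s.contains("b")).count();

    System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
    sc.stop();
  }
}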

To build the program, we also write a Maven pom.xml file that lists Spark as a dependency. Note that Spark artifacts are tagged with a Scala version.


<project>
  <groupId>edu.berkeley</groupId>
  <artifactId>simple-project</artifactId>
  <modelVersion>4.0.0</modelVersion>
  <name>Simple Project</name>
  <packaging>jar</packaging>
  <version>1.0</version>
  <repositories>
    <repository>
      <id>Akka repository</id>
      <url>http://repo.akka.io/releases</url>
    </repository>
  </repositories>
  <dependencies>
    <dependency> <!-- Spark dependency -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>0.9.0-incubating</version>
    </dependency>
  </dependencies>
</project>

If you also wish to read data from Hadoop's HDFS, you will also need to add a dependency on hadoop-client for your version of HDFS:


<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>...</version>
</dependency>
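With the hadoop-client dependency in place, an HDFS path can be passed to textFile exactly like a local path. The snippet below is a hedged illustration (not from the original answer); the namenode host, port and file path are placeholders:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HdfsReadExample {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local", "HDFS Read Example");
    // The hdfs:// URL is a placeholder; point it at a file that exists in your cluster
    JavaRDD<String> hdfsData = sc.textFile("hdfs://namenode:8020/user/me/README.md");
    System.out.println("Lines in HDFS file: " + hdfsData.count());
    sc.stop();
  }
}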

We lay out these files according to the canonical Maven directory structure:


$ find .
./pom.xml
./src
./src/main
./src/main/java
./src/main/java/SimpleApp.java

Now, we can execute the application using Maven:


$ mvn package
$ mvn exec:java -Dexec.mainClass="SimpleApp"
...
Lines with a: 46, Lines with b: 23

And then follow the steps from Launching Spark on YARN:


Building a YARN-Enabled Assembly JAR


We need a consolidated Spark JAR (which bundles all the required dependencies) to run Spark jobs on a YARN cluster. This can be built by setting the Hadoop version and SPARK_YARN environment variable, as follows:


SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly

The assembled JAR will be something like this: ./assembly/target/scala-2.10/spark-assembly_0.9.0-incubating-hadoop2.0.5.jar.


The build process now also supports new YARN versions (2.2.x). See below.


Preparations


  • Building a YARN-enabled assembly (see above).
  • The assembled jar can be installed into HDFS or used locally.
  • Your application code must be packaged into a separate JAR file.

If you want to test out the YARN deployment mode, you can use the current Spark examples. A spark-examples_2.10-0.9.0-incubating file can be generated by running:


sbt/sbt assembly 

NOTE: since the documentation you're reading is for Spark version 0.9.0-incubating, we are assuming here that you have downloaded Spark 0.9.0-incubating or checked it out of source control. If you are using a different version of Spark, the version numbers in the jar generated by the sbt package command will obviously be different.


Configuration


Most of the configs for Spark on YARN are the same as for other deployment modes. See the Configuration page for more information on those. The following are configs that are specific to Spark on YARN; a short sketch of setting some of them from Java code follows the system-properties list below.


Environment variables:


  • SPARK_YARN_USER_ENV, to add environment variables to the Spark processes launched on YARN. This can be a comma separated list of environment variables, e.g.
SPARK_YARN_USER_ENV="JAVA_HOME=/jdk64,FOO=bar"

System Properties:


  • spark.yarn.applicationMaster.waitTries, property to set the number of times the ApplicationMaster waits for the Spark master, and also the number of tries it waits for the SparkContext to be initialized. Default is 10.
  • spark.yarn.submit.file.replication, the HDFS replication level for the files uploaded into HDFS for the application. These include things like the Spark jar, the app jar, and any distributed cache files/archives.
  • spark.yarn.preserve.staging.files, set to true to preserve the staged files (Spark jar, app jar, distributed cache files) at the end of the job rather than delete them.
  • spark.yarn.scheduler.heartbeat.interval-ms, the interval in ms at which the Spark application master heartbeats into the YARN ResourceManager. Default is 5 seconds.
  • spark.yarn.max.worker.failures, the maximum number of worker failures before failing the application. Default is the number of workers requested times 2, with a minimum of 3.
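As a hedged illustration (not from the original documentation), in the Spark 0.9.x era these spark.yarn.* settings were picked up from JVM system properties, so they had to be set before the SparkContext was created; the values below are made up for the example:

/*** YarnConfiguredApp.java — a sketch, assuming Spark 0.9.x-style configuration via system properties ***/
import org.apache.spark.api.java.JavaSparkContext;

public class YarnConfiguredApp {
  public static void main(String[] args) {
    // Set before constructing the context so Spark reads them at startup; values are illustrative
    System.setProperty("spark.yarn.submit.file.replication", "3");
    System.setProperty("spark.yarn.preserve.staging.files", "true");
    System.setProperty("spark.yarn.max.worker.failures", "6");

    // "yarn-standalone" is the master string used by the 0.9.x YARN examples
    JavaSparkContext sc = new JavaSparkContext("yarn-standalone", "Yarn Configured App");
    // ... job logic goes here ...
    sc.stop();
  }
}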

Launching Spark on YARN


Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client-side) configuration files for the Hadoop cluster. This is used to connect to the cluster, write to the DFS and submit jobs to the resource manager.


There are two scheduler modes that can be used to launch a Spark application on YARN.


Launch the Spark application via the YARN Client in yarn-standalone mode.


The command to launch the YARN Client is as follows:


SPARK_JAR=<SPARK_ASSEMBLY_JAR_FILE> ./bin/spark-class org.apache.spark.deploy.yarn.Client \
  --jar <YOUR_APP_JAR_FILE> \
  --class <APP_MAIN_CLASS> \
  --args <APP_MAIN_ARGUMENTS> \
  --num-workers <NUMBER_OF_WORKER_MACHINES> \
  --master-class <ApplicationMaster_CLASS> \
  --master-memory <MEMORY_FOR_MASTER> \
  --worker-memory <MEMORY_PER_WORKER> \
  --worker-cores <CORES_PER_WORKER> \
  --name <application_name> \
  --queue <queue_name> \
  --addJars <any_local_files_used_in_SparkContext.addJar> \
  --files <files_for_distributed_cache> \
  --archives <archives_for_distributed_cache>

For example:


# Build the Spark assembly JAR and the Spark examples JAR
$ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly

# Configure logging
$ cp conf/log4j.properties.template conf/log4j.properties

# Submit Spark's ApplicationMaster to YARN's ResourceManager, and instruct Spark to run the SparkPi example
$ SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.0.5-alpha.jar \
    ./bin/spark-class org.apache.spark.deploy.yarn.Client \
      --jar examples/target/scala-2.10/spark-examples-assembly-0.9.0-incubating.jar \
      --class org.apache.spark.examples.SparkPi \
      --args yarn-standalone \
      --num-workers 3 \
      --master-memory 4g \
      --worker-memory 2g \
      --worker-cores 1

# Examine the output (replace $YARN_APP_ID in the following with the "application identifier" output by the previous command)
# (Note: YARN_APP_LOGS_DIR is usually /tmp/logs or $HADOOP_HOME/logs/userlogs depending on the Hadoop version.)
$ cat $YARN_APP_LOGS_DIR/$YARN_APP_ID/container*_000001/stdout
Pi is roughly 3.13794

The above starts a YARN Client program which starts the default Application Master. SparkPi will then be run as a child thread of the Application Master, and the YARN Client will periodically poll the Application Master for status updates and display them in the console. The client will exit once your application has finished running.


With this mode, your application is actually run on the remote machine where the Application Master runs. Thus applications that involve local interaction, e.g. spark-shell, will not work well.


Answered by psmith

I had the same question a few days ago and yesterday I managed to solve it.
This is what I did:


  1. Download sbt, then unzip and untar it: http://www.scala-sbt.org/download.html
  2. I downloaded the Spark prebuilt package for Hadoop 2, then unzipped and untarred it: http://www.apache.org/dyn/closer.cgi/spark/spark-1.0.2/spark-1.0.2-bin-hadoop2.tgz
  3. I created the standalone application SimpleApp.scala as described in http://spark.apache.org/docs/latest/quick-start.html#standalone-applications, with a proper simple.sbt file (just copied from the description) and the proper directory layout
  4. Make sure you have sbt in your PATH. Go to the directory with your application and build your package using sbt package
  5. Start the Spark master using SPARK_HOME_DIR/sbin/start-master.sh
  6. Go to localhost:8080 and make sure your server is running. Copy the link from the URL (from the server description, not localhost; it should be something with port 7077 or similar)
  7. Start workers using SPARK_HOME_DIR/bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT, where IP:PORT is the URL copied in step 6
  8. Deploy your application to the server: SPARK_HOME_DIR/bin/spark-submit --class "SimpleApp" --master URL target/scala-2.10/simple-project_2.10-1.0.jar (a Java counterpart of this SimpleApp is sketched after this list)
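Since the question asks about Java rather than Scala, here is a hedged Java counterpart of the SimpleApp used in step 8 (the class name and file path are placeholders, and the jar path in step 8 comes from the sbt build, so your own Java build artifact will be named differently). The master is deliberately left unset so that spark-submit's --master flag decides where it runs:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SimpleApp {
  public static void main(String[] args) {
    // No setMaster() here: spark-submit --master spark://IP:PORT supplies the cluster URL
    SparkConf conf = new SparkConf().setAppName("Simple App");
    JavaSparkContext sc = new JavaSparkContext(conf);

    long lines = sc.textFile("README.md").count(); // any text file visible to the driver
    System.out.println("Line count: " + lines);
    sc.stop();
  }
}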

That worked for me and I hope it will help you.
Pawel


Answered by Alexis Gamarra

In addition to the selected answer, if you want to connect to an external standalone Spark instance:


SparkConf conf = new SparkConf()
    .setAppName("Simple Application")
    .setMaster("spark://10.3.50.139:7077");

JavaSparkContext sc = new JavaSparkContext(conf);

Here you can find more "master" configurations, depending on where Spark is running: http://spark.apache.org/docs/latest/submitting-applications.html#master-urls

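Rather than hardcoding the cluster address, a common variant (an assumption of mine, not part of the original answer) is to take the master URL from the program arguments and fall back to local mode when none is given:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ConfigurableMasterApp {
  public static void main(String[] args) {
    // First argument may be e.g. spark://10.3.50.139:7077, yarn or local[*]
    String master = args.length > 0 ? args[0] : "local[*]";
    SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster(master);
    JavaSparkContext sc = new JavaSparkContext(conf);
    // ... job logic goes here ...
    sc.stop();
  }
}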

Answered by Binita Bharati

This answer is for Spark 2.3. If you want to test your Spark application locally, i.e. without the prerequisite of a Hadoop cluster, and even without having to start any of the standalone Spark services, you could do this:


JavaSparkContext jsc = new JavaSparkContext(new SparkConf().setAppName("Simple App"));

And then, to run your application locally:


$SPARK_HOME/bin/spark-submit --class SimpleApp --master local target/scala-2.10/simple-project_2.10-1.0.jar

For this to work, you just need to extract the Spark tar file into $SPARK_HOME, and set $SPARK_HOME in the Spark user's .profile.

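If you would rather skip spark-submit entirely (for example, to run the class straight from your IDE or via mvn exec:java), a hedged variant is to set the master to local in the code itself; the class name and file path are placeholders:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class LocalTestApp {
  public static void main(String[] args) {
    // local[*] runs Spark in-process using all available cores; no cluster services needed
    SparkConf conf = new SparkConf().setAppName("Simple App").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    long lineCount = sc.textFile("README.md").count(); // any local text file
    System.out.println("Line count: " + lineCount);
    sc.stop();
  }
}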