Java: Add jars to a Spark Job - spark-submit
Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license, link to the original post, and attribute it to the original authors (not me): StackOverflow
Original post: http://stackoverflow.com/questions/37132559/
Add jars to a Spark Job - spark-submit
Asked by YoYo
True ... it has been discussed quite a lot.
However, there is a lot of ambiguity in some of the answers provided ... including duplicating jar references in the jars/executor/driver configuration or options.
The ambiguous and/or omitted details
The following ambiguous, unclear, and/or omitted details should be clarified for each option:
- How ClassPath is affected
  - Driver
  - Executor (for tasks running)
  - Both
  - not at all
- Separation character: comma, colon, semicolon
- If provided files are automatically distributed
  - for the tasks (to each executor)
  - for the remote Driver (if run in cluster mode)
- Type of URI accepted: local file, hdfs, http, etc.
- If copied into a common location, where that location is (hdfs, local?)
The options it affects:
- --jars
- SparkContext.addJar(...) method
- SparkContext.addFile(...) method
- --conf spark.driver.extraClassPath=... or --driver-class-path ...
- --conf spark.driver.extraLibraryPath=..., or --driver-library-path ...
- --conf spark.executor.extraClassPath=...
- --conf spark.executor.extraLibraryPath=...
- not to forget, the last parameter of spark-submit is also a .jar file.
I am aware of where I can find the main Spark documentation, specifically about how to submit, the options available, and also the JavaDoc. However, that still left me with quite a few holes, although it did answer parts of my question.
I hope that it is not all that complex, and that someone can give me a clear and concise answer.
If I were to guess from the documentation, it seems that --jars, and the SparkContext addJar and addFile methods, are the ones that will automatically distribute files, while the other options merely modify the ClassPath.
Would it be safe to assume that for simplicity, I can add additional application jar files using the 3 main options at the same time:
spark-submit --jars additional1.jar,additional2.jar \
--driver-library-path additional1.jar:additional2.jar \
--conf spark.executor.extraLibraryPath=additional1.jar:additional2.jar \
--class MyClass main-application.jar
I found a nice article in an answer to another posting; however, I learned nothing new. The poster does make a good remark on the difference between the local driver (yarn-client) and the remote driver (yarn-cluster), which is definitely important to keep in mind.
Accepted answer by Yuval Itzchakov
ClassPath:
ClassPath is affected depending on what you provide. There are a couple of ways to set something on the classpath:
- spark.driver.extraClassPath or its alias --driver-class-path to set extra classpaths on the node running the driver.
- spark.executor.extraClassPath to set the extra classpath on the Worker nodes.
If you want a certain JAR to take effect on both the Master and the Worker, you have to specify it separately in BOTH flags.
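As an illustrative sketch, the executor-side setting can also be supplied programmatically before the context is created; the jar paths below are made up, and they must already exist at those locations on the workers, since extraClassPath does not ship any files:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ExtraClassPathSketch {
    public static void main(String[] args) {
        // Hypothetical jar paths; extraClassPath only modifies the classpath,
        // it does not distribute the files to the Worker nodes.
        SparkConf conf = new SparkConf()
                .setAppName("extra-classpath-sketch")
                .set("spark.executor.extraClassPath",
                        "/opt/libs/additional1.jar:/opt/libs/additional2.jar");
        // In client mode, spark.driver.extraClassPath cannot be set here because the
        // driver JVM is already running; use --driver-class-path or spark-defaults.conf.
        JavaSparkContext sc = new JavaSparkContext(conf);
        sc.stop();
    }
}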
Separation character:
Following the same rules as the JVM:
- Linux: a colon, :
  - e.g: --conf "spark.driver.extraClassPath=/opt/prog/hadoop-aws-2.7.1.jar:/opt/prog/aws-java-sdk-1.10.50.jar"
- Windows: a semicolon, ;
  - e.g: --conf "spark.driver.extraClassPath=/opt/prog/hadoop-aws-2.7.1.jar;/opt/prog/aws-java-sdk-1.10.50.jar"
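If you build such a value in application code, a small sketch like the following (reusing the jar paths from the example above) avoids hard-coding the separator by relying on the JVM's own constant:

import java.io.File;
import java.util.Arrays;
import java.util.List;

public class ClassPathJoiner {
    public static void main(String[] args) {
        // File.pathSeparator is ":" on Linux and ";" on Windows,
        // the same separator rules that the extraClassPath settings follow.
        List<String> jars = Arrays.asList(
                "/opt/prog/hadoop-aws-2.7.1.jar",
                "/opt/prog/aws-java-sdk-1.10.50.jar");
        String extraClassPath = String.join(File.pathSeparator, jars);
        System.out.println("spark.driver.extraClassPath=" + extraClassPath);
    }
}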
File distribution:
This depends on the mode which you're running your job under:
Client mode - Spark fires up a Netty HTTP server which distributes the files on startup to each of the worker nodes. You can see that when you start your Spark job:
16/05/08 17:29:12 INFO HttpFileServer: HTTP File server directory is /tmp/spark-48911afa-db63-4ffc-a298-015e8b96bc55/httpd-84ae312b-5863-4f4c-a1ea-537bfca2bc2b
16/05/08 17:29:12 INFO HttpServer: Starting HTTP Server
16/05/08 17:29:12 INFO Utils: Successfully started service 'HTTP file server' on port 58922.
16/05/08 17:29:12 INFO SparkContext: Added JAR /opt/foo.jar at http://***:58922/jars/com.mycode.jar with timestamp 1462728552732
16/05/08 17:29:12 INFO SparkContext: Added JAR /opt/aws-java-sdk-1.10.50.jar at http://***:58922/jars/aws-java-sdk-1.10.50.jar with timestamp 1462728552767
Cluster mode - In cluster mode, Spark selects a leader Worker node to execute the Driver process on. This means the job isn't running directly from the Master node. Here, Spark will not set up an HTTP server. You have to manually make your JARs available to all the worker nodes via HDFS/S3/other sources which are available to all nodes.
Accepted URIs for files
In "Submitting Applications", the Spark documentation does a good job of explaining the accepted prefixes for files:
When using spark-submit, the application jar along with any jars included with the --jars option will be automatically transferred to the cluster. Spark uses the following URL scheme to allow different strategies for disseminating jars:
- file: - Absolute paths and file:/ URIs are served by the driver's HTTP file server, and every executor pulls the file from the driver HTTP server.
- hdfs:, http:, https:, ftp: - these pull down files and JARs from the URI as expected
- local: - a URI starting with local:/ is expected to exist as a local file on each worker node. This means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker, or shared via NFS, GlusterFS, etc.
Note that JARs and files are copied to the working directory for each SparkContext on the executor nodes.
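As a rough illustration of those schemes from application code (the jar paths below are invented for the example):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class AddJarUriSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("addjar-uri-sketch"));
        // Local file on the driver: served by the driver and pulled by each executor.
        sc.addJar("/opt/libs/mylib.jar");
        // Pulled down from HDFS by the executors.
        sc.addJar("hdfs:///libs/shared-lib.jar");
        // Expected to already exist at this path on every worker node; no network IO.
        sc.addJar("local:/opt/libs/preinstalled-lib.jar");
        sc.stop();
    }
}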
As noted, JARs are copied to the working directory for each Worker node. Where exactly is that? It is usually under /var/run/spark/work; you'll see them like this:
drwxr-xr-x 3 spark spark 4096 May 15 06:16 app-20160515061614-0027
drwxr-xr-x 3 spark spark 4096 May 15 07:04 app-20160515070442-0028
drwxr-xr-x 3 spark spark 4096 May 15 07:18 app-20160515071819-0029
drwxr-xr-x 3 spark spark 4096 May 15 07:38 app-20160515073852-0030
drwxr-xr-x 3 spark spark 4096 May 15 08:13 app-20160515081350-0031
drwxr-xr-x 3 spark spark 4096 May 18 17:20 app-20160518172020-0032
drwxr-xr-x 3 spark spark 4096 May 18 17:20 app-20160518172045-0033
And when you look inside, you'll see all the JARs you deployed along:
[*@*]$ cd /var/run/spark/work/app-20160508173423-0014/1/
[*@*]$ ll
total 89988
-rwxr-xr-x 1 spark spark 801117 May 8 17:34 awscala_2.10-0.5.5.jar
-rwxr-xr-x 1 spark spark 29558264 May 8 17:34 aws-java-sdk-1.10.50.jar
-rwxr-xr-x 1 spark spark 59466931 May 8 17:34 com.mycode.code.jar
-rwxr-xr-x 1 spark spark 2308517 May 8 17:34 guava-19.0.jar
-rw-r--r-- 1 spark spark 457 May 8 17:34 stderr
-rw-r--r-- 1 spark spark 0 May 8 17:34 stdout
Affected options:
The most important thing to understand is priority. If you pass any property via code, it will take precedence over any option you specify via spark-submit. This is mentioned in the Spark documentation:
Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf. Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file
So make sure you set those values in the proper places, so you won't be surprised when one takes priority over the other.
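A minimal sketch of that precedence rule: assuming the same property is also passed on the spark-submit command line, the value set directly on the SparkConf in code is the one the application will see.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class PrecedenceSketch {
    public static void main(String[] args) {
        // Even if spark-submit was invoked with --conf spark.executor.memory=2g,
        // a value set directly on the SparkConf takes the highest precedence.
        SparkConf conf = new SparkConf()
                .setAppName("precedence-sketch")
                .set("spark.executor.memory", "4g");
        JavaSparkContext sc = new JavaSparkContext(conf);
        System.out.println(sc.getConf().get("spark.executor.memory")); // prints 4g
        sc.stop();
    }
}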
Let's analyze each option in question:
- --jars vs SparkContext.addJar: These are identical; only one is set through spark-submit and the other via code (see the sketch after this list). Choose the one which suits you better. One important thing to note is that using either of these options does not add the JAR to your driver/executor classpath; you'll need to explicitly add them using the extraClassPath config on both.
- SparkContext.addJar vs SparkContext.addFile: Use the former when you have a dependency that needs to be used with your code. Use the latter when you simply want to pass an arbitrary file around to your worker nodes, which isn't a run-time dependency in your code.
- --conf spark.driver.extraClassPath=... or --driver-class-path: These are aliases; it doesn't matter which one you choose.
- --conf spark.driver.extraLibraryPath=..., or --driver-library-path ...: Same as above, aliases.
- --conf spark.executor.extraClassPath=...: Use this when you have a dependency which can't be included in an uber JAR (for example, because there are compile-time conflicts between library versions) and which you need to load at runtime.
- --conf spark.executor.extraLibraryPath=...: This is passed as the java.library.path option for the JVM. Use this when you need a library path visible to the JVM.
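A minimal sketch of the addJar vs addFile distinction described above (the file names are hypothetical):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class AddJarVsAddFile {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("addjar-vs-addfile"));
        // A code dependency: distributed so its classes can be used by tasks
        // (per the note above, this alone does not set extraClassPath).
        sc.addJar("/opt/libs/my-dependency.jar");
        // An arbitrary file: distributed to the workers, but never a classpath entry;
        // tasks can locate it with SparkFiles.get("lookup-table.csv").
        sc.addFile("/opt/conf/lookup-table.csv");
        sc.stop();
    }
}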
Would it be safe to assume that for simplicity, I can add additional application jar files using the 3 main options at the same time:
You can safely assume this only for Client mode, not Cluster mode, as I've previously explained. Also, the example you gave has some redundant arguments. For example, passing JARs to --driver-library-path is useless; you need to pass them to extraClassPath if you want them to be on your classpath. Ultimately, what you want to do when you deploy external JARs on both the driver and the workers is:
spark-submit --jars additional1.jar,additional2.jar \
--driver-class-path additional1.jar:additional2.jar \
--conf spark.executor.extraClassPath=additional1.jar:additional2.jar \
--class MyClass main-application.jar
Answered by Stanislav
Another approach, in Spark 2.1.0, is to use --conf spark.driver.userClassPathFirst=true during spark-submit, which changes the priority of dependency loading, and thus the behavior of the Spark job, by giving priority to the jars the user adds to the classpath with the --jars option.
Answered by Tanveer
There is a restriction on using --jars: if you want to specify a directory for the location of the jar/xml files, directory expansion is not allowed. This means you need to specify the absolute path for each jar.
If you specify --driver-class-path and you are executing in yarn cluster mode, then the driver's classpath doesn't get updated. You can verify whether the classpath was updated in the Spark UI or the Spark history server, under the Environment tab.
The option which worked for me to pass jars that contain directory expansions, and which worked in yarn cluster mode, was the --conf option. It's better to pass the driver and executor classpaths as --conf, which adds them to the Spark session object itself, and those paths are reflected in the Spark configuration. But please make sure to put the jars on the same path across the cluster.
spark-submit \
--master yarn \
--queue spark_queue \
--deploy-mode cluster \
--num-executors 12 \
--executor-memory 4g \
--driver-memory 8g \
--executor-cores 4 \
--conf spark.ui.enabled=False \
--conf spark.driver.extraClassPath=/usr/hdp/current/hbase-master/lib/hbase-server.jar:/usr/hdp/current/hbase-master/lib/hbase-common.jar:/usr/hdp/current/hbase-master/lib/hbase-client.jar:/usr/hdp/current/hbase-master/lib/zookeeper.jar:/usr/hdp/current/hbase-master/lib/hbase-protocol.jar:/usr/hdp/current/spark2-thriftserver/examples/jars/scopt_2.11-3.3.0.jar:/usr/hdp/current/spark2-thriftserver/examples/jars/spark-examples_2.10-1.1.0.jar:/etc/hbase/conf \
--conf spark.hadoop.mapred.output.dir=/tmp \
--conf spark.executor.extraClassPath=/usr/hdp/current/hbase-master/lib/hbase-server.jar:/usr/hdp/current/hbase-master/lib/hbase-common.jar:/usr/hdp/current/hbase-master/lib/hbase-client.jar:/usr/hdp/current/hbase-master/lib/zookeeper.jar:/usr/hdp/current/hbase-master/lib/hbase-protocol.jar:/usr/hdp/current/spark2-thriftserver/examples/jars/scopt_2.11-3.3.0.jar:/usr/hdp/current/spark2-thriftserver/examples/jars/spark-examples_2.10-1.1.0.jar:/etc/hbase/conf \
--conf spark.hadoop.mapreduce.output.fileoutputformat.outputdir=/tmp
Answered by bala
When we submit Spark jobs using the spark-submit utility, there is a --jars option. Using this option, we can pass jar files to Spark applications.
Answered by DaRkMaN
Other configurable Spark options relating to jars and the classpath, in the case of yarn as the deploy mode, are as follows.
From the Spark documentation:
spark.yarn.jars
List of libraries containing Spark code to distribute to YARN containers. By default, Spark on YARN will use Spark jars installed locally, but the Spark jars can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to jars on HDFS, for example, set this configuration to hdfs:///some/path. Globs are allowed.
spark.yarn.archive
An archive containing needed Spark jars for distribution to the YARN cache. If set, this configuration replaces spark.yarn.jars and the archive is used in all the application's containers. The archive should contain jar files in its root directory. Like with the previous option, the archive can also be hosted on HDFS to speed up file distribution.
Users can configure these parameters to specify their jars, which in turn get included in the Spark driver's classpath.
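For example, either of the two could be supplied in spark-defaults.conf or with --conf; the HDFS paths below are placeholders, and per the documentation spark.yarn.archive, if set, replaces spark.yarn.jars:

spark.yarn.jars      hdfs:///spark/jars/*.jar
# or, alternatively, a single archive with the jars at its root:
spark.yarn.archive   hdfs:///spark/spark-libs.zip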