Java: How to set the number of Spark executors?
Disclaimer: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/26168254/
How to set the number of Spark executors?
Asked by Roman Nikitchenko
How can I configure the number of executors from Java (or Scala) code, given a SparkConf and a SparkContext? I constantly see 2 executors. It looks like spark.default.parallelism does not work here and is about something different.
I just need to set the number of executors to be equal to the cluster size, but there are always only 2 of them. I know my cluster size. I run on YARN, if that matters.
Accepted answer by Roman Nikitchenko
OK, got it.
The number of executors is not actually a Spark property itself, but rather a matter of the driver used to place the job on YARN. Since I'm using the SparkSubmit class as the driver, it has the appropriate --num-executors parameter, which is exactly what I need.
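For illustration, a hedged sketch of what such a spark-submit invocation might look like (the class name, jar, and resource values below are placeholders):

spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --num-executors 8 \
    --executor-cores 4 \
    --executor-memory 4g \
    --class com.example.MyApp \
    my-app.jar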
UPDATE:
For some jobs I don't follow the SparkSubmit approach anymore. I mainly cannot use it for applications where the Spark job is only one application component (and is even optional). For these cases I use a spark-defaults.conf attached to the cluster configuration, with the spark.executor.instances property inside it. This approach is much more universal and allows me to balance resources properly depending on the cluster (developer workstation, staging, production).
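As a rough sketch, the relevant spark-defaults.conf entries could look like the following (the values are placeholders to be tuned per environment):

# spark-defaults.conf (whitespace-separated key/value pairs)
spark.executor.instances   8
spark.executor.cores       4
spark.executor.memory      4g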
Answer by A. One
You could also do it programmatically by setting the parameters "spark.executor.instances" and "spark.executor.cores" on the SparkConf object.
Example:
import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
        // request 4 executors in total
        .set("spark.executor.instances", "4")
        // give each executor 5 cores
        .set("spark.executor.cores", "5");
The second parameter is only for YARN and standalone mode. It allows an application to run multiple executors on the same worker, provided that there are enough cores on that worker.
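Continuing the SparkConf example above, a minimal sketch of how the conf might then be used (the master is assumed to be supplied by spark-submit rather than hard-coded):

import org.apache.spark.api.java.JavaSparkContext;

// Create the context from the conf built above; on YARN, spark-submit provides the master URL.
JavaSparkContext sc = new JavaSparkContext(conf);
// Reading the value back confirms it was set on the conf.
System.out.println(sc.getConf().get("spark.executor.instances"));
sc.stop();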
Answer by Ajay Ahuja
In Spark 2.0+
Use the Spark session variable to set the number of executors dynamically (from within the program):
spark.conf.set("spark.executor.instances", 4)
spark.conf.set("spark.executor.cores", 4)
In the above case, at most 16 tasks (4 executors × 4 cores) will be executed at any given time.
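A rough Java equivalent of this idea, assuming the properties are supplied when the SparkSession is built (executor settings are normally read at application startup; the app name below is a placeholder):

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("executor-count-example")         // placeholder application name
        .config("spark.executor.instances", "4")   // request 4 executors
        .config("spark.executor.cores", "4")       // 4 cores per executor
        .getOrCreate();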
The other option is dynamic allocation of executors, as below:
另一个选项是执行程序的动态分配,如下所示 -
spark.conf.set("spark.dynamicAllocation.enabled", "true")
spark.conf.set("spark.executor.cores", 4)
spark.conf.set("spark.dynamicAllocation.minExecutors","1")
spark.conf.set("spark.dynamicAllocation.maxExecutors","5")
This way you can let Spark decide how many executors to allocate based on the processing and memory requirements of the running job.
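A similar hedged Java sketch for the dynamic allocation variant; note that on YARN, dynamic allocation usually also requires the external shuffle service (or, in newer Spark versions, shuffle tracking) to be enabled on the cluster:

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("dynamic-allocation-example")               // placeholder application name
        .config("spark.dynamicAllocation.enabled", "true")   // let Spark scale executors up and down
        .config("spark.dynamicAllocation.minExecutors", "1")
        .config("spark.dynamicAllocation.maxExecutors", "5")
        .config("spark.executor.cores", "4")
        .getOrCreate();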
I feel the second option works better than the first and is widely used.
Hope this will help.
Answer by Bon Ryu
We had a similar problem in my lab running Spark on YARN with data on HDFS, but no matter which of the above solutions I tried, I could not increase the number of Spark executors beyond two.
Turns out the dataset was too small (less than the HDFS block size of 128 MB), and only existed on two of the data nodes (1 master, 7 data nodes in my cluster) due to Hadoop's default data replication heuristic.
Once my lab-mates and I had more files (and larger files) and the data was spread across all nodes, we could set the number of Spark executors and finally see an inverse relationship between --num-executors and time to completion.
Hope this helps someone else in a similar situation.