Java: How to set the number of Spark executors?
Disclaimer: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/26168254/
How to set the number of Spark executors?
Asked by Roman Nikitchenko
How can I configure the number of executors from Java (or Scala) code, given a SparkConf and a SparkContext? I constantly see 2 executors. It looks like spark.default.parallelism does not work here and is about something different.
I just need to set the number of executors to be equal to the cluster size, but there are always only 2 of them. I know my cluster size. I run on YARN, if that matters.
Accepted answer by Roman Nikitchenko
OK, got it.
The number of executors is not actually a Spark property itself, but rather a matter of the driver used to place the job on YARN. Since I'm using the SparkSubmit class as the driver, it has the appropriate --num-executors parameter, which is exactly what I need.
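For illustration, a hedged sketch of what such a spark-submit invocation might look like (the class name, jar, and resource values below are placeholders):

spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --num-executors 8 \
    --executor-cores 4 \
    --executor-memory 4g \
    --class com.example.MyApp \
    my-app.jar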
UPDATE:
For some jobs I don't follow the SparkSubmit approach anymore. I mainly cannot use it for applications where the Spark job is only one application component (and is even optional). For these cases I use a spark-defaults.conf attached to the cluster configuration, with the spark.executor.instances property inside it. This approach is much more universal and allows me to balance resources properly depending on the cluster (developer workstation, staging, production).
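As a rough sketch, the relevant spark-defaults.conf entries could look like the following (the values are placeholders to be tuned per environment):

# spark-defaults.conf (whitespace-separated key/value pairs)
spark.executor.instances   8
spark.executor.cores       4
spark.executor.memory      4g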
Answer by A. One
You could also do it programmatically by setting the parameters "spark.executor.instances" and "spark.executor.cores" on the SparkConf object.
Example:
import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
        // request 4 executors in total
        .set("spark.executor.instances", "4")
        // give each executor 5 cores
        .set("spark.executor.cores", "5");
The second parameter is only for YARN and standalone mode. It allows an application to run multiple executors on the same worker, provided that there are enough cores on that worker.
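Continuing the SparkConf example above, a minimal sketch of how the conf might then be used (the master is assumed to be supplied by spark-submit rather than hard-coded):

import org.apache.spark.api.java.JavaSparkContext;

// Create the context from the conf built above; on YARN, spark-submit provides the master URL.
JavaSparkContext sc = new JavaSparkContext(conf);
// Reading the value back confirms it was set on the conf.
System.out.println(sc.getConf().get("spark.executor.instances"));
sc.stop();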
Answer by Ajay Ahuja
In Spark 2.0+
Use the Spark session variable to set the number of executors dynamically (from within the program):
spark.conf.set("spark.executor.instances", 4)
spark.conf.set("spark.executor.cores", 4)
In the above case, at most 16 tasks (4 executors × 4 cores) will be executed at any given time.
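A rough Java equivalent of this idea, assuming the properties are supplied when the SparkSession is built (executor settings are normally read at application startup; the app name below is a placeholder):

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("executor-count-example")         // placeholder application name
        .config("spark.executor.instances", "4")   // request 4 executors
        .config("spark.executor.cores", "4")       // 4 cores per executor
        .getOrCreate();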
The other option is dynamic allocation of executors, as below:
另一个选项是执行程序的动态分配,如下所示 -
spark.conf.set("spark.dynamicAllocation.enabled", "true")
spark.conf.set("spark.executor.cores", 4)
spark.conf.set("spark.dynamicAllocation.minExecutors","1")
spark.conf.set("spark.dynamicAllocation.maxExecutors","5")
This way you can let Spark decide how many executors to allocate based on the processing and memory requirements of the running job.
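A similar hedged Java sketch for the dynamic allocation variant; note that on YARN, dynamic allocation usually also requires the external shuffle service (or, in newer Spark versions, shuffle tracking) to be enabled on the cluster:

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("dynamic-allocation-example")               // placeholder application name
        .config("spark.dynamicAllocation.enabled", "true")   // let Spark scale executors up and down
        .config("spark.dynamicAllocation.minExecutors", "1")
        .config("spark.dynamicAllocation.maxExecutors", "5")
        .config("spark.executor.cores", "4")
        .getOrCreate();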
I feel the second option works better than the first and is widely used.
Hope this will help.
Answer by Bon Ryu
We had a similar problem in my lab running Spark on YARN with data on HDFS, but no matter which of the above solutions I tried, I could not increase the number of Spark executors beyond two.
Turns out the dataset was too small (less than the HDFS block size of 128 MB), and only existed on two of the data nodes (1 master, 7 data nodes in my cluster) due to Hadoop's default data replication heuristic.
Once my lab-mates and I had more files (and larger files) and the data was spread across all nodes, we could set the number of Spark executors and finally see an inverse relationship between --num-executors and time to completion.
Hope this helps someone else in a similar situation.