scala - spark.default.parallelism for Parallelize RDD defaults to 2 for spark submit
Notice: this page is an English rendering of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same CC BY-SA license, link to the original, and attribute it to the original authors (not me): Stack Overflow.
Original URL: http://stackoverflow.com/questions/35384251/
spark.default.parallelism for Parallelize RDD defaults to 2 for spark submit
Asked by Sami
Spark standalone cluster with a master and 2 worker nodes, with 4 CPU cores on each worker: 8 cores total across all workers.
When running the following via spark-submit (spark.default.parallelism is not set):
val myRDD = sc.parallelize(1 to 100000)
println("Partititon size - " + myRDD.partitions.size)
val totl = myRDD.reduce((x, y) => x + y)
println("Sum - " + totl)
It returns a partition size of 2.
When the same code is run in spark-shell connected to the standalone cluster, it returns the correct partition size of 8.
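A quick way to inspect the value in play (a minimal sketch; sc is the same SparkContext as above):

// parallelize() with no explicit partition count falls back to this value
println("defaultParallelism - " + sc.defaultParallelism)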
What can be the reason?
Thanks.
Answered by Joe Widen
spark.default.parallelism defaults to the total number of cores across all machines. The parallelize API has no parent RDD from which to determine the number of partitions, so it uses spark.default.parallelism.
When running spark-submit, you're probably running it locally. Try submitting with spark-submit using the same startup configs as you use for spark-shell.
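For example, a sketch of such a submit (the master URL, class name, and jar name are placeholders; --master and --conf are standard spark-submit flags):

spark-submit \
  --class com.example.MyApp \
  --master spark://your-master-host:7077 \
  --conf spark.default.parallelism=8 \
  myapp.jar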
Pulled this from the documentation:
spark.default.parallelism
For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager:
Local mode: number of cores on the local machine
Mesos fine grained mode: 8
Others: total number of cores on all executor nodes or 2, whichever is larger
Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
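Note that parallelize also accepts an explicit partition count as its second argument, which sidesteps spark.default.parallelism entirely (a minimal sketch):

val myRDD = sc.parallelize(1 to 100000, 8) // force 8 partitions regardless of the default
println("Partition size - " + myRDD.partitions.size) // prints 8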

