scala - spark.default.parallelism for Parallelize RDD defaults to 2 for spark submit
Notice: this page is an English rendering of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same CC BY-SA license, link to the original, and attribute it to the original authors (not me): Stack Overflow.
Original URL: http://stackoverflow.com/questions/35384251/
spark.default.parallelism for Parallelize RDD defaults to 2 for spark submit
Asked by Sami
Spark standalone cluster with a master and 2 worker nodes, with 4 CPU cores on each worker: 8 cores total across all workers.
When running the following via spark-submit (spark.default.parallelism is not set):
val myRDD = sc.parallelize(1 to 100000)
println("Partititon size - " + myRDD.partitions.size)
val totl = myRDD.reduce((x, y) => x + y)
println("Sum - " + totl)
It returns a partition size of 2.
When the same code is run in spark-shell connected to the standalone cluster, it returns the correct partition size of 8.
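A quick way to inspect the value in play (a minimal sketch; sc is the same SparkContext as above):

// parallelize() with no explicit partition count falls back to this value
println("defaultParallelism - " + sc.defaultParallelism)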
What can be the reason?
Thanks.
Answered by Joe Widen
spark.default.parallelism defaults to the total number of cores across all machines. The parallelize API has no parent RDD from which to determine the number of partitions, so it uses spark.default.parallelism.
When running spark-submit, you're probably running it locally. Try submitting with spark-submit using the same startup configs as you use for spark-shell.
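For example, a sketch of such a submit (the master URL, class name, and jar name are placeholders; --master and --conf are standard spark-submit flags):

spark-submit \
  --class com.example.MyApp \
  --master spark://your-master-host:7077 \
  --conf spark.default.parallelism=8 \
  myapp.jar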
Pulled this from the documentation:
spark.default.parallelism
For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager:
Local mode: number of cores on the local machine
Mesos fine grained mode: 8
Others: total number of cores on all executor nodes or 2, whichever is larger
Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
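Note that parallelize also accepts an explicit partition count as its second argument, which sidesteps spark.default.parallelism entirely (a minimal sketch):

val myRDD = sc.parallelize(1 to 100000, 8) // force 8 partitions regardless of the default
println("Partition size - " + myRDD.partitions.size) // prints 8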

