Understanding parallelism in Spark and Scala
Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/19774860/
Asked by MARK
I have some confusion about parallelism in Spark and Scala. I am running an experiment in which I have to read many (csv) files from the disk, change/process certain columns, and then write them back to the disk.
In my experiments, if I use only SparkContext's parallelize method, it does not seem to have any impact on performance. However, simply using Scala's parallel collections (through par) cuts the time almost in half.
I am running my experiments in localhost mode with the argument local[2] for the Spark context.
My question is: when should I use Scala's parallel collections, and when should I use the Spark context's parallelize?
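For concreteness, here is a minimal sketch of the two approaches being compared, assuming a toy per-file transformation (uppercasing the second CSV column); processFile, the ".out" output naming, and the local[2] master are all made up for the example:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.io.Source
import java.io.PrintWriter

object ParallelismSketch {
  // Hypothetical per-file work: read a CSV, uppercase the second column, write a new file.
  def processFile(path: String): Unit = {
    val src = Source.fromFile(path)
    val out = try {
      src.getLines().map { line =>
        val cols = line.split(",", -1)
        if (cols.length > 1) cols.updated(1, cols(1).toUpperCase).mkString(",") else line
      }.toList
    } finally src.close()
    val pw = new PrintWriter(path + ".out")
    try out.foreach(pw.println) finally pw.close()
  }

  def main(args: Array[String]): Unit = {
    val files: Seq[String] = args.toVector // CSV paths passed on the command line

    // (a) Scala parallel collections: threads inside this one JVM.
    //     (On Scala 2.13+ this needs the scala-parallel-collections module.)
    files.par.foreach(processFile)

    // (b) Spark: turn the list of paths into an RDD and run the same work as tasks.
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("par-vs-spark"))
    try sc.parallelize(files).foreach(processFile)
    finally sc.stop()
  }
}
```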
Answered by samthebest
SparkContext does additional processing in order to support the generality of multiple nodes. This overhead is roughly constant with respect to the data size, so it may be negligible for huge data sets. On one node, that overhead will make it slower than Scala's parallel collections.
Use Spark when
- You have more than 1 node
- You want your job to be ready to scale to multiple nodes
- The Spark overhead on 1 node is negligible because the data is huge, so you might as well choose the richer framework (a sketch of that route follows this list)
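When the data really is big, the more idiomatic route is to let Spark read and partition the input itself instead of parallelizing a list of file paths. A rough sketch, assuming the same toy column transformation as above and illustrative input/output paths (the master would normally come from spark-submit):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkCsvJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("csv-columns"))
    try {
      // textFile accepts a glob, so all the CSVs become one RDD of lines and
      // Spark chooses how to split them into partitions across the cluster.
      val lines = sc.textFile("hdfs:///data/input/*.csv")
      val processed = lines.map { line =>
        val cols = line.split(",", -1)
        if (cols.length > 1) cols.updated(1, cols(1).toUpperCase).mkString(",") else line
      }
      processed.saveAsTextFile("hdfs:///data/output")
    } finally sc.stop()
  }
}
```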
Answered by Utgarda
SparkContext's parallelize may make your collection suitable for processing on multiple nodes, as well as on multiple local cores of your single worker instance (local[2]), but then again, you probably get too much overhead from running Spark's task scheduler and all that magic. Of course, Scala's parallel collections should be faster on a single machine.
http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html#parallelized-collections - are your files big enough to be automatically split into multiple slices? Did you try setting the number of slices manually?
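A small sketch of setting the number of slices by hand; the paths are illustrative, and the second argument to textFile is called minSplits in older releases and minPartitions in newer ones:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SlicesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("slices"))
    try {
      // parallelize: the second argument fixes how many slices the collection is cut into.
      val paths = sc.parallelize(Seq("a.csv", "b.csv", "c.csv", "d.csv"), 8)
      println(s"parallelize slices: ${paths.partitions.length}")

      // textFile: the second argument is a minimum number of partitions for the input.
      val lines = sc.textFile("data/*.csv", 8)
      println(s"textFile partitions: ${lines.partitions.length}")
    } finally sc.stop()
  }
}
```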
Did you try running the same Spark job on a single core and then on two cores?
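One way to make that comparison is to run the same job under local[1] and then local[2] and time it. A sketch with a made-up timeWith helper and a toy CPU-bound job:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoreComparison {
  // Hypothetical helper: run a job under the given master and return the wall time in ms.
  def timeWith(master: String)(job: SparkContext => Unit): Long = {
    val sc = new SparkContext(new SparkConf().setMaster(master).setAppName(s"timing-$master"))
    val start = System.nanoTime()
    try job(sc) finally sc.stop()
    (System.nanoTime() - start) / 1000000
  }

  def main(args: Array[String]): Unit = {
    def job(sc: SparkContext): Unit = {
      // Toy CPU-bound work; count() forces the computation to run.
      sc.parallelize(1 to 2000000).map(i => math.sqrt(i.toDouble)).count()
    }
    println(s"local[1]: ${timeWith("local[1]")(job)} ms")
    println(s"local[2]: ${timeWith("local[2]")(job)} ms")
  }
}
```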
Expect the best results from Spark with one really big, uniformly structured file, not with multiple smaller files.

