Understanding parallelism in Spark and Scala
Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/19774860/
Asked by MARK
I have some confusion about parallelism in Spark and Scala. I am running an experiment in which I have to read many (csv) files from the disk, change/process certain columns, and then write them back to the disk.
In my experiments, if I use only SparkContext's parallelize method, it does not seem to have any impact on performance. However, simply using Scala's parallel collections (through par) cuts the time almost in half.
I am running my experiments in localhost mode with the argument local[2] for the Spark context.
My question is: when should I use Scala's parallel collections, and when should I use the Spark context's parallelize?
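For concreteness, here is a minimal sketch of the two approaches being compared, assuming a toy per-file transformation (uppercasing the second CSV column); processFile, the ".out" output naming, and the local[2] master are all made up for the example:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.io.Source
import java.io.PrintWriter

object ParallelismSketch {
  // Hypothetical per-file work: read a CSV, uppercase the second column, write a new file.
  def processFile(path: String): Unit = {
    val src = Source.fromFile(path)
    val out = try {
      src.getLines().map { line =>
        val cols = line.split(",", -1)
        if (cols.length > 1) cols.updated(1, cols(1).toUpperCase).mkString(",") else line
      }.toList
    } finally src.close()
    val pw = new PrintWriter(path + ".out")
    try out.foreach(pw.println) finally pw.close()
  }

  def main(args: Array[String]): Unit = {
    val files: Seq[String] = args.toVector // CSV paths passed on the command line

    // (a) Scala parallel collections: threads inside this one JVM.
    //     (On Scala 2.13+ this needs the scala-parallel-collections module.)
    files.par.foreach(processFile)

    // (b) Spark: turn the list of paths into an RDD and run the same work as tasks.
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("par-vs-spark"))
    try sc.parallelize(files).foreach(processFile)
    finally sc.stop()
  }
}
```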
Answered by samthebest
SparkContext does additional processing in order to support the generality of multiple nodes. This overhead is roughly constant with respect to the data size, so it may be negligible for huge data sets. On one node, that overhead will make it slower than Scala's parallel collections.
Use Spark when
- You have more than 1 node
- You want your job to be ready to scale to multiple nodes
- The Spark overhead on 1 node is negligible because the data is huge, so you might as well choose the richer framework (a sketch of that route follows this list)
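When the data really is big, the more idiomatic route is to let Spark read and partition the input itself instead of parallelizing a list of file paths. A rough sketch, assuming the same toy column transformation as above and illustrative input/output paths (the master would normally come from spark-submit):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkCsvJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("csv-columns"))
    try {
      // textFile accepts a glob, so all the CSVs become one RDD of lines and
      // Spark chooses how to split them into partitions across the cluster.
      val lines = sc.textFile("hdfs:///data/input/*.csv")
      val processed = lines.map { line =>
        val cols = line.split(",", -1)
        if (cols.length > 1) cols.updated(1, cols(1).toUpperCase).mkString(",") else line
      }
      processed.saveAsTextFile("hdfs:///data/output")
    } finally sc.stop()
  }
}
```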
Answered by Utgarda
SparkContext's parallelize may make your collection suitable for processing on multiple nodes, as well as on multiple local cores of your single worker instance (local[2]), but then again, you probably get too much overhead from running Spark's task scheduler and all that magic. Of course, Scala's parallel collections should be faster on a single machine.
http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html#parallelized-collections - are your files big enough to be automatically split into multiple slices? Did you try setting the number of slices manually?
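A small sketch of setting the number of slices by hand; the paths are illustrative, and the second argument to textFile is called minSplits in older releases and minPartitions in newer ones:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SlicesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("slices"))
    try {
      // parallelize: the second argument fixes how many slices the collection is cut into.
      val paths = sc.parallelize(Seq("a.csv", "b.csv", "c.csv", "d.csv"), 8)
      println(s"parallelize slices: ${paths.partitions.length}")

      // textFile: the second argument is a minimum number of partitions for the input.
      val lines = sc.textFile("data/*.csv", 8)
      println(s"textFile partitions: ${lines.partitions.length}")
    } finally sc.stop()
  }
}
```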
Did you try running the same Spark job on a single core and then on two cores?
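One way to make that comparison is to run the same job under local[1] and then local[2] and time it. A sketch with a made-up timeWith helper and a toy CPU-bound job:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoreComparison {
  // Hypothetical helper: run a job under the given master and return the wall time in ms.
  def timeWith(master: String)(job: SparkContext => Unit): Long = {
    val sc = new SparkContext(new SparkConf().setMaster(master).setAppName(s"timing-$master"))
    val start = System.nanoTime()
    try job(sc) finally sc.stop()
    (System.nanoTime() - start) / 1000000
  }

  def main(args: Array[String]): Unit = {
    def job(sc: SparkContext): Unit = {
      // Toy CPU-bound work; count() forces the computation to run.
      sc.parallelize(1 to 2000000).map(i => math.sqrt(i.toDouble)).count()
    }
    println(s"local[1]: ${timeWith("local[1]")(job)} ms")
    println(s"local[2]: ${timeWith("local[2]")(job)} ms")
  }
}
```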
Expect the best results from Spark with one really big, uniformly structured file, not with multiple smaller files.

