scala 什么是导致 Shuffle 的 Spark 转换？

Question

提问by poiuytrez

I have trouble to find in the Spark documentation operations that causes a shuffle and operation that does not. In this list, which ones does cause a shuffle and which ones does not?

我很难在 Spark 文档中找到导致 shuffle 的操作和不会的操作。在这个列表中，哪些会导致洗牌，哪些不会？

Map and filter does not. However, I am not sure with the others.

映射和过滤器没有。但是，我不确定其他人。

map(func)
filter(func)
flatMap(func)
mapPartitions(func)
mapPartitionsWithIndex(func)
sample(withReplacement, fraction, seed)
union(otherDataset)
intersection(otherDataset)
distinct([numTasks]))
groupByKey([numTasks])
reduceByKey(func, [numTasks])
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
sortByKey([ascending], [numTasks])
join(otherDataset, [numTasks])
cogroup(otherDataset, [numTasks])
cartesian(otherDataset)
pipe(command, [envVars])
coalesce(numPartitions)

Answer 1

回答by aaronman

It is actually extremely easy to find this out, without the documentation. For any of these functions just create an RDD and call to debug string, here is one example you can do the rest on ur own.

在没有文档的情况下，实际上很容易找到这一点。对于这些函数中的任何一个，只需创建一个 RDD 并调用调试字符串，这是一个示例，您可以自己完成其余的工作。

scala> val a  = sc.parallelize(Array(1,2,3)).distinct
scala> a.toDebugString
MappedRDD[5] at distinct at <console>:12 (1 partitions)
  MapPartitionsRDD[4] at distinct at <console>:12 (1 partitions)
    **ShuffledRDD[3] at distinct at <console>:12 (1 partitions)**
      MapPartitionsRDD[2] at distinct at <console>:12 (1 partitions)
        MappedRDD[1] at distinct at <console>:12 (1 partitions)
          ParallelCollectionRDD[0] at parallelize at <console>:12 (1 partitions)

So as you can see distinctcreates a shuffle. It is also particularly important to find out this way rather than docs because there are situations where a shuffle will be required or not required for a certain function. For example join usually requires a shuffle but if you join two RDD's that branch from the same RDD spark can sometimes elide the shuffle.

正如你所看到的，distinct创建了一个随机播放。找出这种方式而不是文档也特别重要，因为在某些情况下，某些功能需要或不需要 shuffle。例如，join 通常需要 shuffle，但如果你加入两个 RDD，来自同一个 RDD spark 的分支有时会忽略 shuffle。

Answer 2

回答by ruhong

Here is a list of operations that mightcause a shuffle:

以下是可能导致 shuffle的操作列表：

cogroup

groupWith

join: hash partition

join: 哈希分区

leftOuterJoin: hash partition

leftOuterJoin: 哈希分区

rightOuterJoin: hash partition

rightOuterJoin: 哈希分区

groupByKey: hash partition

groupByKey: 哈希分区

reduceByKey: hash partition

reduceByKey: 哈希分区

combineByKey: hash partition

combineByKey: 哈希分区

sortByKey: range partition

sortByKey: 范围分区

distinct

intersection: hash partition

intersection: 哈希分区

repartition

coalesce

Source: Big Data Analysis with Spark and Scala, Optimizing with Partitions, Coursera

来源：使用 Spark 和 Scala 进行大数据分析，使用分区进行优化，Coursera

Answer 3

回答by Glenn Strycker

This might be helpful: https://spark.apache.org/docs/latest/programming-guide.html#shuffle-operations

这可能会有所帮助：https: //spark.apache.org/docs/latest/programming-guide.html#shuffle-operations

or this: http://www.slideshare.net/SparkSummit/dev-ops-training, starting with slide 208

或者这个： http://www.slideshare.net/SparkSummit/dev-ops-training，从幻灯片 208 开始

from slide 209: "Transformations that use 'numPartitions' like distinct will probably shuffle"

来自幻灯片 209：“使用 'numPartitions' 之类的 distinct 的转换可能会随机播放”

Answer 4

回答by mrsrinivas

Here is the generalised statement on shuffling transformations.

这是关于改组变换的一般性陈述。

Transformations which can cause a shuffle include repartitionoperations like repartitionand coalesce, ‘ByKeyoperations (except for counting) like groupByKeyand reduceByKey, and joinoperations like cogroupand join.

这可能导致洗牌转换包括再分配像操作repartition和coalesce，ByKey”操作，比如（除计数）groupByKey和reduceByKey，并加入操作，如cogroup和 join。

source

来源

scala 什么是导致 Shuffle 的 Spark 转换？

提问by poiuytrez

回答by aaronman

回答by ruhong

回答by Glenn Strycker

回答by mrsrinivas

相关推荐

最近更新

标签

scala 什么是导致 Shuffle 的 Spark 转换？

提问by poiuytrez

回答by aaronman

回答by ruhong

回答by Glenn Strycker

回答by mrsrinivas

相关推荐

在 Intellij Scala 插件中显示推断类型

scala.util.Try：如何获得 Throwable 值？模式匹配？

IntelliJ Scala 配置问题

Scala 中的高效字符串连接

相关推荐

最近更新

标签