scala 什么是导致 Shuffle 的 Spark 转换?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/26273664/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-10-22 06:37:03  来源:igfitidea点击:

What are the Spark transformations that causes a Shuffle?

javapythonscalaapache-spark

提问by poiuytrez

I have trouble to find in the Spark documentation operations that causes a shuffle and operation that does not. In this list, which ones does cause a shuffle and which ones does not?

我很难在 Spark 文档中找到导致 shuffle 的操作和不会的操作。在这个列表中,哪些会导致洗牌,哪些不会?

Map and filter does not. However, I am not sure with the others.

映射和过滤器没有。但是,我不确定其他人。

map(func)
filter(func)
flatMap(func)
mapPartitions(func)
mapPartitionsWithIndex(func)
sample(withReplacement, fraction, seed)
union(otherDataset)
intersection(otherDataset)
distinct([numTasks]))
groupByKey([numTasks])
reduceByKey(func, [numTasks])
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
sortByKey([ascending], [numTasks])
join(otherDataset, [numTasks])
cogroup(otherDataset, [numTasks])
cartesian(otherDataset)
pipe(command, [envVars])
coalesce(numPartitions)

回答by aaronman

It is actually extremely easy to find this out, without the documentation. For any of these functions just create an RDD and call to debug string, here is one example you can do the rest on ur own.

在没有文档的情况下,实际上很容易找到这一点。对于这些函数中的任何一个,只需创建一个 RDD 并调用调试字符串,这是一个示例,您可以自己完成其余的工作。

scala> val a  = sc.parallelize(Array(1,2,3)).distinct
scala> a.toDebugString
MappedRDD[5] at distinct at <console>:12 (1 partitions)
  MapPartitionsRDD[4] at distinct at <console>:12 (1 partitions)
    **ShuffledRDD[3] at distinct at <console>:12 (1 partitions)**
      MapPartitionsRDD[2] at distinct at <console>:12 (1 partitions)
        MappedRDD[1] at distinct at <console>:12 (1 partitions)
          ParallelCollectionRDD[0] at parallelize at <console>:12 (1 partitions)

So as you can see distinctcreates a shuffle. It is also particularly important to find out this way rather than docs because there are situations where a shuffle will be required or not required for a certain function. For example join usually requires a shuffle but if you join two RDD's that branch from the same RDD spark can sometimes elide the shuffle.

正如你所看到的,distinct创建了一个随机播放。找出这种方式而不是文档也特别重要,因为在某些情况下,某些功能需要或不需要 shuffle。例如,join 通常需要 shuffle,但如果你加入两个 RDD,来自同一个 RDD spark 的分支有时会忽略 shuffle。

回答by ruhong

Here is a list of operations that mightcause a shuffle:

以下是可能导致 shuffle的操作列表:

cogroup

cogroup

groupWith

groupWith

join: hash partition

join: 哈希分区

leftOuterJoin: hash partition

leftOuterJoin: 哈希分区

rightOuterJoin: hash partition

rightOuterJoin: 哈希分区

groupByKey: hash partition

groupByKey: 哈希分区

reduceByKey: hash partition

reduceByKey: 哈希分区

combineByKey: hash partition

combineByKey: 哈希分区

sortByKey: range partition

sortByKey: 范围分区

distinct

distinct

intersection: hash partition

intersection: 哈希分区

repartition

repartition

coalesce

coalesce

Source: Big Data Analysis with Spark and Scala, Optimizing with Partitions, Coursera

来源:使用 Spark 和 Scala 进行大数据分析,使用分区进行优化,Coursera

回答by Glenn Strycker

This might be helpful: https://spark.apache.org/docs/latest/programming-guide.html#shuffle-operations

这可能会有所帮助:https: //spark.apache.org/docs/latest/programming-guide.html#shuffle-operations

or this: http://www.slideshare.net/SparkSummit/dev-ops-training, starting with slide 208

或者这个: http://www.slideshare.net/SparkSummit/dev-ops-training,从幻灯片 208 开始

from slide 209: "Transformations that use 'numPartitions' like distinct will probably shuffle"

来自幻灯片 209:“使用 'numPartitions' 之类的 distinct 的转换可能会随机播放”

回答by mrsrinivas

Here is the generalised statement on shuffling transformations.

这是关于改组变换的一般性陈述。

Transformations which can cause a shuffle include repartitionoperations like repartitionand coalesce, ‘ByKeyoperations (except for counting) like groupByKeyand reduceByKey, and joinoperations like cogroupand join.

这可能导致洗牌转换包括再分配像操作repartitioncoalesceByKey”操作,比如(除计数)groupByKeyreduceByKey,并加入操作,如cogroupjoin

source

来源