scala — What's the difference between join and cogroup in Apache Spark
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/43960583/
What's the difference between join and cogroup in Apache Spark
Asked by miaoiao
What's the difference between join and cogroup in Apache Spark? What's the use case for each method?
Answered by ashburshui
Let me help you clarify them; both are commonly used and important!
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
This is the prototype of join; please look at it carefully. For example,
val rdd1 = sc.makeRDD(Array(("A","1"),("B","2"),("C","3")),2)
val rdd2 = sc.makeRDD(Array(("A","a"),("C","c"),("D","d")),2)
scala> rdd1.join(rdd2).collect
res0: Array[(String, (String, String))] = Array((A,(1,a)), (C,(3,c)))
All keys that appear in the final result are common to rdd1 and rdd2. This is similar to the relational database operation INNER JOIN.
But cogroup is different:
def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]
as long as a key appears in at least one of the two RDDs, it will appear in the final result. Let me clarify it:
val rdd1 = sc.makeRDD(Array(("A","1"),("B","2"),("C","3")),2)
val rdd2 = sc.makeRDD(Array(("A","a"),("C","c"),("D","d")),2)
scala> rdd1.cogroup(rdd2).collect
res0: Array[(String, (Iterable[String], Iterable[String]))] = Array(
(B,(CompactBuffer(2),CompactBuffer())),
(D,(CompactBuffer(),CompactBuffer(d))),
(A,(CompactBuffer(1),CompactBuffer(a))),
(C,(CompactBuffer(3),CompactBuffer(c)))
)
This is very similar to the relational database operation FULL OUTER JOIN, but instead of flattening the result into one record per line, it gives you the iterable interface; the subsequent operations are up to you, however is convenient!
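If you want a mental model of how the two relate: join can be thought of as cogroup followed by flattening the per-key cross product of the two iterables (Spark's PairRDDFunctions implements join roughly this way on top of cogroup). Below is a minimal pure-Scala sketch of that relationship using plain collections instead of RDDs; the names cogroupLocal and joinLocal are made up for illustration only.

```scala
object CogroupSketch {
  // Emulates cogroup semantics on in-memory Seqs: every key from either
  // side appears in the result, paired with (possibly empty) value groups.
  def cogroupLocal[K, V, W](left: Seq[(K, V)],
                            right: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
    val l = left.groupBy(_._1).map { case (k, pairs) => (k, pairs.map(_._2)) }
    val r = right.groupBy(_._1).map { case (k, pairs) => (k, pairs.map(_._2)) }
    (l.keySet ++ r.keySet).map { k =>
      (k, (l.getOrElse(k, Seq.empty[V]), r.getOrElse(k, Seq.empty[W])))
    }.toMap
  }

  // join = cogroup + cross product per key; keys with an empty side
  // produce no pairs, which is exactly the INNER JOIN behavior above.
  def joinLocal[K, V, W](left: Seq[(K, V)],
                         right: Seq[(K, W)]): Seq[(K, (V, W))] =
    cogroupLocal(left, right).toSeq.flatMap { case (k, (vs, ws)) =>
      for (v <- vs; w <- ws) yield (k, (v, w))
    }
}
```

Running it on the same data as above, cogroupLocal keeps B and D with one empty side, while joinLocal keeps only A and C.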
Good Luck!
The Spark docs are at: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions

