scala — What's the difference between join and cogroup in Apache Spark
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/43960583/
What's the difference between join and cogroup in Apache Spark
Asked by miaoiao
What's the difference between join and cogroup in Apache Spark? What's the use case for each method?
Answered by ashburshui
Let me help you clarify them; both are commonly used and important!
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
This is the prototype of join; please look at it carefully. For example,
val rdd1 = sc.makeRDD(Array(("A","1"),("B","2"),("C","3")),2)
val rdd2 = sc.makeRDD(Array(("A","a"),("C","c"),("D","d")),2)
scala> rdd1.join(rdd2).collect
res0: Array[(String, (String, String))] = Array((A,(1,a)), (C,(3,c)))
All keys that appear in the final result are common to rdd1 and rdd2. This is similar to the relational database operation INNER JOIN.
But cogroup is different:
def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]
as long as a key appears in at least one of the two RDDs, it will appear in the final result. Let me clarify it:
val rdd1 = sc.makeRDD(Array(("A","1"),("B","2"),("C","3")),2)
val rdd2 = sc.makeRDD(Array(("A","a"),("C","c"),("D","d")),2)
scala> rdd1.cogroup(rdd2).collect
res0: Array[(String, (Iterable[String], Iterable[String]))] = Array(
(B,(CompactBuffer(2),CompactBuffer())),
(D,(CompactBuffer(),CompactBuffer(d))),
(A,(CompactBuffer(1),CompactBuffer(a))),
(C,(CompactBuffer(3),CompactBuffer(c)))
)
This is very similar to the relational database operation FULL OUTER JOIN, but instead of flattening the result into one record per line, it gives you the iterable interface; the subsequent operations are up to you, however is convenient!
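If you want a mental model of how the two relate: join can be thought of as cogroup followed by flattening the per-key cross product of the two iterables (Spark's PairRDDFunctions implements join roughly this way on top of cogroup). Below is a minimal pure-Scala sketch of that relationship using plain collections instead of RDDs; the names cogroupLocal and joinLocal are made up for illustration only.

```scala
object CogroupSketch {
  // Emulates cogroup semantics on in-memory Seqs: every key from either
  // side appears in the result, paired with (possibly empty) value groups.
  def cogroupLocal[K, V, W](left: Seq[(K, V)],
                            right: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
    val l = left.groupBy(_._1).map { case (k, pairs) => (k, pairs.map(_._2)) }
    val r = right.groupBy(_._1).map { case (k, pairs) => (k, pairs.map(_._2)) }
    (l.keySet ++ r.keySet).map { k =>
      (k, (l.getOrElse(k, Seq.empty[V]), r.getOrElse(k, Seq.empty[W])))
    }.toMap
  }

  // join = cogroup + cross product per key; keys with an empty side
  // produce no pairs, which is exactly the INNER JOIN behavior above.
  def joinLocal[K, V, W](left: Seq[(K, V)],
                         right: Seq[(K, W)]): Seq[(K, (V, W))] =
    cogroupLocal(left, right).toSeq.flatMap { case (k, (vs, ws)) =>
      for (v <- vs; w <- ws) yield (k, (v, w))
    }
}
```

Running it on the same data as above, cogroupLocal keeps B and D with one empty side, while joinLocal keeps only A and C.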
Good Luck!
The Spark docs are at: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions

