Equivalent to left outer join in SPARK
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/23193611/
Asked by user3279189
Is there a left outer join equivalent in Spark Scala? I understand there is a join operation which is equivalent to a database inner join.
Answered by MARK
Spark Scala does support left outer joins. Have a look at the JavaPairRDD API here: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.api.java.JavaPairRDD
Usage is quite simple:
rdd1.leftOuterJoin(rdd2)
Answered by Thang Tran
It is as simple as rdd1.leftOuterJoin(rdd2), but you have to make sure both RDDs are pair RDDs, i.e. each element is of the form (key, value).
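A minimal runnable sketch of what that looks like (the sample keys and values below are made up for illustration, not part of the original answers):

    import org.apache.spark.{SparkConf, SparkContext}

    object LeftOuterJoinExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("left-outer-join-sketch").setMaster("local[*]"))

        // Both RDDs are pair RDDs: each element is (key, value).
        val orders   = sc.parallelize(Seq((1, "order-a"), (2, "order-b"), (3, "order-c")))
        val payments = sc.parallelize(Seq((1, 10.0), (3, 30.0)))

        // Result type: RDD[(Int, (String, Option[Double]))].
        // Keys with no match on the right side come back with None.
        val joined = orders.leftOuterJoin(payments)

        joined.collect().foreach(println)
        // e.g. (1,(order-a,Some(10.0))), (2,(order-b,None)), (3,(order-c,Some(30.0)))

        sc.stop()
      }
    }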
Answered by gaganbm
Yes, there is. Have a look at the DStream APIs; they provide left as well as right outer joins.
If you have a stream of, say, type 'Record', and you wish to join two streams of records, then you can do it like this:
var res: DStream[(Long, (Record, Option[Record]))] = left.leftOuterJoin(right)
As the APIs say, the left and right streams have to be hash partitioned. That is, you can take some attributes from a Record (or compute a key in any other way) to calculate a hash value and convert each stream to a pair DStream. The left and right streams will be of type DStream[(Long, Record)] before you call that join function. (This is just an example; the key type can be something other than Long as well.)
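A short sketch of that keying step, assuming a hypothetical Record case class keyed by a Long id (the names here are illustrative, not from the original answer):

    import org.apache.spark.streaming.dstream.DStream

    case class Record(id: Long, payload: String)

    def joinRecordStreams(left: DStream[Record], right: DStream[Record]): DStream[(Long, (Record, Option[Record]))] = {
      // Turn each stream into a pair DStream keyed by the chosen hash/key value.
      // (On Spark < 1.3 you would also need: import org.apache.spark.streaming.StreamingContext._)
      val leftKeyed: DStream[(Long, Record)]  = left.map(r => (r.id, r))
      val rightKeyed: DStream[(Long, Record)] = right.map(r => (r.id, r))

      // Records with no match in the right stream come back with None on the right.
      leftKeyed.leftOuterJoin(rightKeyed)
    }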
Answered by Tagar
Spark SQL / DataFrame API also supports LEFT/RIGHT/FULL outer joins directly:
https://spark.apache.org/docs/latest/sql-programming-guide.html
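For example, a left outer join in the DataFrame API looks roughly like this (a sketch with made-up column names, using the SparkSession entry point from Spark 2.x+):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("df-left-outer-join").master("local[*]").getOrCreate()
    import spark.implicits._

    val customers = Seq((1, "Alice"), (2, "Bob")).toDF("id", "name")
    val orders    = Seq((1, "book")).toDF("customer_id", "item")

    // The third argument selects the join type: "left_outer" (or simply "left").
    val joined = customers.join(orders, customers("id") === orders("customer_id"), "left_outer")
    joined.show()
    // Bob has no matching order, so his order columns come back as null.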
Because of this bug: https://issues.apache.org/jira/browse/SPARK-11111 outer joins in Spark prior to 1.6 could be very slow (unless you have really small data sets to join). Before 1.6 a cartesian product followed by filtering was used; now SortMergeJoin is used instead.

