
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/23193611/

Date: 2020-10-22 06:13:01  Source: igfitidea

Equivalent to left outer join in SPARK

scala, apache-spark

Asked by user3279189

Is there a left outer join equivalent in Spark Scala? I understand there is a join operation which is equivalent to a database inner join.


Answered by MARK

Spark Scala does support left outer joins. Have a look at the JavaPairRDD API here: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.api.java.JavaPairRDD


Usage is quite simple:


rdd1.leftOuterJoin(rdd2)

Answered by Thang Tran

It is as simple as rdd1.leftOuterJoin(rdd2), but you have to make sure both RDDs are in (key, value) form, i.e. every element of each RDD is a pair.

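To illustrate what leftOuterJoin returns, here is a minimal plain-Scala sketch of the same semantics over ordinary sequences (the helper function and sample data are hypothetical, not Spark APIs; Spark's distributed leftOuterJoin on pair RDDs produces values of the same shape, (V, Option[W])):

```scala
// Sketch of left-outer-join semantics on (key, value) pairs.
// Unmatched left keys survive with None; matched keys pair up with Some(...).
def leftOuterJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, Option[W]))] = {
  // Index the right side by key (a key may map to several values).
  val rightByKey: Map[K, Seq[W]] =
    right.groupBy(_._1).map { case (k, pairs) => k -> pairs.map(_._2) }
  left.flatMap { case (k, v) =>
    rightByKey.get(k) match {
      case Some(ws) => ws.map(w => (k, (v, Option(w))))  // one output row per match
      case None     => Seq((k, (v, Option.empty[W])))    // keep unmatched left row
    }
  }
}

// Hypothetical sample data.
val employees = Seq((1, "alice"), (2, "bob"), (3, "carol"))
val depts     = Seq((1, "eng"), (2, "sales"))
val joined    = leftOuterJoin(employees, depts)
// joined: (1,(alice,Some(eng))), (2,(bob,Some(sales))), (3,(carol,None))
```

Note the Option[W] on the right side of each result: that is exactly how Spark signals that a left key had no match.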

Answered by gaganbm

Yes, there is. Have a look at the DStream APIs; they provide left as well as right outer joins.


If you have a stream of some type, let's say 'Record', and you wish to join two streams of records, then you can do it like this:


var res: DStream[(Long, (Record, Option[Record]))] = left.leftOuterJoin(right)

As the APIs say, the left and right streams have to be hash partitioned. That is, you can take some attributes from a Record (or derive it in any other way) to calculate a hash value and convert the stream to a pair DStream. The left and right streams will be of type DStream[(Long, Record)] before you call that join function. (This is just an example; the hash type can be some type other than Long as well.)

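The keying step described above can be sketched in plain Scala (the Record fields, hashKey function, and sample data are all hypothetical; in a real Spark Streaming job the two keyed collections would be DStream[(Long, Record)] and the join would be left.leftOuterJoin(right)):

```scala
// Hypothetical record type and key function.
case class Record(id: String, payload: String)
def hashKey(r: Record): Long = r.id.hashCode.toLong  // derive a Long key from the record

// Key both sides, mirroring the (Long, Record) pair shape the answer describes.
val left  = Seq(Record("a", "left-1"), Record("b", "left-2")).map(r => (hashKey(r), r))
val right = Seq(Record("a", "right-1")).map(r => (hashKey(r), r))

// A local stand-in for leftOuterJoin; assumes unique keys on the right side.
val rightByKey: Map[Long, Record] = right.toMap
val res: Seq[(Long, (Record, Option[Record]))] =
  left.map { case (k, rec) => (k, (rec, rightByKey.get(k))) }
// record "a" pairs with Some(right record); record "b" gets None
```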

Answered by Tagar

Spark SQL / Data Frame API also supports LEFT/RIGHT/FULL outer joins directly:


https://spark.apache.org/docs/latest/sql-programming-guide.html


Because of this bug (https://issues.apache.org/jira/browse/SPARK-11111), outer joins in Spark prior to 1.6 could be very slow (unless the data sets being joined are really small). Before 1.6, Spark used a Cartesian product followed by filtering; now it uses SortMergeJoin instead.
