Scala: joining two DataFrames in Spark SQL and selecting the columns of only one
Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license, link to the original, and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/38721218/
Joining two DataFrames in Spark SQL and selecting columns of only one
Asked by Avi
I have two DataFrames in Spark SQL (D1 and D2).
I am trying to inner join both of them, D1.join(D2, "some column"), and get back the data of only D1, not the complete data set.
Both D1 and D2 have the same columns.
Could someone please help me with this?
I am using Spark 1.6.
Answered by cheseaux
Let's say you want to join on the "id" column. Then you could write:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id").select($"d1.*")
Answered by nsanglar
As an alternate answer, you could also do the following without adding aliases:
d1.join(d2, d1("id") === d2("id"))
.select(d1.columns.map(c => d1(c)): _*)
Answered by Krzysztof Atłasik
You could use left_semi:
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id", "left_semi")
A semi-join returns only the rows from the left dataset for which the join condition is met.
There's also another interesting join type: left_anti, which works similarly to left_semi but keeps only those rows where the condition is not met.
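To make the difference concrete, here is a small sketch contrasting the two join types; the sample data and the commented results are assumptions for illustration (and note that the left_anti join type may require a newer Spark release than the asker's 1.6):

// Hypothetical data for illustration (requires the implicits import shown above)
val d1 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
val d2 = Seq((1, "x"), (3, "y")).toDF("id", "value")

// left_semi keeps the rows of d1 whose id has a match in d2 (ids 1 and 3)
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id", "left_semi").show()

// left_anti keeps the rows of d1 whose id has no match in d2 (id 2)
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id", "left_anti").show()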

