scala - Joining two DataFrames in Spark SQL and selecting the columns of only one

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/38721218/

Joining two DataFrames in Spark SQL and selecting columns of only one

scala, apache-spark, apache-spark-sql

Asked by Avi

I have two DataFrames in Spark SQL (D1 and D2).

I am trying to inner join both of them with D1.join(D2, "some column") and get back the data of only D1, not the complete data set.

Both D1 and D2 have the same columns.

Could someone please help me with this?

I am using Spark 1.6.

Answered by cheseaux

Let's say you want to join on the "id" column. Then you could write:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._    
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id").select($"d1.*")

Answered by nsanglar

As an alternative, you could also do the following without adding aliases:

d1.join(d2, d1("id") === d2("id"))
  .select(d1.columns.map(c => d1(c)): _*)

Answered by Krzysztof Atłasik

You could use left_semi:

d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id", "left_semi")

A semi-join takes only the rows from the left dataset for which the join condition is met.

There's also another interesting join type: left_anti, which works similarly to left_semi but takes only those rows where the condition is not met.

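A minimal sketch of left_anti, reusing the d1/d2 DataFrames and the id column assumed in the earlier sketch (note that, as far as I know, the left_anti join type requires Spark 2.0 or later, so it may not work on the Spark 1.6 mentioned in the question):

// Anti-join: rows of d1 whose id has NO match in d2; only d1's columns are returned.
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id", "left_anti").show()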