scala - Joining two DataFrames in Spark SQL and selecting the columns of only one

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/38721218/

Joining two DataFrames in Spark SQL and selecting columns of only one

scala, apache-spark, apache-spark-sql

Asked by Avi

I have two DataFrames in Spark SQL (D1 and D2).

I am trying to inner join both of them with D1.join(D2, "some column") and get back the data of only D1, not the complete data set.

Both D1 and D2 have the same columns.

Could someone please help me with this?

I am using Spark 1.6.

Answered by cheseaux

Let's say you want to join on the "id" column. Then you could write:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._    
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id").select($"d1.*")

Answered by nsanglar

As an alternative, you could also do the following without adding aliases:

d1.join(d2, d1("id") === d2("id"))
  .select(d1.columns.map(c => d1(c)): _*)

Answered by Krzysztof Atłasik

You could use left_semi:

d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id", "left_semi")

A semi-join takes only the rows from the left dataset for which the join condition is met.

There's also another interesting join type: left_anti, which works similarly to left_semi but takes only those rows where the condition is not met.

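A minimal sketch of left_anti, reusing the d1/d2 DataFrames and the id column assumed in the earlier sketch (note that, as far as I know, the left_anti join type requires Spark 2.0 or later, so it may not work on the Spark 1.6 mentioned in the question):

// Anti-join: rows of d1 whose id has NO match in d2; only d1's columns are returned.
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id", "left_anti").show()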