Scala: Conditional Join in Spark DataFrame
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/39417209/
Conditional Join in Spark DataFrame
Asked by Avijit
I am trying to join two DataFrames with a condition.
I have two DataFrames, A and B.
A contains the columns id, m_cd, and c_cd; B contains the columns m_cd, c_cd, and record.
The conditions are:
- If m_cd is null, then join A's c_cd with B's c_cd
- If m_cd is not null, then join A's m_cd with B's m_cd
We can use when and otherwise() in the withColumn() method of a DataFrame, so is there any way to do the same thing for a DataFrame join?
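For context, this is the withColumn() pattern referred to above, as a minimal sketch with hypothetical data and column names (assuming a SparkSession named spark):
import spark.implicits._
import org.apache.spark.sql.functions.when

// Hypothetical data: pick c_cd as the join key when m_cd is null, otherwise m_cd
val df = Seq((Some(1), 2), (None: Option[Int], 4)).toDF("m_cd", "c_cd")
df.withColumn("join_key", when($"m_cd".isNull, $"c_cd").otherwise($"m_cd")).show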
I have already done this using a union, but I wanted to know whether any other option is available.
Answered by alghimo
You can use when / otherwise in the join condition:
// Assumes a SparkSession named `spark` (e.g. in spark-shell)
import spark.implicits._
import org.apache.spark.sql.functions.when

case class Foo(m_cd: Option[Int], c_cd: Option[Int])

val dfA = spark.createDataset(Array(
  Foo(Some(1), Some(2)),
  Foo(Some(2), Some(3)),
  Foo(None: Option[Int], Some(4))
))

val dfB = spark.createDataset(Array(
  Foo(Some(1), Some(5)),
  Foo(Some(2), Some(6)),
  Foo(Some(10), Some(4))
))

// Join on c_cd when m_cd is null, otherwise on m_cd
val joinCondition = when($"a.m_cd".isNull, $"a.c_cd" === $"b.c_cd")
  .otherwise($"a.m_cd" === $"b.m_cd")

dfA.as('a).join(dfB.as('b), joinCondition).show
It might still be more readable to use the union, though.
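For comparison, here is one way the union-based version could look (a sketch reusing dfA and dfB from above, not the asker's original code): split A on whether m_cd is null, join each part on the appropriate key, and union the results.
// Rows of A with null m_cd join on c_cd; the rest join on m_cd
val aByC = dfA.filter($"m_cd".isNull).as("a")
  .join(dfB.as("b"), $"a.c_cd" === $"b.c_cd")
val aByM = dfA.filter($"m_cd".isNotNull).as("a")
  .join(dfB.as("b"), $"a.m_cd" === $"b.m_cd")
// Stitch the two partial results back together
aByC.union(aByM).show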

