scala - How to obtain the symmetric difference between two DataFrames?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must likewise follow the CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/36199901/

How to obtain the symmetric difference between two DataFrames?

scala, apache-spark, apache-spark-sql

Asked by WillD

In the Spark SQL 1.6 API (Scala), DataFrame has functions for intersect and except, but not one for symmetric difference. Obviously, a combination of union and except can be used to generate the difference:


df1.except(df2).union(df2.except(df1))

But this seems a bit awkward. In my experience, if something seems awkward, there's a better way to do it, especially in Scala.

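A tiny sketch of what this combination yields (made-up data; assumes a SparkSession with spark.implicits._ in scope):

val df1 = Seq(1, 2, 3).toDF("x")
val df2 = Seq(3, 4).toDF("x")
// Symmetric difference: rows present in exactly one of the two DataFrames.
df1.except(df2).union(df2.except(df1)).show()
// Returns the rows 1, 2 and 4 (in no particular order).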

Answered by zero323

You can always rewrite it as:


df1.unionAll(df2).except(df1.intersect(df2))

Seriously though, UNION, INTERSECT, and EXCEPT / MINUS are pretty much a standard set of SQL combining operators. I am not aware of any system which provides an XOR-like operation out of the box. Most likely that is because it is trivial to implement using the other three, and there is not much to optimize there.

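A quick sanity check that the two formulations agree (a sketch; df1 and df2 are assumed to share a schema, and unionAll is the Spark 1.6 name that later became union):

val viaExcept    = df1.except(df2).union(df2.except(df1))
val viaIntersect = df1.unionAll(df2).except(df1.intersect(df2))
// Both results are de-duplicated, so mutual EXCEPT emptiness means equality.
assert(viaExcept.except(viaIntersect).count() == 0)
assert(viaIntersect.except(viaExcept).count() == 0)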

Answered by Tal Barda

Why not the below?


df1.except(df2)

Answered by Tagar

Notice that EXCEPT (or MINUS, which is just an alias for EXCEPT) de-duplicates results. So if you expect the "except" set (the diff you mentioned) plus the "intersect" set to equal the original dataframe, consider this feature request that keeps duplicates:


https://issues.apache.org/jira/browse/SPARK-21274

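To see the de-duplication concretely, a tiny sketch (made-up values; assumes spark.implicits._ in scope):

val a = Seq(1, 1, 2).toDF("x")
val b = Seq(2).toDF("x")
// except is EXCEPT DISTINCT: the duplicated 1 survives only once.
a.except(b).show()  // one row containing 1, not two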

As I wrote there, "EXCEPT ALL" can be rewritten in Spark SQL as:


SELECT t1.a, t1.b, t1.c
FROM tab1 t1
     LEFT OUTER JOIN tab2 t2
     ON (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
WHERE COALESCE(t2.a, t2.b, t2.c) IS NULL
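For reference, the feature request above was eventually implemented, so on Spark 2.4+ the same thing can be written directly through the DataFrame API (a sketch, assuming both DataFrames share a schema):

// exceptAll keeps duplicates (available since Spark 2.4).
val diffAll = df1.exceptAll(df2).union(df2.exceptAll(df1))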

Answered by J. Salmoral

I think it could be more efficient to use a left join and then filter out the nulls.


import org.apache.spark.sql.functions.col

df1.join(df2, Seq("some_join_key", "some_other_join_key"), "left")
  .where(col("column_just_present_in_df2").isNull)
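A related sketch: Spark's built-in left anti join expresses the same filter in one step, and two of them unioned give the symmetric difference (column names here are placeholders; assumes matching schemas):

val keys = Seq("some_join_key", "some_other_join_key")
// left_anti keeps only the rows of the left side with no match on the right.
val onlyInDf1 = df1.join(df2, keys, "left_anti")
val onlyInDf2 = df2.join(df1, keys, "left_anti")
val symmetricDiff = onlyInDf1.union(onlyInDf2)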

Answered by Aaron

If you are looking for a PySpark solution, you should use subtract() (see the docs).


Also, unionAll is deprecated in Spark 2.0; use union() instead.


df1.union(df2).subtract(df1.intersect(df2))
