How to obtain the symmetric difference between two DataFrames in Scala?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license, note the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/36199901/
Asked by WillD
In the Spark SQL 1.6 API (Scala), DataFrame has functions for intersect and except, but not one for symmetric difference. Obviously, a combination of union and except can be used to generate the difference:
df1.except(df2).union(df2.except(df1))
But this seems a bit awkward. In my experience, if something seems awkward, there's a better way to do it, especially in Scala.
Answered by zero323
You can always rewrite it as:
df1.unionAll(df2).except(df1.intersect(df2))
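For a concrete end-to-end illustration, here is a minimal, self-contained sketch of this approach (assuming Spark 2.x, where union replaces the deprecated 1.6-era unionAll; the sample data and the symDiff name are made up for the example):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("symdiff").getOrCreate()
import spark.implicits._

val df1 = Seq(1, 2, 3, 4).toDF("id")
val df2 = Seq(3, 4, 5, 6).toDF("id")

// Symmetric difference: rows that appear in exactly one of the two DataFrames.
val symDiff = df1.union(df2).except(df1.intersect(df2))
symDiff.show()  // 1, 2, 5, 6 (in some order)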
Seriously though, UNION, INTERSECT, and EXCEPT / MINUS are pretty much the standard set of SQL combining operators. I am not aware of any system which provides an XOR-like operation out of the box. Most likely that is because it is trivial to implement using the other three, and there is not much to optimize there.
Answered by Tal Barda
Why not the below?
df1.except(df2)
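Reusing the hypothetical df1 = {1, 2, 3, 4} and df2 = {3, 4, 5, 6} from the sketch above, note that except alone returns only rows of df1 that are absent from df2, i.e. a one-sided difference:

// One-sided difference only: rows in df1 but not in df2.
df1.except(df2).show()  // 1, 2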
Answered by Tagar
Notice that EXCEPT (or MINUS, which is just an alias for EXCEPT) de-dupes the results. So if you expect the "except" set (the diff you mentioned) plus the "intersect" set to equal the original dataframe, consider this feature request that keeps duplicates:
https://issues.apache.org/jira/browse/SPARK-21274
As I wrote there, "EXCEPT ALL" can be rewritten in Spark SQL as
SELECT t1.a, t1.b, t1.c
FROM tab1 t1
LEFT OUTER JOIN tab2 t2
  ON (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
-- rows of tab1 with no match in tab2 have all t2 columns NULL
WHERE COALESCE(t2.a, t2.b, t2.c) IS NULL
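For reference, the same left-join-plus-null-filter pattern can be written in the DataFrame API as a left anti join, which also keeps duplicates (a sketch assuming Spark 2.x Datasets named tab1 and tab2 with columns a, b, c, as in the SQL above):

// Left anti join: keep rows of tab1 that have no match in tab2.
// Unlike EXCEPT, it does not de-duplicate the surviving rows.
val diffKeepingDups = tab1.join(tab2, Seq("a", "b", "c"), "left_anti")

The feature request above was eventually resolved: Spark 2.4 added Dataset.exceptAll and intersectAll, which provide these duplicate-preserving semantics directly.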
Answered by J. Salmoral
I think it could be more efficient using a left join and then filtering out the nulls.
import org.apache.spark.sql.functions.col

df1.join(df2, Seq("some_join_key", "some_other_join_key"), "left")
  // unmatched rows of df1 have NULL in every df2-only column
  .where(col("column_just_present_in_df2").isNull)
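The same result can be obtained in a single step with Spark's built-in left anti join (same hypothetical column names as above):

df1.join(df2, Seq("some_join_key", "some_other_join_key"), "left_anti")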

