How to obtain the symmetric difference between two DataFrames in Scala?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license, note the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/36199901/
Asked by WillD
In the Spark SQL 1.6 API (Scala), DataFrame has functions for intersect and except, but not one for symmetric difference. Obviously, a combination of union and except can be used to generate the difference:
df1.except(df2).union(df2.except(df1))
But this seems a bit awkward. In my experience, if something seems awkward, there's a better way to do it, especially in Scala.
Answered by zero323
You can always rewrite it as:
df1.unionAll(df2).except(df1.intersect(df2))
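For a concrete end-to-end illustration, here is a minimal, self-contained sketch of this approach (assuming Spark 2.x, where union replaces the deprecated 1.6-era unionAll; the sample data and the symDiff name are made up for the example):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("symdiff").getOrCreate()
import spark.implicits._

val df1 = Seq(1, 2, 3, 4).toDF("id")
val df2 = Seq(3, 4, 5, 6).toDF("id")

// Symmetric difference: rows that appear in exactly one of the two DataFrames.
val symDiff = df1.union(df2).except(df1.intersect(df2))
symDiff.show()  // 1, 2, 5, 6 (in some order)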
Seriously though, UNION, INTERSECT, and EXCEPT / MINUS are pretty much the standard set of SQL combining operators. I am not aware of any system which provides an XOR-like operation out of the box. Most likely that is because it is trivial to implement using the other three, and there is not much to optimize there.
Answered by Tal Barda
Why not the below?
df1.except(df2)
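Reusing the hypothetical df1 = {1, 2, 3, 4} and df2 = {3, 4, 5, 6} from the sketch above, note that except alone returns only rows of df1 that are absent from df2, i.e. a one-sided difference:

// One-sided difference only: rows in df1 but not in df2.
df1.except(df2).show()  // 1, 2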
Answered by Tagar
Notice that EXCEPT (or MINUS, which is just an alias for EXCEPT) de-dupes the results. So if you expect the "except" set (the diff you mentioned) plus the "intersect" set to equal the original dataframe, consider this feature request that keeps duplicates:
https://issues.apache.org/jira/browse/SPARK-21274
As I wrote there, "EXCEPT ALL" can be rewritten in Spark SQL as
SELECT t1.a, t1.b, t1.c
FROM tab1 t1
LEFT OUTER JOIN tab2 t2
  ON (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
-- rows of tab1 with no match in tab2 have all t2 columns NULL
WHERE COALESCE(t2.a, t2.b, t2.c) IS NULL
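For reference, the same left-join-plus-null-filter pattern can be written in the DataFrame API as a left anti join, which also keeps duplicates (a sketch assuming Spark 2.x Datasets named tab1 and tab2 with columns a, b, c, as in the SQL above):

// Left anti join: keep rows of tab1 that have no match in tab2.
// Unlike EXCEPT, it does not de-duplicate the surviving rows.
val diffKeepingDups = tab1.join(tab2, Seq("a", "b", "c"), "left_anti")

The feature request above was eventually resolved: Spark 2.4 added Dataset.exceptAll and intersectAll, which provide these duplicate-preserving semantics directly.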
Answered by J. Salmoral
I think it could be more efficient using a left join and then filtering out the nulls.
import org.apache.spark.sql.functions.col

df1.join(df2, Seq("some_join_key", "some_other_join_key"), "left")
  // unmatched rows of df1 have NULL in every df2-only column
  .where(col("column_just_present_in_df2").isNull)
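The same result can be obtained in a single step with Spark's built-in left anti join (same hypothetical column names as above):

df1.join(df2, Seq("some_join_key", "some_other_join_key"), "left_anti")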

