Original question: http://stackoverflow.com/questions/34992182/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Where is the union() method on the Spark DataFrame class?
Asked by Milen Kovachev
I am using the Java connector for Spark and would like to union two DataFrames, but bizarrely the DataFrame class only has unionAll. Is this intentional, and is there a way to union two DataFrames without duplicates?
Accepted answer by zero323
Is this intentional
I think it is safe to assume that it is intentional. Other union operators like RDD.union and Dataset.union keep duplicates as well.
If you think about it, this makes sense. While an operation equivalent to UNION ALL is just a logical operation that requires no data access or network traffic, finding distinct elements requires a shuffle and can therefore be quite expensive.
is there a way to union two DataFrames without duplicates?
df1.unionAll(df2).distinct()
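The semantic difference between unionAll and a deduplicated union can be illustrated without Spark. This is a minimal plain-Java sketch using the Stream API; the names df1 and df2 are illustrative stand-ins for the two DataFrames, not Spark objects:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class UnionDemo {
    public static void main(String[] args) {
        List<Integer> df1 = List.of(1, 2, 3);
        List<Integer> df2 = List.of(3, 4);

        // unionAll semantics: plain concatenation, duplicates kept.
        // Cheap, because no elements ever need to be compared.
        List<Integer> unionAll = Stream.concat(df1.stream(), df2.stream())
                .collect(Collectors.toList());
        System.out.println(unionAll); // [1, 2, 3, 3, 4]

        // union-without-duplicates semantics: concatenate, then deduplicate.
        // distinct() must compare elements against each other; in Spark the
        // analogous step is what triggers the expensive shuffle.
        List<Integer> union = Stream.concat(df1.stream(), df2.stream())
                .distinct()
                .collect(Collectors.toList());
        System.out.println(union); // [1, 2, 3, 4]
    }
}
```

The second pipeline mirrors the accepted answer's df1.unionAll(df2).distinct(): concatenation first, deduplication as a separate, costlier step.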