Original question: http://stackoverflow.com/questions/34992182/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Where is the union() method on the Spark DataFrame class?
Asked by Milen Kovachev
I am using the Java connector for Spark and would like to union two DataFrames, but bizarrely the DataFrame class only has unionAll. Is this intentional, and is there a way to union two DataFrames without duplicates?
Accepted answer by zero323
Is this intentional
I think it is safe to assume that it is intentional. Other union operators like RDD.union and Dataset.union keep duplicates as well.
If you think about it, this makes sense. While an operation equivalent to UNION ALL is just a logical operation that requires no data access or network traffic, finding distinct elements requires a shuffle and can therefore be quite expensive.
is there a way to union two DataFrames without duplicates?
df1.unionAll(df2).distinct()
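The semantic difference between unionAll and a deduplicated union can be illustrated without Spark. This is a minimal plain-Java sketch using the Stream API; the names df1 and df2 are illustrative stand-ins for the two DataFrames, not Spark objects:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class UnionDemo {
    public static void main(String[] args) {
        List<Integer> df1 = List.of(1, 2, 3);
        List<Integer> df2 = List.of(3, 4);

        // unionAll semantics: plain concatenation, duplicates kept.
        // Cheap, because no elements ever need to be compared.
        List<Integer> unionAll = Stream.concat(df1.stream(), df2.stream())
                .collect(Collectors.toList());
        System.out.println(unionAll); // [1, 2, 3, 3, 4]

        // union-without-duplicates semantics: concatenate, then deduplicate.
        // distinct() must compare elements against each other; in Spark the
        // analogous step is what triggers the expensive shuffle.
        List<Integer> union = Stream.concat(df1.stream(), df2.stream())
                .distinct()
                .collect(Collectors.toList());
        System.out.println(union); // [1, 2, 3, 4]
    }
}
```

The second pipeline mirrors the accepted answer's df1.unionAll(df2).distinct(): concatenation first, deduplication as a separate, costlier step.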