Why would I want .union over .unionAll in Spark for SchemaRDDs?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/29022530/

Tags: sql, scala, apache-spark, union, union-all

Asked by duber

I'm trying to wrap my head around these two functions in the Spark SQL documentation:

  • def union(other: RDD[Row]): RDD[Row]

    Return the union of this RDD and another one.

  • def unionAll(otherPlan: SchemaRDD): SchemaRDD

    Combines the tuples of two RDDs with the same schema, keeping duplicates.

This is not the standard behavior of UNION vs UNION ALL, as documented in this SO question.

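For reference, here is how the standard semantics look in a modern Spark 2.x spark-shell (a sketch; the people view and its rows are made up for illustration, and spark with its implicits are the shell's built-ins):

scala> import spark.implicits._
scala> Seq(("Alpha", 1), ("Alpha", 1)).toDF("name", "age").createOrReplaceTempView("people")
scala> spark.sql("SELECT * FROM people UNION SELECT * FROM people").count()     // 1: UNION deduplicates
scala> spark.sql("SELECT * FROM people UNION ALL SELECT * FROM people").count() // 4: UNION ALL keeps duplicates
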
My code below, borrowed from the Spark SQL documentation, shows the two functions returning the same results.

scala> case class Person(name: String, age: Int)
scala> import org.apache.spark.sql._
scala> val one = sc.parallelize(Array(Person("Alpha", 1), Person("Beta", 2)))
scala> val two = sc.parallelize(Array(Person("Alpha", 1), Person("Beta", 2), Person("Gamma", 3)))
scala> val schemaString = "name age"
scala> val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
scala> // applySchema expects an RDD[Row], so convert the case-class RDDs first
scala> // (age becomes a String here to match the all-StringType schema)
scala> val peopleSchemaRDD1 = sqlContext.applySchema(one.map(p => Row(p.name, p.age.toString)), schema)
scala> val peopleSchemaRDD2 = sqlContext.applySchema(two.map(p => Row(p.name, p.age.toString)), schema)
scala> peopleSchemaRDD1.union(peopleSchemaRDD2).collect
res34: Array[org.apache.spark.sql.Row] = Array([Alpha,1], [Beta,2], [Alpha,1], [Beta,2], [Gamma,3])
scala> peopleSchemaRDD1.unionAll(peopleSchemaRDD2).collect
res35: Array[org.apache.spark.sql.Row] = Array([Alpha,1], [Beta,2], [Alpha,1], [Beta,2], [Gamma,3])

Why would I prefer one over the other?

Answered by Kris

In Spark 1.6, the above version of union was removed, so unionAll was all that remained.

In Spark 2.0, unionAll was renamed to union, with unionAll kept in for backward compatibility (I guess).

In any case, no deduplication is done by either union (Spark 2.0) or unionAll (Spark 1.6).

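For illustration, a minimal Spark 2.x spark-shell sketch (the two DataFrames are made up for the example; spark and its implicits are the shell's built-ins):

scala> import spark.implicits._
scala> val df1 = Seq(("Alpha", 1), ("Beta", 2)).toDF("name", "age")
scala> val df2 = Seq(("Alpha", 1), ("Gamma", 3)).toDF("name", "age")
scala> df1.union(df2).count()   // 4: the duplicate ("Alpha", 1) row survives, like SQL's UNION ALL
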
Answered by Keshav Potluri

unionAll() was deprecated in Spark 2.0, and going forward, union() is the only recommended method.

In either case, neither union nor unionAll performs a SQL-style deduplication of the data. To remove duplicate rows, just use union() followed by distinct().

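A sketch of that pattern in a Spark 2.x spark-shell (the DataFrames are made up for the example; spark and its implicits are the shell's built-ins):

scala> import spark.implicits._
scala> val df1 = Seq(("Alpha", 1), ("Beta", 2)).toDF("name", "age")
scala> val df2 = Seq(("Alpha", 1), ("Gamma", 3)).toDF("name", "age")
scala> df1.union(df2).distinct().count()   // 3: union() plus distinct() behaves like SQL's UNION
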
Answered by Joe Halliwell

Judging from its type signature and (questionable) semantics, I believe union() was vestigial.

The more modern DataFrame API offers only unionAll().
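
For context, a Spark 1.6-era sketch of that API (assuming a spark-shell with the built-in sqlContext; the DataFrames are illustrative):

scala> import sqlContext.implicits._
scala> val df1 = sc.parallelize(Seq(("Alpha", 1), ("Beta", 2))).toDF("name", "age")
scala> val df2 = sc.parallelize(Seq(("Alpha", 1), ("Gamma", 3))).toDF("name", "age")
scala> df1.unionAll(df2).count()   // 4: duplicates are kept; the 1.x DataFrame API has no union()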