Why would I want .union over .unionAll in Spark for SchemaRDDs?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA terms and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/29022530/
Asked by duber
I'm trying to wrap my head around these two functions in the Spark SQL documentation:
def union(other: RDD[Row]): RDD[Row]
Return the union of this RDD and another one.
def unionAll(otherPlan: SchemaRDD): SchemaRDD
Combines the tuples of two RDDs with the same schema, keeping duplicates.
This is not the standard behavior of UNION vs UNION ALL, as documented in this SO question.
My code here, borrowing from the Spark SQL documentation, has the two functions returning the same results.
scala> case class Person(name: String, age: Int)
scala> import org.apache.spark.sql._
scala> val one = sc.parallelize(Array(Person("Alpha",1), Person("Beta",2)))
scala> val two = sc.parallelize(Array(Person("Alpha",1), Person("Beta",2), Person("Gamma", 3)))
scala> val schemaString = "name age"
scala> val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
scala> val peopleSchemaRDD1 = sqlContext.applySchema(one, schema)
scala> val peopleSchemaRDD2 = sqlContext.applySchema(two, schema)
scala> peopleSchemaRDD1.union(peopleSchemaRDD2).collect
res34: Array[org.apache.spark.sql.Row] = Array([Alpha,1], [Beta,2], [Alpha,1], [Beta,2], [Gamma,3])
scala> peopleSchemaRDD1.unionAll(peopleSchemaRDD2).collect
res35: Array[org.apache.spark.sql.Row] = Array([Alpha,1], [Beta,2], [Alpha,1], [Beta,2], [Gamma,3])
Why would I prefer one over the other?
Answered by Kris
In Spark 1.6, the above version of union was removed, so unionAll was all that remained.
In Spark 2.0, unionAll was renamed to union, with unionAll kept in for backward compatibility (I guess).
In any case, no deduplication is done in either union (Spark 2.0) or unionAll (Spark 1.6).
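A minimal sketch of that claim in a Spark 2.x spark-shell (df1 and df2 are hypothetical DataFrames; toDF comes from the implicits the shell imports by default):
scala> val df1 = Seq(("Alpha", 1), ("Beta", 2)).toDF("name", "age")
scala> val df2 = Seq(("Alpha", 1), ("Gamma", 3)).toDF("name", "age")
scala> df1.union(df2).count()     // expect 4: the duplicate ("Alpha", 1) row is kept
scala> df1.unionAll(df2).count()  // expect 4: same result via the deprecated alias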
Answered by Keshav Potluri
unionAll() was deprecated in Spark 2.0, and for all future reference, union() is the only recommended method.
In either case, union or unionAll, neither performs a SQL-style deduplication of the data. To remove any duplicate rows, just use union() followed by a distinct().
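A minimal sketch of that union-then-distinct pattern, again assuming a Spark 2.x spark-shell session with hypothetical DataFrames a and b:
scala> val a = Seq(1, 2, 2).toDF("n")
scala> val b = Seq(2, 3).toDF("n")
scala> a.union(b).distinct().show()  // expect the rows 1, 2, 3: SQL UNION semantics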
Answered by Joe Halliwell
Judging from its type signature and (questionable) semantics, I believe union() was vestigial.
The more modern DataFrame API offers only unionAll().