Why would I want .union over .unionAll in Spark for SchemaRDDs?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/29022530/

Tags: sql, scala, apache-spark, union, union-all

Asked by duber

I'm trying to wrap my head around these two functions in the Spark SQL documentation:

  • def union(other: RDD[Row]): RDD[Row]

    Return the union of this RDD and another one.

  • def unionAll(otherPlan: SchemaRDD): SchemaRDD

    Combines the tuples of two RDDs with the same schema, keeping duplicates.

This is not the standard behavior of UNION vs UNION ALL, as documented in this SO question.

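For reference, here is how the standard semantics look in a modern Spark 2.x spark-shell (a sketch; the people view and its rows are made up for illustration, and spark with its implicits are the shell's built-ins):

scala> import spark.implicits._
scala> Seq(("Alpha", 1), ("Alpha", 1)).toDF("name", "age").createOrReplaceTempView("people")
scala> spark.sql("SELECT * FROM people UNION SELECT * FROM people").count()     // 1: UNION deduplicates
scala> spark.sql("SELECT * FROM people UNION ALL SELECT * FROM people").count() // 4: UNION ALL keeps duplicates
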
My code below, borrowed from the Spark SQL documentation, shows the two functions returning the same results.

scala> case class Person(name: String, age: Int)
scala> import org.apache.spark.sql._
scala> val one = sc.parallelize(Array(Person("Alpha", 1), Person("Beta", 2)))
scala> val two = sc.parallelize(Array(Person("Alpha", 1), Person("Beta", 2), Person("Gamma", 3)))
scala> val schemaString = "name age"
scala> val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
scala> // applySchema expects an RDD[Row], so convert the case-class RDDs first
scala> // (age becomes a String here to match the all-StringType schema)
scala> val peopleSchemaRDD1 = sqlContext.applySchema(one.map(p => Row(p.name, p.age.toString)), schema)
scala> val peopleSchemaRDD2 = sqlContext.applySchema(two.map(p => Row(p.name, p.age.toString)), schema)
scala> peopleSchemaRDD1.union(peopleSchemaRDD2).collect
res34: Array[org.apache.spark.sql.Row] = Array([Alpha,1], [Beta,2], [Alpha,1], [Beta,2], [Gamma,3])
scala> peopleSchemaRDD1.unionAll(peopleSchemaRDD2).collect
res35: Array[org.apache.spark.sql.Row] = Array([Alpha,1], [Beta,2], [Alpha,1], [Beta,2], [Gamma,3])

Why would I prefer one over the other?

Answered by Kris

In Spark 1.6, the above version of union was removed, so unionAll was all that remained.

In Spark 2.0, unionAll was renamed to union, with unionAll kept in for backward compatibility (I guess).

In any case, no deduplication is done by either union (Spark 2.0) or unionAll (Spark 1.6).

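For illustration, a minimal Spark 2.x spark-shell sketch (the two DataFrames are made up for the example; spark and its implicits are the shell's built-ins):

scala> import spark.implicits._
scala> val df1 = Seq(("Alpha", 1), ("Beta", 2)).toDF("name", "age")
scala> val df2 = Seq(("Alpha", 1), ("Gamma", 3)).toDF("name", "age")
scala> df1.union(df2).count()   // 4: the duplicate ("Alpha", 1) row survives, like SQL's UNION ALL
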
Answered by Keshav Potluri

unionAll() was deprecated in Spark 2.0, and going forward, union() is the only recommended method.

In either case, neither union nor unionAll performs a SQL-style deduplication of the data. To remove duplicate rows, just use union() followed by distinct().

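A sketch of that pattern in a Spark 2.x spark-shell (the DataFrames are made up for the example; spark and its implicits are the shell's built-ins):

scala> import spark.implicits._
scala> val df1 = Seq(("Alpha", 1), ("Beta", 2)).toDF("name", "age")
scala> val df2 = Seq(("Alpha", 1), ("Gamma", 3)).toDF("name", "age")
scala> df1.union(df2).distinct().count()   // 3: union() plus distinct() behaves like SQL's UNION
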
Answered by Joe Halliwell

Judging from its type signature and (questionable) semantics, I believe union() was vestigial.

The more modern DataFrame API offers only unionAll().
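
For context, a Spark 1.6-era sketch of that API (assuming a spark-shell with the built-in sqlContext; the DataFrames are illustrative):

scala> import sqlContext.implicits._
scala> val df1 = sc.parallelize(Seq(("Alpha", 1), ("Beta", 2))).toDF("name", "age")
scala> val df2 = sc.parallelize(Seq(("Alpha", 1), ("Gamma", 3))).toDF("name", "age")
scala> df1.unionAll(df2).count()   // 4: duplicates are kept; the 1.x DataFrame API has no union()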