Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must likewise comply with the CC BY-SA license, link to the original URL, and attribute it to the original authors (not me): StackOverflow
原文地址: http://stackoverflow.com/questions/32705056/
What is going wrong with `unionAll` of Spark `DataFrame`?
Asked by Martin Senne
Using Spark 1.5.0, and given the following code, I expect `unionAll` to union `DataFrame`s based on their column names. In the code, I'm using some FunSuite for passing in a SparkContext `sc`:
object Entities {

  case class A (a: Int, b: Int)
  case class B (b: Int, a: Int)

  val as = Seq(
    A(1,3),
    A(2,4)
  )

  val bs = Seq(
    B(5,3),
    B(6,4)
  )
}
class UnsortedTestSuite extends SparkFunSuite {

  configuredUnitTest("The truth test.") { sc =>
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    val aDF = sc.parallelize(Entities.as, 4).toDF
    val bDF = sc.parallelize(Entities.bs, 4).toDF
    aDF.show()
    bDF.show()
    aDF.unionAll(bDF).show
  }
}
Output:
+---+---+
| a| b|
+---+---+
| 1| 3|
| 2| 4|
+---+---+
+---+---+
| b| a|
+---+---+
| 5| 3|
| 6| 4|
+---+---+
+---+---+
| a| b|
+---+---+
| 1| 3|
| 2| 4|
| 5| 3|
| 6| 4|
+---+---+
Why does the result contain intermixed "b" and "a" columns, instead of aligning columns based on column names? Sounds like a serious bug!?
Answered by zero323
It doesn't look like a bug at all. What you see is standard SQL behavior, and every major RDBMS, including PostgreSQL, MySQL, Oracle and MS SQL, behaves exactly the same. You'll find SQL Fiddle examples linked with the names.
To quote the PostgreSQL manual:
In order to calculate the union, intersection, or difference of two queries, the two queries must be "union compatible", which means that they return the same number of columns and the corresponding columns have compatible data types.
Column names, excluding the first table in the set operation, are simply ignored.
This behavior comes directly from Relational Algebra, where the basic building block is a tuple. Since tuples are ordered, a union of two sets of tuples is equivalent (ignoring duplicates handling) to the output you get here.
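The positional nature of tuple union can be seen in plain Scala, without Spark at all. The sketch below mirrors the question's `Entities` case classes (the names and values are taken from the question; the tuple conversion is only illustrative):

```scala
case class A(a: Int, b: Int)
case class B(b: Int, a: Int)

// Rows become plain tuples: field names vanish, only position remains.
val as = Seq(A(1, 3), A(2, 4)).map(x => (x.a, x.b))
val bs = Seq(B(5, 3), B(6, 4)).map(x => (x.b, x.a))  // B's *first* field is `b`

// Concatenation (a union without dedup, like UNION ALL) is purely positional:
// B(5, 3) contributes (5, 3), so its `b` value lands in the first slot,
// which the combined schema labels `a`.
val unioned = as ++ bs
println(unioned)  // List((1,3), (2,4), (5,3), (6,4))
```

This reproduces exactly the intermixed output the question shows: nothing looks at names once the rows are tuples.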
If you want to match using names you can do something like this:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def unionByName(a: DataFrame, b: DataFrame): DataFrame = {
  val columns = a.columns.toSet.intersect(b.columns.toSet).map(col).toSeq
  a.select(columns: _*).unionAll(b.select(columns: _*))
}
To check both names and types, it should be enough to replace `columns` with:
a.dtypes.toSet.intersect(b.dtypes.toSet).map{case (c, _) => col(c)}.toSeq
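To see what that `dtypes`-based variant does, here is a plain-Scala sketch (no Spark needed; the (name, type) pairs below are made up to mimic the shape of `DataFrame.dtypes` output):

```scala
// Intersecting (name, type) pairs keeps only columns that agree in BOTH
// name and type across the two frames.
val aDtypes = Seq(("a", "IntegerType"), ("b", "IntegerType"))
val bDtypes = Seq(("b", "IntegerType"), ("a", "StringType"))  // `a` differs in type

val common = aDtypes.toSet.intersect(bDtypes.toSet).map { case (c, _) => c }
println(common)  // Set(b): `a` is dropped because its types disagree
```

A name that matches but with a mismatched type is silently excluded from the union, which is usually safer than letting Spark attempt an implicit cast.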
Answered by Avishek Bhattacharya
This issue is getting fixed in Spark 2.3. They are adding support for unionByName in the Dataset API.
https://issues.apache.org/jira/browse/SPARK-21043
Answered by SUBBAREDDY JANGALAPALLI
No issues or bugs here: if you look at your case class B closely, it becomes clear. In case class A you declared the field order (a, b), and in case class B you declared the order (b, a), so the result is exactly what that declaration order implies:
case class A (a: Int, b: Int)
case class B (b: Int, a: Int)
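A minimal plain-Scala sketch of the point above: mapping `B` rows into `A`'s field order by name, before any positional union, restores the alignment the asker expected (the mapping step is illustrative, not part of the original answer):

```scala
case class A(a: Int, b: Int)
case class B(b: Int, a: Int)

// Reorder by name: B(b = 5, a = 3) becomes A(a = 3, b = 5).
val bsAligned = Seq(B(5, 3), B(6, 4)).map(x => A(x.a, x.b))
println(bsAligned)  // List(A(3,5), A(4,6))
```

After this mapping, a positional union with the `A` rows would line up `a` with `a` and `b` with `b`.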
Thanks, Subbu
Answered by Mario Rugeles Perez
Use unionByName:
Excerpt from the documentation:
def unionByName(other: Dataset[T]): Dataset[T]
The difference between this function and union is that this function resolves columns by name (not by position):
val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")
df1.unionByName(df2).show

// output:
// +----+----+----+
// |col0|col1|col2|
// +----+----+----+
// |   1|   2|   3|
// |   6|   4|   5|
// +----+----+----+
Answered by Rohan Aletty
As discussed in SPARK-9813, it seems that as long as the data types and number of columns are the same across frames, the unionAll operation should work. Please see the comments there for additional discussion.

