Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must likewise comply with the CC BY-SA license, link to the original URL, and attribute it to the original authors (not me): StackOverflow
原文地址: http://stackoverflow.com/questions/32705056/
What is going wrong with `unionAll` of Spark `DataFrame`?
Asked by Martin Senne
Using Spark 1.5.0, and given the following code, I expect `unionAll` to union `DataFrame`s based on their column names. In the code, I'm using some FunSuite for passing in a SparkContext `sc`:
object Entities {

  case class A (a: Int, b: Int)
  case class B (b: Int, a: Int)

  val as = Seq(
    A(1,3),
    A(2,4)
  )

  val bs = Seq(
    B(5,3),
    B(6,4)
  )
}
class UnsortedTestSuite extends SparkFunSuite {

  configuredUnitTest("The truth test.") { sc =>
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    val aDF = sc.parallelize(Entities.as, 4).toDF
    val bDF = sc.parallelize(Entities.bs, 4).toDF
    aDF.show()
    bDF.show()
    aDF.unionAll(bDF).show
  }
}
Output:
+---+---+
| a| b|
+---+---+
| 1| 3|
| 2| 4|
+---+---+
+---+---+
| b| a|
+---+---+
| 5| 3|
| 6| 4|
+---+---+
+---+---+
| a| b|
+---+---+
| 1| 3|
| 2| 4|
| 5| 3|
| 6| 4|
+---+---+
Why does the result contain intermixed "b" and "a" columns, instead of aligning columns based on column names? Sounds like a serious bug!?
Answered by zero323
It doesn't look like a bug at all. What you see is standard SQL behavior, and every major RDBMS, including PostgreSQL, MySQL, Oracle and MS SQL, behaves exactly the same. You'll find SQL Fiddle examples linked with the names.
To quote the PostgreSQL manual:
In order to calculate the union, intersection, or difference of two queries, the two queries must be "union compatible", which means that they return the same number of columns and the corresponding columns have compatible data types.
Column names, excluding the first table in the set operation, are simply ignored.
This behavior comes directly from Relational Algebra, where the basic building block is a tuple. Since tuples are ordered, a union of two sets of tuples is equivalent (ignoring duplicates handling) to the output you get here.
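The positional nature of tuple union can be seen in plain Scala, without Spark at all. The sketch below mirrors the question's `Entities` case classes (the names and values are taken from the question; the tuple conversion is only illustrative):

```scala
case class A(a: Int, b: Int)
case class B(b: Int, a: Int)

// Rows become plain tuples: field names vanish, only position remains.
val as = Seq(A(1, 3), A(2, 4)).map(x => (x.a, x.b))
val bs = Seq(B(5, 3), B(6, 4)).map(x => (x.b, x.a))  // B's *first* field is `b`

// Concatenation (a union without dedup, like UNION ALL) is purely positional:
// B(5, 3) contributes (5, 3), so its `b` value lands in the first slot,
// which the combined schema labels `a`.
val unioned = as ++ bs
println(unioned)  // List((1,3), (2,4), (5,3), (6,4))
```

This reproduces exactly the intermixed output the question shows: nothing looks at names once the rows are tuples.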
If you want to match using names you can do something like this:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def unionByName(a: DataFrame, b: DataFrame): DataFrame = {
  val columns = a.columns.toSet.intersect(b.columns.toSet).map(col).toSeq
  a.select(columns: _*).unionAll(b.select(columns: _*))
}
To check both names and types, it should be enough to replace `columns` with:
a.dtypes.toSet.intersect(b.dtypes.toSet).map{case (c, _) => col(c)}.toSeq
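To see what that `dtypes`-based variant does, here is a plain-Scala sketch (no Spark needed; the (name, type) pairs below are made up to mimic the shape of `DataFrame.dtypes` output):

```scala
// Intersecting (name, type) pairs keeps only columns that agree in BOTH
// name and type across the two frames.
val aDtypes = Seq(("a", "IntegerType"), ("b", "IntegerType"))
val bDtypes = Seq(("b", "IntegerType"), ("a", "StringType"))  // `a` differs in type

val common = aDtypes.toSet.intersect(bDtypes.toSet).map { case (c, _) => c }
println(common)  // Set(b): `a` is dropped because its types disagree
```

A name that matches but with a mismatched type is silently excluded from the union, which is usually safer than letting Spark attempt an implicit cast.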
Answered by Avishek Bhattacharya
This issue is getting fixed in Spark 2.3. They are adding support for unionByName in the Dataset API.
https://issues.apache.org/jira/browse/SPARK-21043
Answered by SUBBAREDDY JANGALAPALLI
No issues or bugs here: if you look at your case class B closely, it becomes clear. In case class A you declared the field order (a, b), and in case class B you declared the order (b, a), so the result is exactly what that declaration order implies:
case class A (a: Int, b: Int)
case class B (b: Int, a: Int)
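A minimal plain-Scala sketch of the point above: mapping `B` rows into `A`'s field order by name, before any positional union, restores the alignment the asker expected (the mapping step is illustrative, not part of the original answer):

```scala
case class A(a: Int, b: Int)
case class B(b: Int, a: Int)

// Reorder by name: B(b = 5, a = 3) becomes A(a = 3, b = 5).
val bsAligned = Seq(B(5, 3), B(6, 4)).map(x => A(x.a, x.b))
println(bsAligned)  // List(A(3,5), A(4,6))
```

After this mapping, a positional union with the `A` rows would line up `a` with `a` and `b` with `b`.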
Thanks, Subbu
Answered by Mario Rugeles Perez
Use unionByName:
Excerpt from the documentation:
def unionByName(other: Dataset[T]): Dataset[T]
The difference between this function and union is that this function resolves columns by name (not by position):
val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")
df1.unionByName(df2).show

// output:
// +----+----+----+
// |col0|col1|col2|
// +----+----+----+
// |   1|   2|   3|
// |   6|   4|   5|
// +----+----+----+
Answered by Rohan Aletty
As discussed in SPARK-9813, it seems that as long as the data types and number of columns are the same across frames, the unionAll operation should work. Please see the comments there for additional discussion.

