Cast schema of a data frame in Spark and Scala
Original URL: http://stackoverflow.com/questions/40232615/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow.

Asked by Massimo Paolucci
I want to cast the schema of a dataframe to change the type of some columns using Spark and Scala.
Specifically, I am trying to use the as[U] function, whose description reads: "Returns a new Dataset where each record has been mapped on to the specified type. The method used to map columns depend on the type of U."
In principle this is exactly what I want, but I cannot get it to work.
Here is a simple example taken from https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
// definition of data (toDF requires the implicits import shown below, e.g. import session.implicits._)
val data = Seq(("a", 1), ("b", 2)).toDF("a", "b")
As expected, the schema of data is:
root
|-- a: string (nullable = true)
|-- b: integer (nullable = false)
I would like to cast the column "b" to Double. So I try the following:
import session.implicits._

println(" --------------------------- Casting using (String Double)")
val data_TupleCast = data.as[(String, Double)]
data_TupleCast.show()
data_TupleCast.printSchema()

println(" --------------------------- Casting using ClassData_Double")
case class ClassData_Double(a: String, b: Double)
val data_ClassCast = data.as[ClassData_Double]
data_ClassCast.show()
data_ClassCast.printSchema()
As I understand the definition of as[U], the new Dataset should have the following schema:
root
|-- a: string (nullable = true)
|-- b: double (nullable = false)
But the output is:
--------------------------- Casting using (String Double)
+---+---+
| a| b|
+---+---+
| a| 1|
| b| 2|
+---+---+
root
|-- a: string (nullable = true)
|-- b: integer (nullable = false)
--------------------------- Casting using ClassData_Double
+---+---+
| a| b|
+---+---+
| a| 1|
| b| 2|
+---+---+
root
|-- a: string (nullable = true)
|-- b: integer (nullable = false)
which shows that column "b" has not been cast to double.
Any hints on what I am doing wrong?
BTW: I am aware of the previous post "How to change column types in Spark SQL's DataFrame?". I know I can change the type of columns one at a time, but I am looking for a more general solution that changes the schema of the whole data in one shot (and I am trying to understand Spark in the process).
Answered by Glennie Helles Sindholt
Well, since functions are chained and Spark does lazy evaluation, it actually does change the schema of the whole data in one shot, even if you write it as changing one column at a time, like this:
import org.apache.spark.sql.types.{DoubleType, StringType}
import spark.implicits._

df.withColumn("x", 'x.cast(DoubleType)).withColumn("y", 'y.cast(StringType))...
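For completeness, here is a minimal runnable sketch of that approach applied to the question's two-column frame; the SparkSession named spark and the schema comment at the end are my assumptions, not part of the original answer:

import org.apache.spark.sql.types.DoubleType
import spark.implicits._

val data = Seq(("a", 1), ("b", 2)).toDF("a", "b")
// One chained transformation; Spark plans it lazily and executes it in a single pass.
val casted = data.withColumn("b", $"b".cast(DoubleType))
casted.printSchema()
// root
//  |-- a: string (nullable = true)
//  |-- b: double (nullable = false)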
As an alternative, I'm thinking you could use map to do your cast in one go, like:
df.map { t => (t._1, t._2.asInstanceOf[Double], t._3.asInstanceOf[...], ...) }
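A runnable variant of this idea for the question's frame (my adaptation, again assuming a SparkSession named spark): on a plain DataFrame, map receives a Row without tuple accessors, so it is convenient to go through a typed Dataset first.

import spark.implicits._

val data = Seq(("a", 1), ("b", 2)).toDF("a", "b")
// Map every record to the target types in one go, then restore the column names.
val recast = data.as[(String, Int)]
  .map { case (a, b) => (a, b.toDouble) }
  .toDF("a", "b")
recast.printSchema()
// root
//  |-- a: string (nullable = true)
//  |-- b: double (nullable = false)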

