Cast schema of a data frame in Spark and Scala
Original URL: http://stackoverflow.com/questions/40232615/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow.

Asked by Massimo Paolucci
I want to cast the schema of a dataframe to change the type of some columns using Spark and Scala.
Specifically, I am trying to use the as[U] function, whose description reads: "Returns a new Dataset where each record has been mapped on to the specified type. The method used to map columns depend on the type of U."
In principle this is exactly what I want, but I cannot get it to work.
Here is a simple example taken from https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
// definition of data (toDF requires the implicits import shown below, e.g. import session.implicits._)
val data = Seq(("a", 1), ("b", 2)).toDF("a", "b")
As expected, the schema of data is:
root
|-- a: string (nullable = true)
|-- b: integer (nullable = false)
I would like to cast the column "b" to Double. So I try the following:
import session.implicits._

println(" --------------------------- Casting using (String Double)")
val data_TupleCast = data.as[(String, Double)]
data_TupleCast.show()
data_TupleCast.printSchema()

println(" --------------------------- Casting using ClassData_Double")
case class ClassData_Double(a: String, b: Double)
val data_ClassCast = data.as[ClassData_Double]
data_ClassCast.show()
data_ClassCast.printSchema()
As I understand the definition of as[U], the new Dataset should have the following schema:
root
|-- a: string (nullable = true)
|-- b: double (nullable = false)
But the output is:
--------------------------- Casting using (String Double)
+---+---+
| a| b|
+---+---+
| a| 1|
| b| 2|
+---+---+
root
|-- a: string (nullable = true)
|-- b: integer (nullable = false)
--------------------------- Casting using ClassData_Double
+---+---+
| a| b|
+---+---+
| a| 1|
| b| 2|
+---+---+
root
|-- a: string (nullable = true)
|-- b: integer (nullable = false)
which shows that column "b" has not been cast to double.
Any hints on what I am doing wrong?
BTW: I am aware of the previous post "How to change column types in Spark SQL's DataFrame?". I know I can change the type of columns one at a time, but I am looking for a more general solution that changes the schema of the whole data in one shot (and I am trying to understand Spark in the process).
Answered by Glennie Helles Sindholt
Well, since functions are chained and Spark does lazy evaluation, it actually does change the schema of the whole data in one shot, even if you write it as changing one column at a time, like this:
import org.apache.spark.sql.types.{DoubleType, StringType}
import spark.implicits._

df.withColumn("x", 'x.cast(DoubleType)).withColumn("y", 'y.cast(StringType))...
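For completeness, here is a minimal runnable sketch of that approach applied to the question's two-column frame; the SparkSession named spark and the schema comment at the end are my assumptions, not part of the original answer:

import org.apache.spark.sql.types.DoubleType
import spark.implicits._

val data = Seq(("a", 1), ("b", 2)).toDF("a", "b")
// One chained transformation; Spark plans it lazily and executes it in a single pass.
val casted = data.withColumn("b", $"b".cast(DoubleType))
casted.printSchema()
// root
//  |-- a: string (nullable = true)
//  |-- b: double (nullable = false)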
As an alternative, I'm thinking you could use map to do your cast in one go, like:
df.map { t => (t._1, t._2.asInstanceOf[Double], t._3.asInstanceOf[...], ...) }
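A runnable variant of this idea for the question's frame (my adaptation, again assuming a SparkSession named spark): on a plain DataFrame, map receives a Row without tuple accessors, so it is convenient to go through a typed Dataset first.

import spark.implicits._

val data = Seq(("a", 1), ("b", 2)).toDF("a", "b")
// Map every record to the target types in one go, then restore the column names.
val recast = data.as[(String, Int)]
  .map { case (a, b) => (a, b.toDouble) }
  .toDF("a", "b")
recast.printSchema()
// root
//  |-- a: string (nullable = true)
//  |-- b: double (nullable = false)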

