How to convert Row of a Scala DataFrame into case class most efficiently?
Disclaimer: this page is a translation of a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA terms, link to the original, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/28166555/
Asked by arivero
Once I have got hold of some Row in Spark, whether from a DataFrame or Catalyst, I want to convert it to a case class in my code. This can be done by matching
someRow match { case Row(a: Long, b: String, c: Double) => myCaseClass(a, b, c) }
But it becomes ugly when the row has a huge number of columns, say a dozen Doubles, some Booleans and even the occasional null.
I would just like to be able to - sorry - cast the Row to myCaseClass. Is that possible, or have I already got the most economical syntax?
Accepted answer by Rahul
DataFrame is simply a type alias of Dataset[Row]. These operations are also referred to as "untyped transformations", in contrast to the "typed transformations" that come with strongly typed Scala/Java Datasets.
The conversion from Dataset[Row] to Dataset[Person] is very simple in Spark:
// sqlContext is your SQLContext instance (with a SparkSession in Spark 2+ use spark.sql(...))
val DFtoProcess = sqlContext.sql("SELECT * FROM peoples WHERE name='test'")
At this point, Spark converts your data into DataFrame = Dataset[Row], a collection of generic Row objects, since it does not know the exact type.
// Create an Encoder for the Java class (in my example, Person is a Java bean class)
// For a Scala case class you do not need the class reference (see the sketch below)
import org.apache.spark.sql.Encoders

val personEncoder = Encoders.bean(classOf[Person])
val DStoProcess = DFtoProcess.as[Person](personEncoder)
Now Spark converts the Dataset[Row] into a Dataset[Person] of type-specific Scala/Java JVM objects, as dictated by the class Person.
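As the comment above notes, a Scala case class does not need a bean encoder. Below is a minimal sketch of that variant (the Person fields, the peoples table and the application name are assumptions for illustration): Encoders.product derives the encoder explicitly, or you can import spark.implicits._ and call .as[Person] directly.
import org.apache.spark.sql.{Encoders, SparkSession}

case class Person(name: String, age: Long)

val spark = SparkSession.builder().appName("row-to-case-class").getOrCreate()
val dfToProcess = spark.sql("SELECT * FROM peoples WHERE name = 'test'")

// Explicit product encoder derived for the case class
val dsExplicit = dfToProcess.as[Person](Encoders.product[Person])

// Or rely on the implicit encoders provided by the session
import spark.implicits._
val dsImplicit = dfToProcess.as[Person]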
Please refer to the link below, provided by Databricks, for further details.
Answered by Glennie Helles Sindholt
As far as I know you cannot cast a Row to a case class, but I sometimes choose to access the row fields directly, like
map(row => myCaseClass(row.getLong(0), row.getString(1), row.getDouble(2)))
I find this to be easier, especially if the case class constructor only needs some of the fields from the row.
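A variant of the same idea is to look fields up by name with Row.getAs[T], which is less brittle than positional getters if the column order changes. A minimal sketch, assuming hypothetical column names and toy data:
import org.apache.spark.sql.SparkSession

case class MyCaseClass(id: Long, name: String, score: Double)

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val dataframe = Seq((1L, "james", 0.5), (2L, "tony", 0.8)).toDF("id", "name", "score")

// Look fields up by name instead of by position
val ds = dataframe.map { row =>
  MyCaseClass(row.getAs[Long]("id"), row.getAs[String]("name"), row.getAs[Double]("score"))
}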
Answered by secfree
scala> import spark.implicits._
scala> val df = Seq((1, "james"), (2, "tony")).toDF("id", "name")
df: org.apache.spark.sql.DataFrame = [id: int, name: string]
scala> case class Student(id: Int, name: String)
defined class Student
scala> df.as[Student].collectAsList
res6: java.util.List[Student] = [Student(1,james), Student(2,tony)]
Here the spark in spark.implicits._ is your SparkSession. If you are inside the REPL the session is already defined as spark; otherwise you need to adjust the name accordingly to correspond to your SparkSession.
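For completeness, a minimal sketch of the same approach in a standalone application (object and application names are illustrative); the key point is that the implicits must come from your own SparkSession instance.
import org.apache.spark.sql.SparkSession

case class Student(id: Int, name: String)

object RowToCaseClass {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("row-to-case-class")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._  // brings the encoders used by .as[Student] into scope

    val df = Seq((1, "james"), (2, "tony")).toDF("id", "name")
    val students = df.as[Student].collect()
    students.foreach(println)

    spark.stop()
  }
}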
Answered by Gianmario Spacagna
Of course you can match a Row object into a case class. Let's suppose your SchemaType has many fields and you want to match a few of them into your case class. If you don't have null fields you can simply do:
case class MyClass(a: Long, b: String, c: Int, d: String, e: String)
dataframe.map {
case Row(a: java.math.BigDecimal,
b: String,
c: Int,
d: String,
_: java.sql.Date,
e: java.sql.Date,
_: java.sql.Timestamp,
_: java.sql.Timestamp,
_: java.math.BigDecimal,
_: String) => MyClass(a = a.longValue(), b = b, c = c, d = d.toString, e = e.toString)
}
This approach will fail in case of null values and also requires you to explicitly define the type of each single field. If you have to handle null values, you need to either discard all the rows containing null values by doing
dataframe.na.drop()
That will drop records even if the null fields are not the ones used in the pattern matching for your case class. Or, if you want to handle it, you could turn the Row object into a List and then use the option pattern:
case class MyClass(a: Long, b: String, c: Option[Int], d: String, e: String)
dataframe.map(_.toSeq.toList match {
case List(a: java.math.BigDecimal,
b: String,
c: Int,
d: String,
_: java.sql.Date,
e: java.sql.Date,
_: java.sql.Timestamp,
_: java.sql.Timestamp,
_: java.math.BigDecimal,
_: String) => MyClass(
a = a.longValue(), b = b, c = Option(c), d = d.toString, e = e.toString)
})
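Another hedged sketch for nullable columns: instead of pattern matching on the whole row, check each field with Row.isNullAt before reading it. The column positions below follow the pattern above and are assumptions about the schema, and the map needs the encoders from import spark.implicits._ in scope.
val ds = dataframe.map { row =>
  MyClass(
    a = row.getDecimal(0).longValue(),
    b = row.getString(1),
    c = if (row.isNullAt(2)) None else Some(row.getInt(2)),
    d = row.getString(3),
    e = row.getDate(5).toString
  )
}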
Check out this GitHub project, Sparkz, which will soon introduce a lot of libraries for simplifying the Spark and DataFrame APIs and making them more functional-programming oriented.

