Scala: How to convert an RDD[Row] back to a DataFrame

Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverFlow
Original question: http://stackoverflow.com/questions/37011267/

Asked by TheElysian
I've been playing around with converting RDDs to DataFrames and back again. First, I had an RDD of type (Int, Int) called dataPair. Then I created a DataFrame object with column headers using:
val dataFrame = dataPair.toDF(header(0), header(1))
Then I converted it from a DataFrame back to an RDD using:
val testRDD = dataFrame.rdd
which returns an RDD of type org.apache.spark.sql.Row (not (Int, Int)). Then I'd like to convert it back to a DataFrame using .toDF, but I get an error:
error: value toDF is not a member of org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
I've tried defining a schema of type Data(Int, Int) for testRDD, but I get type mismatch exceptions:
error: type mismatch;
found : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
required: org.apache.spark.rdd.RDD[Data]
val testRDD: RDD[Data] = dataFrame.rdd
^
I've already imported
import sqlContext.implicits._
Answered by Daniel de Paula
To create a DataFrame from an RDD of Rows, usually you have two main options:
1) You can use toDF(), which can be imported via import sqlContext.implicits._. However, this approach only works for the following types of RDDs:
RDD[Int]
RDD[Long]
RDD[String]
RDD[T <: scala.Product]
(source: Scaladoc of the SQLContext.implicits object)
The last signature actually means that it can work for an RDD of tuples or an RDD of case classes (because tuples and case classes are subclasses of scala.Product).
So, to use this approach for an RDD[Row], you have to map it to an RDD[T <: scala.Product]. This can be done by mapping each row to a custom case class or to a tuple, as in the following code snippets:
val df = rdd.map({
  case Row(val1: String, ..., valN: Long) => (val1, ..., valN)
}).toDF("col1_name", ..., "colN_name")
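Concretely, for the question's two-Int rows, a minimal sketch of the tuple version might look like this (testRDD is the RDD[Row] from the question; the column names here are made up):

import org.apache.spark.sql.Row
import sqlContext.implicits._

val tupleRDD = testRDD.map {
  // extract both Int columns of each Row into a plain tuple
  case Row(a: Int, b: Int) => (a, b)
}
val restoredDF = tupleRDD.toDF("col1", "col2")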
or
case class MyClass(val1: String, ..., valN: Long = 0L)
val df = rdd.map({
  case Row(val1: String, ..., valN: Long) => MyClass(val1, ..., valN)
}).toDF("col1_name", ..., "colN_name")
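The corresponding case-class sketch for the same two-Int rows could be (the Data class and its field names are hypothetical; toDF() with no arguments then takes the column names from the case class fields):

import org.apache.spark.sql.Row
import sqlContext.implicits._

case class Data(first: Int, second: Int)

val dataRDD = testRDD.map {
  // pattern match on the Row and rebuild each record as a case class
  case Row(a: Int, b: Int) => Data(a, b)
}
val restoredDF = dataRDD.toDF()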
The main drawback of this approach (in my opinion) is that you have to explicitly set the schema of the resulting DataFrame in the map function, column by column. Maybe this can be done programmatically if you don't know the schema in advance, but things can get a little messy there. So, alternatively, there is another option:
2) You can use createDataFrame(rowRDD: RDD[Row], schema: StructType), which is available in the SQLContext object. Example:
val df = oldDF.sqlContext.createDataFrame(rdd, oldDF.schema)
Note that there is no need to explicitly set any schema column. We reuse the old DF's schema, which is of the StructType class and can be easily extended. However, this approach sometimes is not possible, and in some cases can be less efficient than the first one.
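When there is no existing DataFrame whose schema can be reused, the StructType can also be built by hand. A sketch for the question's two-Int rows (the field names are, again, made up):

import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// describe the two Int columns explicitly instead of reusing oldDF.schema
val schema = StructType(Seq(
  StructField("col1", IntegerType, nullable = false),
  StructField("col2", IntegerType, nullable = false)
))

val df = sqlContext.createDataFrame(testRDD, schema)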
I hope it's clearer than before. Cheers.

