How to convert a dataframe to a dataset in Apache Spark in Scala?

Disclaimer: this page is a Chinese–English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must likewise follow the CC BY-SA license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/44516627/

Date: 2020-10-22 09:18:12  Source: igfitidea

How to convert a dataframe to dataset in Apache Spark in Scala?

Tags: scala, apache-spark, apache-spark-sql, apache-spark-encoders

Asked by stefanobaghino

I need to convert my dataframe to a dataset and I used the following code:

    val final_df = Dataframe.withColumn(
      "features",
      toVec4(
        // casting into Timestamp to parse the string, and then into Int
        $"time_stamp_0".cast(TimestampType).cast(IntegerType),
        $"count",
        $"sender_ip_1",
        $"receiver_ip_2"
      )
    ).withColumn("label", (Dataframe("count"))).select("features", "label")

    final_df.show()

    val trainingTest = final_df.randomSplit(Array(0.3, 0.7))
    val TrainingDF = trainingTest(0)
    val TestingDF=trainingTest(1)
    TrainingDF.show()
    TestingDF.show()

    // let's create our linear regression
    val lir = new LinearRegression()
      .setRegParam(0.3)
      .setElasticNetParam(0.8)
      .setMaxIter(100)
      .setTol(1E-6)

    case class df_ds(features:Vector, label:Integer)
    org.apache.spark.sql.catalyst.encoders.OuterScopes.addOuterScope(this)

    val Training_ds = TrainingDF.as[df_ds]

My problem is that I get the following error:

Error:(96, 36) Unable to find encoder for type stored in a Dataset.  Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._  Support for serializing other types will be added in future releases.
    val Training_ds = TrainingDF.as[df_ds]

It seems that the number of values in the dataframe is different from the number of values in my class. However, I am using case class df_ds(features: Vector, label: Integer) on my TrainingDF dataframe, since it has a vector of features and an integer label. Here is the TrainingDF dataframe:

+--------------------+-----+
|            features|label|
+--------------------+-----+
|[1.497325796E9,19...|   19|
|[1.497325796E9,19...|   19|
|[1.497325796E9,19...|   19|
|[1.497325796E9,19...|   19|
|[1.497325796E9,19...|   19|
|[1.497325796E9,19...|   19|
|[1.497325796E9,19...|   19|
|[1.497325796E9,19...|   19|
|[1.497325796E9,19...|   19|
|[1.497325796E9,19...|   19|
|[1.497325796E9,19...|   19|
|[1.497325796E9,19...|   19|
|[1.497325796E9,19...|   19|
|[1.497325796E9,19...|   19|
|[1.497325796E9,19...|   19|
|[1.497325796E9,19...|   19|
|[1.497325796E9,19...|   19|
|[1.497325796E9,19...|   19|
|[1.497325796E9,19...|   19|
|[1.497325796E9,10...|   10|
+--------------------+-----+

Also, here is my original final_df dataframe:

+------------+-----------+-------------+-----+
|time_stamp_0|sender_ip_1|receiver_ip_2|count|
+------------+-----------+-------------+-----+
|    05:49:56|   10.0.0.2|     10.0.0.3|   19|
|    05:49:56|   10.0.0.2|     10.0.0.3|   19|
|    05:49:56|   10.0.0.2|     10.0.0.3|   19|
|    05:49:56|   10.0.0.2|     10.0.0.3|   19|
|    05:49:56|   10.0.0.2|     10.0.0.3|   19|
|    05:49:56|   10.0.0.2|     10.0.0.3|   19|
|    05:49:56|   10.0.0.2|     10.0.0.3|   19|
|    05:49:56|   10.0.0.2|     10.0.0.3|   19|
|    05:49:56|   10.0.0.2|     10.0.0.3|   19|
|    05:49:56|   10.0.0.2|     10.0.0.3|   19|
|    05:49:56|   10.0.0.2|     10.0.0.3|   19|
|    05:49:56|   10.0.0.2|     10.0.0.3|   19|
|    05:49:56|   10.0.0.2|     10.0.0.3|   19|
|    05:49:56|   10.0.0.2|     10.0.0.3|   19|
|    05:49:56|   10.0.0.2|     10.0.0.3|   19|
|    05:49:56|   10.0.0.2|     10.0.0.3|   19|
|    05:49:56|   10.0.0.2|     10.0.0.3|   19|
|    05:49:56|   10.0.0.2|     10.0.0.3|   19|
|    05:49:56|   10.0.0.2|     10.0.0.3|   19|
|    05:49:56|   10.0.0.3|     10.0.0.2|   10|
+------------+-----------+-------------+-----+

However, I get the error mentioned above. Can anybody help me? Thanks in advance.

Answered by stefanobaghino

The error message you are reading is a pretty good pointer.

When you convert a DataFrame to a Dataset you have to have a proper Encoder for whatever is stored in the DataFrame rows.

Encoders for primitive-like types (Ints, Strings, and so on) and case classes are provided by simply importing the implicits for your SparkSession, as follows:

case class MyData(intField: Int, boolField: Boolean) // e.g.

val spark: SparkSession = ???
val df: DataFrame = ???

import spark.implicits._

val ds: Dataset[MyData] = df.as[MyData]

If that doesn't work either, it is because the type you are trying to cast the DataFrame to isn't supported. In that case, you would have to write your own Encoder: you may find more information about it here and see an example (the Encoder for java.time.LocalDateTime) here.
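
As a rough sketch of that fallback (assuming Spark 2.x), an otherwise unsupported type such as java.time.LocalDateTime can be given a kryo-based Encoder. Note that this stores values as a single opaque binary column, so you lose column pruning and the ability to query the field with SQL:

```scala
import java.time.LocalDateTime
import org.apache.spark.sql.{Encoder, Encoders}

// Kryo-based fallback for a type with no built-in Encoder.
// Values are serialized into one binary column, so SQL expressions
// and column pruning on this field are unavailable.
implicit val localDateTimeEncoder: Encoder[LocalDateTime] =
  Encoders.kryo[LocalDateTime]
```

With this implicit in scope, a Dataset[LocalDateTime] (or a Dataset of a type containing one, given a matching product encoder) can be created the same way as above.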

Answered by Shang Gao

Spark 1.6.0

case class MyCase(id: Int, name: String)

val encoder = org.apache.spark.sql.catalyst.encoders.ExpressionEncoder[MyCase]

val dataframe = …

val dataset = dataframe.as(encoder)

Spark 2.0 or above

case class MyCase(id: Int, name: String)

val encoder = org.apache.spark.sql.Encoders.product[MyCase]

val dataframe = …

val dataset = dataframe.as(encoder)
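
Applied to the question's code, the Spark 2.0+ route might look like the following sketch (an assumption-laden illustration: it presumes the features column holds an org.apache.spark.ml.linalg.Vector, and TrainingDF stands in for the dataframe shown in the question):

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Define the case class at top level (not inside a method), so Spark's
// encoder machinery can resolve it without the addOuterScope workaround.
case class df_ds(features: Vector, label: Int)

val spark: SparkSession = SparkSession.builder().getOrCreate()
import spark.implicits._ // brings the product encoder for df_ds into scope

val TrainingDF: DataFrame = ??? // the "features"/"label" dataframe above
val Training_ds: Dataset[df_ds] = TrainingDF.as[df_ds]
```

Two common causes of the "unable to find encoder" error in the question are defining the case class inside a method (hence the addOuterScope call) and importing the older org.apache.spark.mllib.linalg.Vector instead of the org.apache.spark.ml one; making sure both the case class location and the Vector import match the sketch above is worth checking first.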