Why is "Unable to find encoder for type stored in a Dataset" when creating a Dataset of a custom case class?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow the CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/38664972/


Why is "Unable to find encoder for type stored in a Dataset" when creating a dataset of custom case class?

Tags: scala, apache-spark, apache-spark-dataset, apache-spark-encoders

Asked by clay

Spark 2.0 (final) with Scala 2.11.8. The following super simple code yields the compilation error:

Error:(17, 45) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.


import org.apache.spark.sql.SparkSession

case class SimpleTuple(id: Int, desc: String)

object DatasetTest {
  val dataList = List(
    SimpleTuple(5, "abc"),
    SimpleTuple(6, "bcd")
  )

  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder
      .master("local")
      .appName("example")
      .getOrCreate()

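    // compile error here: no implicit Encoder[SimpleTuple] is in scope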
    val dataset = sparkSession.createDataset(dataList)
  }
}

Answered by zero323

Spark Datasets require an Encoder for the data type that is about to be stored. For common types (atomics, product types) a number of predefined encoders are available, but you have to import them first from SparkSession.implicits to make it work:


val sparkSession: SparkSession = ???
import sparkSession.implicits._
val dataset = sparkSession.createDataset(dataList)

Alternatively, you can directly provide an explicit


import org.apache.spark.sql.{Encoder, Encoders}

val dataset = sparkSession.createDataset(dataList)(Encoders.product[SimpleTuple])

or implicit


implicit val enc: Encoder[SimpleTuple] = Encoders.product[SimpleTuple]
val dataset = sparkSession.createDataset(dataList)

Encoder for the stored type.


Note that Encoders also provides a number of predefined Encoders for atomic types, and Encoders for complex ones can be derived with ExpressionEncoder.

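For illustration, here is a minimal sketch of both (assuming Spark 2.x, with the SimpleTuple case class from the question defined at the top level):

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// A predefined Encoder for an atomic type:
val intEncoder: Encoder[Int] = Encoders.scalaInt

// An Encoder for a complex (case class) type, derived with ExpressionEncoder:
implicit val tupleEncoder: Encoder[SimpleTuple] = ExpressionEncoder[SimpleTuple]()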


Answered by MrProper

For other users (yours is correct), note that it's also important that the case class is defined outside of the object scope. So:


Fails:


import org.apache.spark.sql.SparkSession

object DatasetTest {
  case class SimpleTuple(id: Int, desc: String)

  val dataList = List(
    SimpleTuple(5, "abc"),
    SimpleTuple(6, "bcd")
  )

  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder
      .master("local")
      .appName("example")
      .getOrCreate()
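    // fails to compile: Unable to find encoder for type stored in a Dataset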
    val dataset = sparkSession.createDataset(dataList)
  }
}

Adding the implicits still fails with the same error:


import org.apache.spark.sql.SparkSession

object DatasetTest {
  case class SimpleTuple(id: Int, desc: String)

  val dataList = List(
    SimpleTuple(5, "abc"),
    SimpleTuple(6, "bcd")
  )

  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder
      .master("local")
      .appName("example")
      .getOrCreate()

    import sparkSession.implicits._
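    // still fails to compile with the same encoder error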
    val dataset = sparkSession.createDataset(dataList)
  }
}

Works:


import org.apache.spark.sql.SparkSession

case class SimpleTuple(id: Int, desc: String)

object DatasetTest {
  val dataList = List(
    SimpleTuple(5, "abc"),
    SimpleTuple(6, "bcd")
  )

  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder
      .master("local")
      .appName("example")
      .getOrCreate()

    import sparkSession.implicits._
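    // compiles: SimpleTuple is top-level, so its Encoder can be derived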
    val dataset = sparkSession.createDataset(dataList)
  }
}

Here's the relevant bug: https://issues.apache.org/jira/browse/SPARK-13540, so hopefully it will be fixed in the next release of Spark 2.


(Edit: Looks like that bugfix is actually in Spark 2.0.0... So I'm not sure why this still fails).


Answered by clay

I'd clarify, with an answer to my own question, that if the goal is to define a simple literal Spark DataFrame, rather than using Scala tuples and implicit conversion, the simpler route is to use the Spark API directly, like this:


  import org.apache.spark.sql._
  import org.apache.spark.sql.types._
  import scala.collection.JavaConverters._

  val spark: SparkSession = ???  // an existing SparkSession

  val simpleSchema = StructType(
    StructField("a", StringType) ::
    StructField("b", IntegerType) ::
    StructField("c", IntegerType) ::
    StructField("d", IntegerType) ::
    StructField("e", IntegerType) :: Nil)

  val data = List(
    Row("001", 1, 0, 3, 4),
    Row("001", 3, 4, 1, 7),
    Row("001", null, 0, 6, 4),
    Row("003", 1, 4, 5, 7),
    Row("003", 5, 4, null, 2),
    Row("003", 4, null, 9, 2),
    Row("003", 2, 3, 0, 1)
  )

  val df = spark.createDataFrame(data.asJava, simpleSchema)
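A quick sanity check (hypothetical usage, assuming the snippet above runs with an active SparkSession):

  df.printSchema()  // five fields, all nullable, matching simpleSchema
  df.show()         // displays the seven rows, nulls included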