Why is "Unable to find encoder for type stored in a Dataset" when creating a Dataset of a custom case class?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow the CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/38664972/


Why is "Unable to find encoder for type stored in a Dataset" when creating a dataset of custom case class?

Tags: scala, apache-spark, apache-spark-dataset, apache-spark-encoders

Asked by clay

Spark 2.0 (final) with Scala 2.11.8. The following super simple code yields the compilation error:

Error:(17, 45) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.


import org.apache.spark.sql.SparkSession

case class SimpleTuple(id: Int, desc: String)

object DatasetTest {
  val dataList = List(
    SimpleTuple(5, "abc"),
    SimpleTuple(6, "bcd")
  )

  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder
      .master("local")
      .appName("example")
      .getOrCreate()

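    // compile error here: no implicit Encoder[SimpleTuple] is in scope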
    val dataset = sparkSession.createDataset(dataList)
  }
}

Answered by zero323

Spark Datasets require an Encoder for the data type that is about to be stored. For common types (atomics, product types) a number of predefined encoders are available, but you have to import them first from SparkSession.implicits to make it work:


val sparkSession: SparkSession = ???
import sparkSession.implicits._
val dataset = sparkSession.createDataset(dataList)

Alternatively, you can directly provide an explicit


import org.apache.spark.sql.{Encoder, Encoders}

val dataset = sparkSession.createDataset(dataList)(Encoders.product[SimpleTuple])

or implicit


implicit val enc: Encoder[SimpleTuple] = Encoders.product[SimpleTuple]
val dataset = sparkSession.createDataset(dataList)

Encoder for the stored type.


Note that Encoders also provides a number of predefined Encoders for atomic types, and Encoders for complex ones can be derived with ExpressionEncoder.

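For illustration, here is a minimal sketch of both (assuming Spark 2.x, with the SimpleTuple case class from the question defined at the top level):

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// A predefined Encoder for an atomic type:
val intEncoder: Encoder[Int] = Encoders.scalaInt

// An Encoder for a complex (case class) type, derived with ExpressionEncoder:
implicit val tupleEncoder: Encoder[SimpleTuple] = ExpressionEncoder[SimpleTuple]()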


Answered by MrProper

For other users (yours is correct), note that it's also important that the case class is defined outside of the object scope. So:


Fails:


import org.apache.spark.sql.SparkSession

object DatasetTest {
  case class SimpleTuple(id: Int, desc: String)

  val dataList = List(
    SimpleTuple(5, "abc"),
    SimpleTuple(6, "bcd")
  )

  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder
      .master("local")
      .appName("example")
      .getOrCreate()
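    // fails to compile: Unable to find encoder for type stored in a Dataset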
    val dataset = sparkSession.createDataset(dataList)
  }
}

Adding the implicits still fails with the same error:


import org.apache.spark.sql.SparkSession

object DatasetTest {
  case class SimpleTuple(id: Int, desc: String)

  val dataList = List(
    SimpleTuple(5, "abc"),
    SimpleTuple(6, "bcd")
  )

  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder
      .master("local")
      .appName("example")
      .getOrCreate()

    import sparkSession.implicits._
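    // still fails to compile with the same encoder error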
    val dataset = sparkSession.createDataset(dataList)
  }
}

Works:


import org.apache.spark.sql.SparkSession

case class SimpleTuple(id: Int, desc: String)

object DatasetTest {
  val dataList = List(
    SimpleTuple(5, "abc"),
    SimpleTuple(6, "bcd")
  )

  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder
      .master("local")
      .appName("example")
      .getOrCreate()

    import sparkSession.implicits._
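    // compiles: SimpleTuple is top-level, so its Encoder can be derived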
    val dataset = sparkSession.createDataset(dataList)
  }
}

Here's the relevant bug: https://issues.apache.org/jira/browse/SPARK-13540, so hopefully it will be fixed in the next release of Spark 2.


(Edit: Looks like that bugfix is actually in Spark 2.0.0... So I'm not sure why this still fails).


Answered by clay

I'd clarify, with an answer to my own question, that if the goal is to define a simple literal Spark DataFrame, rather than using Scala tuples and implicit conversion, the simpler route is to use the Spark API directly, like this:


  import org.apache.spark.sql._
  import org.apache.spark.sql.types._
  import scala.collection.JavaConverters._

  val spark: SparkSession = ???  // an existing SparkSession

  val simpleSchema = StructType(
    StructField("a", StringType) ::
    StructField("b", IntegerType) ::
    StructField("c", IntegerType) ::
    StructField("d", IntegerType) ::
    StructField("e", IntegerType) :: Nil)

  val data = List(
    Row("001", 1, 0, 3, 4),
    Row("001", 3, 4, 1, 7),
    Row("001", null, 0, 6, 4),
    Row("003", 1, 4, 5, 7),
    Row("003", 5, 4, null, 2),
    Row("003", 4, null, 9, 2),
    Row("003", 2, 3, 0, 1)
  )

  val df = spark.createDataFrame(data.asJava, simpleSchema)
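A quick sanity check (hypothetical usage, assuming the snippet above runs with an active SparkSession):

  df.printSchema()  // five fields, all nullable, matching simpleSchema
  df.show()         // displays the seven rows, nulls included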