Spark Error: Unable to find encoder for type stored in a Dataset

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA terms, cite the original URL, and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/39517980/
Asked by HymanOrJones
I am using Spark on a Zeppelin notebook, and groupByKey() does not seem to be working.
This code:
df.groupByKey(row => row.getLong(0))
  .mapGroups((key, iterable) => println(key))
Gives me this error (presumably a compilation error, since it shows up in no time while the dataset I am working on is pretty big):
error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
I tried to add a case class and map all of my rows into it, but still got the same error:
import org.apache.spark.sql.Row
import spark.implicits._

case class DFRow(profileId: Long, jobId: String, state: String)

def getDFRow(row: Row): DFRow = {
  DFRow(row.getLong(row.fieldIndex("item0")),
        row.getString(row.fieldIndex("item1")),
        row.getString(row.fieldIndex("item2")))
}

df.map(getDFRow(_))
  .groupByKey(row => row.getLong(0))
  .mapGroups((key, iterable) => println(key))
The schema of my DataFrame is:
root
|-- item0: long (nullable = true)
|-- item1: string (nullable = true)
|-- item2: string (nullable = true)
Answered by zero323
You're trying to call mapGroups with a function of type (Long, Iterator[Row]) => Unit, and there is no Encoder for Unit (not that it would make sense to have one).
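As a minimal sketch (assuming spark.implicits._ is in scope and df is the DataFrame from the question), returning the key itself rather than Unit gives Spark a type it knows how to encode; if the goal is only to print, collect the keys to the driver first:

df.groupByKey(row => row.getLong(0))
  .mapGroups((key, _) => key)   // Long has an implicit encoder via spark.implicits._
  .collect()
  .foreach(println)             // side effects happen on the driver, not inside mapGroups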
In general, the parts of the Dataset API that are not focused on the SQL DSL (DataFrame => DataFrame, DataFrame => RelationalGroupedDataset, RelationalGroupedDataset => DataFrame, RelationalGroupedDataset => RelationalGroupedDataset) require either implicit or explicit encoders for the output values.
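For illustration, a sketch of supplying encoders explicitly instead of relying on spark.implicits._ (assuming the df from the question; Encoders is Spark's built-in factory of encoders):

import org.apache.spark.sql.{Encoders, Row}

df.groupByKey((row: Row) => row.getLong(0))(Encoders.scalaLong)   // explicit key encoder
  .mapGroups((key, rows) => (key, rows.size))(                    // count rows per key
    Encoders.tuple(Encoders.scalaLong, Encoders.scalaInt))        // explicit output encoder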
Since there are no predefined encoders for Row objects, using Dataset[Row] with methods designed for statically typed data doesn't make much sense. As a rule of thumb, you should always convert to the statically typed variant first:
df.as[(Long, String, String)]
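Putting it together, a sketch of the statically typed version of the original pipeline (assuming spark.implicits._ is imported and the schema shown in the question):

df.as[(Long, String, String)]
  .groupByKey(_._1)                               // key: Long
  .mapGroups((key, rows) => (key, rows.length))   // (Long, Int) is encodable
  .show()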
See also "Encoder error while trying to map dataframe row to updated row".

