Spark Error: Unable to find encoder for type stored in a Dataset

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA terms, cite the original URL, and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/39517980/
Asked by HymanOrJones
I am using Spark on a Zeppelin notebook, and groupByKey() does not seem to be working.
This code:
df.groupByKey(row => row.getLong(0))
  .mapGroups((key, iterable) => println(key))
Gives me this error (presumably a compilation error, since it shows up in no time while the dataset I am working on is pretty big):
error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
I tried to add a case class and map all of my rows into it, but still got the same error:
import org.apache.spark.sql.Row
import spark.implicits._

case class DFRow(profileId: Long, jobId: String, state: String)

def getDFRow(row: Row): DFRow = {
  DFRow(row.getLong(row.fieldIndex("item0")),
        row.getString(row.fieldIndex("item1")),
        row.getString(row.fieldIndex("item2")))
}

df.map(getDFRow(_))
  .groupByKey(row => row.getLong(0))
  .mapGroups((key, iterable) => println(key))
The schema of my DataFrame is:
root
|-- item0: long (nullable = true)
|-- item1: string (nullable = true)
|-- item2: string (nullable = true)
Answered by zero323
You're trying to call mapGroups with a function of type (Long, Iterator[Row]) => Unit, and there is no Encoder for Unit (not that it would make sense to have one).
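As a minimal sketch (assuming spark.implicits._ is in scope and df is the DataFrame from the question), returning the key itself rather than Unit gives Spark a type it knows how to encode; if the goal is only to print, collect the keys to the driver first:

df.groupByKey(row => row.getLong(0))
  .mapGroups((key, _) => key)   // Long has an implicit encoder via spark.implicits._
  .collect()
  .foreach(println)             // side effects happen on the driver, not inside mapGroups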
In general, the parts of the Dataset API that are not focused on the SQL DSL (DataFrame => DataFrame, DataFrame => RelationalGroupedDataset, RelationalGroupedDataset => DataFrame, RelationalGroupedDataset => RelationalGroupedDataset) require either implicit or explicit encoders for the output values.
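For illustration, a sketch of supplying encoders explicitly instead of relying on spark.implicits._ (assuming the df from the question; Encoders is Spark's built-in factory of encoders):

import org.apache.spark.sql.{Encoders, Row}

df.groupByKey((row: Row) => row.getLong(0))(Encoders.scalaLong)   // explicit key encoder
  .mapGroups((key, rows) => (key, rows.size))(                    // count rows per key
    Encoders.tuple(Encoders.scalaLong, Encoders.scalaInt))        // explicit output encoder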
Since there are no predefined encoders for Row objects, using Dataset[Row] with methods designed for statically typed data doesn't make much sense. As a rule of thumb, you should always convert to the statically typed variant first:
df.as[(Long, String, String)]
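Putting it together, a sketch of the statically typed version of the original pipeline (assuming spark.implicits._ is imported and the schema shown in the question):

df.as[(Long, String, String)]
  .groupByKey(_._1)                               // key: Long
  .mapGroups((key, rows) => (key, rows.length))   // (Long, Int) is encodable
  .show()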
See also "Encoder error while trying to map dataframe row to updated row".

