Encoder for Row Type Spark Datasets

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, note the original URL and author information, and attribute it to the original authors (not me): StackOverflow

Original URL: http://stackoverflow.com/questions/43238693/
Asked by tsar2512
I would like to write an encoder for the Row type in a Dataset, for a map operation that I am doing. Essentially, I do not understand how to write encoders.
Below is an example of a map operation:
In the example below, instead of returning Dataset<String>, I would like to return Dataset<Row>
Dataset<String> output = dataset1.flatMap(new FlatMapFunction<Row, String>() {
    @Override
    public Iterator<String> call(Row row) throws Exception {
        ArrayList<String> obj = //some map operation
        return obj.iterator();
    }
}, Encoders.STRING());
I understand that, instead of a String encoder, an Encoder<Row> needs to be written as follows:
Encoder<Row> encoder = new Encoder<Row>() {
    @Override
    public StructType schema() {
        return join.schema();
    }

    @Override
    public ClassTag<Row> clsTag() {
        return null;
    }
};
However, I do not understand the clsTag() in the encoder, and I am trying to find a running example which can demonstrate something similar (i.e. an encoder for a Row type).
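For what it's worth, clsTag() can return a real tag instead of null: Scala's ClassTag factory object is callable from Java. A minimal sketch, using String.class for illustration (the same call with org.apache.spark.sql.Row.class yields the ClassTag<Row> the hand-written encoder above needs; this assumes the scala-library jar that ships with Spark is on the classpath):

```java
import scala.reflect.ClassTag;
import scala.reflect.ClassTag$;

public class ClassTagExample {
    public static void main(String[] args) {
        // ClassTag$.MODULE$.apply(...) builds a Scala ClassTag from the
        // Java side; Spark uses this tag internally when it needs to
        // create arrays of the encoded type.
        ClassTag<String> tag = ClassTag$.MODULE$.apply(String.class);
        System.out.println(tag.runtimeClass().getName()); // prints "java.lang.String"
    }
}
```

That said, returning a ClassTag alone is not enough to make a hand-written Encoder work; the RowEncoder approach in the accepted answer below is the practical route.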
Edit - This is not a copy of the question mentioned: Encoder error while trying to map dataframe row to updated row. That answer talks about using Spark 1.x in Spark 2.x (which I am not doing), and I am looking for an encoder for the Row class rather than a fix for an error. Finally, I was looking for a solution in Java, not in Scala.
Answered by tsar2512
The answer is to use a RowEncoder, together with the dataset's schema expressed as a StructType.
Below is a working example of a flatmap operation with Datasets:
StructType structType = new StructType();
structType = structType.add("id1", DataTypes.LongType, false);
structType = structType.add("id2", DataTypes.LongType, false);
ExpressionEncoder<Row> encoder = RowEncoder.apply(structType);

Dataset<Row> output = join.flatMap(new FlatMapFunction<Row, Row>() {
    @Override
    public Iterator<Row> call(Row row) throws Exception {
        // a static map operation to demonstrate
        List<Object> data = new ArrayList<>();
        data.add(1L);
        data.add(2L);
        ArrayList<Row> list = new ArrayList<>();
        list.add(RowFactory.create(data.toArray()));
        return list.iterator();
    }
}, encoder);
Answered by Jim Bob
I had the same problem... Encoders.kryo(Row.class) worked for me.
As a bonus, the Apache Spark tuning docs refer to Kryo since it is faster at serialization, "often as much as 10x":
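For context, switching a job's internal serialization to Kryo (as the tuning docs describe) is a configuration change rather than a code change. A sketch of the relevant spark-defaults.conf entries; the property names come from the Spark configuration docs, and the registrationRequired value shown is an illustrative choice, not a recommendation:

```
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.kryo.registrationRequired  false
```

Note that this is separate from Encoders.kryo(Row.class) above, which applies Kryo only to that one Dataset's encoder.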