Encoder for Row Type Spark Datasets

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, note the original URL and author information, and attribute it to the original authors (not me): StackOverflow

Original URL: http://stackoverflow.com/questions/43238693/
Asked by tsar2512
I would like to write an encoder for the Row type in a Dataset, for a map operation that I am doing. Essentially, I do not understand how to write encoders.
Below is an example of a map operation:
In the example below, instead of returning Dataset<String>, I would like to return Dataset<Row>
Dataset<String> output = dataset1.flatMap(new FlatMapFunction<Row, String>() {
    @Override
    public Iterator<String> call(Row row) throws Exception {
        ArrayList<String> obj = //some map operation
        return obj.iterator();
    }
}, Encoders.STRING());
I understand that, instead of a String encoder, an Encoder<Row> needs to be written as follows:
Encoder<Row> encoder = new Encoder<Row>() {
    @Override
    public StructType schema() {
        return join.schema();
    }

    @Override
    public ClassTag<Row> clsTag() {
        return null;
    }
};
However, I do not understand the clsTag() in the encoder, and I am trying to find a running example which can demonstrate something similar (i.e. an encoder for a Row type).
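For what it's worth, clsTag() can return a real tag instead of null: Scala's ClassTag factory object is callable from Java. A minimal sketch, using String.class for illustration (the same call with org.apache.spark.sql.Row.class yields the ClassTag<Row> the hand-written encoder above needs; this assumes the scala-library jar that ships with Spark is on the classpath):

```java
import scala.reflect.ClassTag;
import scala.reflect.ClassTag$;

public class ClassTagExample {
    public static void main(String[] args) {
        // ClassTag$.MODULE$.apply(...) builds a Scala ClassTag from the
        // Java side; Spark uses this tag internally when it needs to
        // create arrays of the encoded type.
        ClassTag<String> tag = ClassTag$.MODULE$.apply(String.class);
        System.out.println(tag.runtimeClass().getName()); // prints "java.lang.String"
    }
}
```

That said, returning a ClassTag alone is not enough to make a hand-written Encoder work; the RowEncoder approach in the accepted answer below is the practical route.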
Edit - This is not a copy of the question mentioned: Encoder error while trying to map dataframe row to updated row. That answer talks about using Spark 1.x in Spark 2.x (which I am not doing), and I am looking for an encoder for the Row class rather than a fix for an error. Finally, I was looking for a solution in Java, not in Scala.
Answered by tsar2512
The answer is to use a RowEncoder, together with the dataset's schema expressed as a StructType.
Below is a working example of a flatmap operation with Datasets:
StructType structType = new StructType();
structType = structType.add("id1", DataTypes.LongType, false);
structType = structType.add("id2", DataTypes.LongType, false);
ExpressionEncoder<Row> encoder = RowEncoder.apply(structType);

Dataset<Row> output = join.flatMap(new FlatMapFunction<Row, Row>() {
    @Override
    public Iterator<Row> call(Row row) throws Exception {
        // a static map operation to demonstrate
        List<Object> data = new ArrayList<>();
        data.add(1L);
        data.add(2L);
        ArrayList<Row> list = new ArrayList<>();
        list.add(RowFactory.create(data.toArray()));
        return list.iterator();
    }
}, encoder);
Answered by Jim Bob
I had the same problem... Encoders.kryo(Row.class) worked for me.
As a bonus, the Apache Spark tuning docs refer to Kryo since it is faster at serialization, "often as much as 10x":
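For context, switching a job's internal serialization to Kryo (as the tuning docs describe) is a configuration change rather than a code change. A sketch of the relevant spark-defaults.conf entries; the property names come from the Spark configuration docs, and the registrationRequired value shown is an illustrative choice, not a recommendation:

```
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.kryo.registrationRequired  false
```

Note that this is separate from Encoders.kryo(Row.class) above, which applies Kryo only to that one Dataset's encoder.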