Convert Avro file to JSON with reader schema

Source: http://stackoverflow.com/questions/47963172/ (CC BY-SA 4.0; attribute to the original StackOverflow authors)

Tags: java, json, avro

Asked by Régis B.

I would like to deserialize Avro data on the command line with a reader schema that is different from the writer schema. I can specify the writer schema during serialization, but not during deserialization.

record.json (data file):

{"test1": 1, "test2": 2}

writer.avsc (writer schema):

{
    "type": "record",
    "name": "pouac",
    "fields": [
        {
            "name": "test1",
            "type": "int"
        },
        {
            "name": "test2",
            "type": "int"
        }
    ]
}

reader.avsc (reader schema):

{
    "type": "record",
    "name": "pouac",
    "fields": [
        {
            "name": "test2",
            "type": "int",
            "aliases": ["test1"]
        }
    ]
}

Serializing data:

$ java -jar avro-tools-1.8.2.jar fromjson --schema-file writer.avsc record.json > record.avro

For deserializing data, I tried the following:

$ java -jar avro-tools-1.8.2.jar tojson --schema-file reader.avsc record.avro
Exception in thread "main" joptsimple.UnrecognizedOptionException: 'schema-file' is not a recognized option
...

I'm looking primarily for a command-line instruction because I'm not so comfortable writing Java code, but I'd be happy with Java code to compile myself. Actually, what I'm interested in is the exact deserialization result. (The more fundamental issue at stake is described in this conversation on a fastavro PR that I opened to implement aliases.)

Answered by Nathan

The avro-tools tojson target is only meant as a dump tool for translating a binary-encoded Avro file to JSON. The schema always accompanies the records in the Avro file, as outlined in the link below. As a result it cannot be overridden by avro-tools.

http://avro.apache.org/docs/1.8.2/#compare
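
For example, you can confirm that the writer schema travels with the data by printing the schema embedded in the container file with avro-tools' getschema target:

java -jar avro-tools-1.8.2.jar getschema record.avro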

I am not aware of a stand-alone tool that can be used to achieve what you want. I think you'll need to do some programming to achieve the desired results. Avro has many supported languages, including Python, but the capabilities across languages are not uniform. Java is, in my experience, the most advanced. As an example, Python lacks the ability to specify a reader schema on the DataFileReader, which would help achieve what you want:

https://github.com/apache/avro/blob/master/lang/py/src/avro/datafile.py#L224

The closest you can get in Python is the following:

import avro.schema as avsc
import avro.datafile as avdf
import avro.io as avio

reader_schema = avsc.parse(open("reader.avsc", "rb").read())

# need the ability to inject the reader schema as a 3rd arg here
with avdf.DataFileReader(open("record.avro", "rb"), avio.DatumReader()) as reader:
    for record in reader:
        print(record)
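
If compiling a little Java is acceptable, the Java API does let you supply a reader schema: GenericDatumReader accepts an expected (reader) schema, and DataFileReader fills in the writer schema from the container file's header. Below is a minimal sketch with the file names borrowed from the question (the class name is mine); note that for the exact schemas above the read may still fail, per the alias/validation discussion that follows.

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

import java.io.File;

public class ToJsonWithReaderSchema {
    public static void main(String[] args) throws Exception {
        // The reader schema is supplied explicitly; the writer schema is
        // read from the header of record.avro by DataFileReader.
        Schema readerSchema = new Schema.Parser().parse(new File("reader.avsc"));
        GenericDatumReader<GenericRecord> datumReader =
                new GenericDatumReader<>(null, readerSchema);

        try (DataFileReader<GenericRecord> fileReader =
                     new DataFileReader<>(new File("record.avro"), datumReader)) {
            for (GenericRecord record : fileReader) {
                System.out.println(record); // GenericRecord#toString prints JSON-like output
            }
        }
    }
}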

In terms of the schemas and the data you've outlined, the expected behaviour should be undefined and therefore emit an error.

This behaviour can be verified with the following Java code:

package ca.junctionbox.soavro;

import org.apache.avro.Schema;
import org.apache.avro.SchemaValidationException;
import org.apache.avro.SchemaValidator;
import org.apache.avro.SchemaValidatorBuilder;

import java.util.ArrayList;

public class Main {
    public static final String V1 = "{\n" +
            "    \"type\": \"record\",\n" +
            "    \"name\": \"pouac\",\n" +
            "    \"fields\": [\n" +
            "        {\n" +
            "            \"name\": \"test1\",\n" +
            "            \"type\": \"int\"\n" +
            "        },\n" +
            "        {\n" +
            "            \"name\": \"test2\",\n" +
            "            \"type\": \"int\"\n" +
            "        }\n" +
            "    ]\n" +
            "}";

    public static final String V2 = "{\n" +
            "    \"type\": \"record\",\n" +
            "    \"name\": \"pouac\",\n" +
            "    \"fields\": [{\n" +
            "        \"name\": \"test2\",\n" +
            "         \"type\": \"int\",\n" +
            "         \"aliases\": [\"test1\"]\n" +
            "    }]\n" +
            "}";

    public static void main(final String[] args) {
        final SchemaValidator sv = new SchemaValidatorBuilder()
                .canBeReadStrategy()
                .validateAll();
        final Schema sv1 = new Schema.Parser().parse(V1);
        final Schema sv2 = new Schema.Parser().parse(V2);
        final ArrayList<Schema> existing = new ArrayList<>();
        existing.add(sv1);

        try {
            sv.validate(sv2, existing);
            System.out.println("Good to go!");
        } catch (SchemaValidationException e) {
            e.printStackTrace();
        }
    }
}
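
Judging by the stack trace below, this was run via the Maven exec plugin; assuming a standard Maven project with the Avro dependency on the classpath, something along these lines should reproduce it:

mvn compile exec:java -Dexec.mainClass=ca.junctionbox.soavro.Main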

This yields the following output:

org.apache.avro.SchemaValidationException: Unable to read schema: 
{
  "type" : "record",
  "name" : "pouac",
  "fields" : [ {
    "name" : "test2",
    "type" : "int",
    "aliases" : [ "test1" ]
  } ]
}
using schema:
{
  "type" : "record",
  "name" : "pouac",
  "fields" : [ {
    "name" : "test1",
    "type" : "int"
  }, {
    "name" : "test2",
    "type" : "int"
  } ]
}
    at org.apache.avro.ValidateMutualRead.canRead(ValidateMutualRead.java:70)
    at org.apache.avro.ValidateCanBeRead.validate(ValidateCanBeRead.java:39)
    at org.apache.avro.ValidateAll.validate(ValidateAll.java:51)
    at ca.junctionbox.soavro.Main.main(Main.java:47)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.mojo.exec.ExecJavaMojo.run(ExecJavaMojo.java:294)
    at java.lang.Thread.run(Thread.java:748)

Aliases are typically used for backwards compatibility in schema evolution, allowing mappings from disparate/legacy keys to a common key name. Given that your writer schema doesn't treat the test1 and test2 fields as "optional" through the use of unions, I can't see in what scenario you'd want this transformation. If you want to "drop" the test1 field, that can be achieved by excluding it from the v2 schema specification; any reader that can apply a reader schema would then ignore test1 using the v2 schema definition, as shown below.
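
For example, a v2 reader schema that simply omits test1 (no alias needed) would cause readers to skip that field:

{
  "type": "record",
  "name": "pouac",
  "fields": [{
    "name": "test2",
    "type": "int"
  }]
}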

To illustrate what I mean by evolution:

v1 schema

{
  "type": "record",
  "name": "pouac",
  "fields": [
    {
        "name": "test1",
        "type": "int"
    }]
}

v2 schema

{
  "type": "record",
  "name": "pouac",
  "fields": [
    {
        "name": "test2",
        "type": "int",
        "aliases": ["test1"]
    }]
}

You could have terabytes of data in the v1 format and introduce the v2 format, which renames the test1 field to test2. The alias would allow you to perform map-reduce jobs, Hive queries, etc. on both v1 and v2 data without proactively rewriting all the old v1 data first. Note this assumes there is no change in the type or the semantic meaning of the fields.
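
As a rough, self-contained Java sketch of that evolution (class and variable names are mine, not from the original answer): serialize a record with the v1 schema, then deserialize it with v2 as the reader schema, letting the alias map test1 onto test2.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

import java.io.ByteArrayOutputStream;

public class AliasEvolutionDemo {
    static final String V1 = "{\"type\": \"record\", \"name\": \"pouac\", \"fields\": ["
            + "{\"name\": \"test1\", \"type\": \"int\"}]}";
    static final String V2 = "{\"type\": \"record\", \"name\": \"pouac\", \"fields\": ["
            + "{\"name\": \"test2\", \"type\": \"int\", \"aliases\": [\"test1\"]}]}";

    public static void main(String[] args) throws Exception {
        Schema v1 = new Schema.Parser().parse(V1);
        Schema v2 = new Schema.Parser().parse(V2);

        // Serialize one record using the v1 (writer) schema.
        GenericRecord written = new GenericData.Record(v1);
        written.put("test1", 1);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(v1).write(written, encoder);
        encoder.flush();

        // Deserialize using v2 as the reader schema; the alias maps the
        // writer's test1 field onto the reader's test2 field.
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord read = new GenericDatumReader<GenericRecord>(v1, v2).read(null, decoder);
        System.out.println(read); // expected: {"test2": 1}
    }
}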

Answered by ddarellis

You can run java -jar avro-tools-1.8.2.jar tojson to see the help; it tells you that you can use this command like:

java -jar avro-tools-1.8.2.jar tojson record.avro > tost.json

and this will output to the file:

{"test1":1,"test2":2}

You can also call it with the --pretty argument:

java -jar avro-tools-1.8.2.jar tojson --pretty record.avro > tost.json

and the output will be pretty-printed:

{
  "test1" : 1,
  "test2" : 2
}