Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/20556324/

What is the advantage of storing schema in avro?

Tags: java, apache, hadoop, solr, avro

Asked by user2250246

We need to serialize some data for putting into solr as well as hadoop.

I am evaluating serialization tools for this purpose.

The top two in my list are Gson and Avro.

As far as I understand, Avro = Gson + Schema-In-JSON

If that is correct, I do not see why Avro is so popular for Solr/Hadoop.

I have searched a lot on the Internet, but cannot find a single correct answer for this.

Everywhere I look, it says Avro is good because it stores the schema. My question is: what do I do with that schema?

It may be good for very large objects in Hadoop where a single object is stored in multiple file blocks such that storing schema with each part helps to analyze it better. But even in that case, schema can be stored separately and just a reference to that is sufficient to describe the schema. I see no reason why schema should be part of each and every piece.

If someone can give me a good use case where Avro helped them and Gson/Jackson were insufficient for the purpose, it would be really helpful.

Also, official documentation at the Avro site says that we need to give a schema to Avro to help it produce Schema+Data. My question is, if schema is input and the same is sent to output along with JSON representation of data, then what extra is being achieved by Avro? Can I not do that myself by serializing an object using JSON, adding my input schema and calling it Avro?

I am really confused by this!

Answered by Vishal John

  1. Evolving schemas

Suppose you initially designed a schema like this for your Employee class:

{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "emp_name", "type": "string"},
    {"name": "dob", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}

Later you realized that age is redundant and removed it from the schema.

{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "emp_name", "type": "string"},
    {"name": "dob", "type": "string"}
  ]
}

What about the records that were serialized and stored before this schema change? How will you read those records back?

That's why the Avro reader/deserializer asks for both the reader schema and the writer schema. Internally it performs schema resolution, i.e. it tries to adapt the old schema to the new schema.

See the section "Resolution using action symbols" at http://avro.apache.org/docs/1.7.2/api/java/org/apache/avro/io/parsing/doc-files/parsing.html

In this case it performs a skip action, i.e. it skips over "age" while reading. It can also handle cases such as a field changing from int to long.
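
To make the skip and promotion steps concrete, here is a minimal plain-Java sketch of the idea. It is a toy model, not the Avro API: the class name, the `schema` helper, and the map-based "schemas" are all assumptions made up for illustration. In the real Java library, the comparable work happens inside a `GenericDatumReader` that is constructed with both a writer schema and a reader schema.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of schema resolution. This is NOT the Avro API.
class SchemaResolutionSketch {

    // Hypothetical helper: builds an ordered field-name -> type map from pairs.
    static Map<String, String> schema(String... nameTypePairs) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < nameTypePairs.length; i += 2) {
            fields.put(nameTypePairs[i], nameTypePairs[i + 1]);
        }
        return fields;
    }

    // Values are stored positionally in writer-schema order, without field
    // names, so they can only be interpreted with the writer schema at hand.
    static Map<String, Object> resolve(Map<String, String> writer,
                                       Map<String, String> reader,
                                       List<Object> values) {
        Map<String, Object> record = new LinkedHashMap<>();
        int position = 0;
        for (Map.Entry<String, String> field : writer.entrySet()) {
            Object value = values.get(position++);
            String readerType = reader.get(field.getKey());
            if (readerType == null) {
                continue; // "skip" action: the reader no longer wants this field
            }
            if (field.getValue().equals("int") && readerType.equals("long")) {
                value = ((Integer) value).longValue(); // promote int to long
            }
            record.put(field.getKey(), value);
        }
        return record;
    }

    public static void main(String[] args) {
        Map<String, String> oldSchema = schema("emp_name", "string", "dob", "string", "age", "int");
        Map<String, String> newSchema = schema("emp_name", "string", "dob", "string");
        List<Object> stored = Arrays.<Object>asList("Ann", "1990-01-01", 30);
        // The old record is read through the new schema; "age" is dropped.
        System.out.println(resolve(oldSchema, newSchema, stored));
    }
}
```

The sketch deliberately ignores fields that exist only in the reader schema; a real resolver would also need a rule, such as default values, for those.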

This is a very nice article explaining schema evolution - http://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html

  2. Schema is stored only once for multiple records in a single file.

  3. Size: the data is encoded in very few bytes.
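
The size point can be illustrated with a hand-rolled sketch of the binary layout. It follows the rules of Avro's binary encoding as I understand them (zig-zag variable-length integers, strings as a length followed by UTF-8 bytes, record fields written back to back with no names), but it is not the Avro library, and the class and method names are assumptions for illustration. Because the field names live in the schema, which a data file carries once in its header, they are not repeated per record the way JSON keys are.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hand-rolled sketch of a compact binary layout; NOT the Avro library itself.
class CompactEncodingSketch {

    // Zig-zag maps small negative and positive numbers to small unsigned ones;
    // the result is then written 7 bits at a time, low-order group first.
    static void writeLong(ByteArrayOutputStream out, long n) {
        long z = (n << 1) ^ (n >> 63);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80)); // high bit set: more bytes follow
            z >>>= 7;
        }
        out.write((int) z);
    }

    // A string is its UTF-8 byte length followed by the bytes themselves.
    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeLong(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // Fields are written in schema order; no field names, no delimiters.
    static byte[] encodeEmployee(String empName, String dob, int age) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, empName);
        writeString(out, dob);
        writeLong(out, age);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] binary = encodeEmployee("Ann", "1990-01-01", 30);
        String json = "{\"emp_name\":\"Ann\",\"dob\":\"1990-01-01\",\"age\":30}";
        System.out.println("binary bytes: " + binary.length);
        System.out.println("JSON bytes:   " + json.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

For this sample record the binary form is 16 bytes, and the saving grows with the number of records because the JSON text repeats every field name each time.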

Answered by user2250246

I think one of the key problems solved by schema evolution is not mentioned explicitly anywhere, and that is why it causes so much confusion for newcomers.

An example will clarify this:

Let us say a bank stores an audit log of all its transactions. The logs have a particular format and need to be stored for at least 10 years. It is also highly desirable that the system holding these logs can adapt as the format evolves over those 10 years.

The schema for such entries would not change too often, let us say twice a year on average, but each schema would have a large number of entries. If we do not keep track of the schemas, then after a while we will need to consult very old code to figure out which fields were present at that time, and keep adding if-else statements to process the different formats. With a schema store holding all these formats, we can use the schema-evolution feature to convert one format into another (Avro does this automatically if you provide it with the older and the newer schema). This saves applications from adding a lot of if-else statements to their code, and it also makes the system more manageable, because we can readily see every format we have by looking at the set of stored schemas. Schemas are generally kept in separate storage, and the data only carries an ID pointing to its schema.

Another advantage of schema evolution is that producers of a new format can safely produce objects with the new schema without waiting for the downstream consumers to change first. The downstream consumers can have logic built in to simply suspend processing until they can see the new schema associated with the new format. This automatic suspension is a good way to keep the system online while the processing logic is adapted to the new schema.
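
A rough sketch of that consumer-side logic follows. Everything in it is hypothetical: the class, the in-memory map standing in for an external schema store, the integer schema IDs, and the choice to resume as soon as any schema is registered are assumptions for illustration, not part of Avro or of any particular registry product.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical consumer that pauses when it meets a schema ID it does not know.
class SchemaAwareConsumerSketch {

    enum Outcome { PROCESSED, SUSPENDED }

    // Stand-in for an external schema store: schema ID -> schema JSON text.
    private final Map<Integer, String> knownSchemas = new HashMap<>();
    private boolean suspended = false;

    // Called when a schema becomes visible to this consumer.
    void registerSchema(int id, String schemaJson) {
        knownSchemas.put(id, schemaJson);
        suspended = false; // simplification: any newly visible schema resumes processing
    }

    // Each message carries only the ID of its writer schema, not the schema itself.
    Outcome consume(int schemaId, byte[] payload) {
        if (suspended || !knownSchemas.containsKey(schemaId)) {
            suspended = true; // stop here rather than misread the bytes
            return Outcome.SUSPENDED;
        }
        // A real consumer would decode the payload at this point, using the
        // writer schema for schemaId together with its own reader schema.
        return Outcome.PROCESSED;
    }

    public static void main(String[] args) {
        SchemaAwareConsumerSketch consumer = new SchemaAwareConsumerSketch();
        consumer.registerSchema(1, "{\"type\":\"record\",\"name\":\"Audit\",\"fields\":[]}");
        System.out.println(consumer.consume(1, new byte[0])); // known schema
        System.out.println(consumer.consume(2, new byte[0])); // unknown schema
    }
}
```

Keeping the consumer suspended, rather than skipping the unknown message, preserves ordering: once the new schema is registered, the same message can be retried and processing continues from where it stopped.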

So in summary, schema evolution helps the newer clients read older formats by making use of automatic format conversion and also helps the older clients suspend processing in a graceful manner till they have been enabled to understand newer formats.
