使用 Java 将 Json 对象转换为 Parquet 格式而不转换为 AVRO(不使用 Spark、Hive、Pig、Impala)
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/39858856/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Json object to Parquet format using Java without converting to AVRO (without using Spark, Hive, Pig, Impala)
提问by vijju
I have a scenario where I need to convert messages present as JSON objects to Apache Parquet format using Java. Any sample code or examples would be helpful. As far as I have found, Hive, Pig, or Spark are used to convert messages to Parquet. I need to convert to Parquet without involving these, using Java only.
我有一个场景，需要使用 Java 将以 Json 对象形式存在的消息转换为 Apache Parquet 格式。任何示例代码或示例都会有所帮助。就我目前所了解的，转换为 Parquet 通常要用到 Hive、Pig 或 Spark。我需要在不使用这些工具的情况下，仅通过 Java 完成转换。
回答by blue
To convert JSON data files to Parquet, you need some in-memory representation. Parquet doesn't have its own set of Java objects; instead, it reuses the objects from other formats, like Avro and Thrift. The idea is that Parquet works natively with the objects your applications probably already use.
要将 JSON 数据文件转换为 Parquet，您需要某种内存中的表示。Parquet 没有自己的一套 Java 对象；相反，它重用其他格式的对象，例如 Avro 和 Thrift。其设计理念是让 Parquet 原生支持您的应用程序可能已经在使用的对象。
To convert your JSON, you need to convert the records to Avro in-memory objects and pass those to Parquet, but you don't need to convert a file to Avro and then to Parquet.
要转换您的 JSON，您需要将记录转换为 Avro 内存对象并将它们传递给 Parquet，但您不需要先把文件转换成 Avro 文件，再转换为 Parquet。
Conversion to Avro objects is already done for you; see Kite's JsonUtil, which is ready to use as a file reader. The conversion method needs an Avro schema, but you can use that same library to infer an Avro schema from JSON data.
到 Avro 对象的转换已经为您实现好了，请参阅 Kite 的 JsonUtil，它可以直接用作文件阅读器。转换方法需要一个 Avro 模式，但您可以使用同一个库从 JSON 数据推断出 Avro 模式。
To write those records, you just need to use AvroParquetWriter. The whole setup looks like this:
要写入这些记录，您只需要使用 AvroParquetWriter。整个设置如下所示：
Schema jsonSchema = JsonUtil.inferSchema(fs.open(source), "RecordName", 20);
try (JSONFileReader<Record> reader = new JSONFileReader<>(
        fs.open(source), jsonSchema, Record.class)) {
  reader.initialize();
  try (ParquetWriter<Record> writer = AvroParquetWriter
      .<Record>builder(outputPath)
      .withConf(new Configuration())
      .withCompressionCodec(CompressionCodecName.SNAPPY)
      .withSchema(jsonSchema)
      .build()) {
    for (Record record : reader) {
      writer.write(record);
    }
  }
}
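To verify the output, the written Parquet file can be read back into Avro records with AvroParquetReader from the same parquet-avro library. A minimal sketch, assuming the same outputPath and the parquet-avro and Hadoop client jars on the classpath:
要验证输出，可以使用同一个 parquet-avro 库中的 AvroParquetReader 将写好的 Parquet 文件读回为 Avro 记录。以下是一个最简示意，假设使用与上面相同的 outputPath，并且类路径中有 parquet-avro 和 Hadoop 客户端的 jar：

```java
// Sketch only: assumes parquet-avro and Hadoop client jars on the classpath,
// and the same outputPath that the writer above produced.
try (ParquetReader<GenericRecord> reader = AvroParquetReader
        .<GenericRecord>builder(outputPath)
        .build()) {
  GenericRecord record;
  // read() returns null when the end of the file is reached
  while ((record = reader.read()) != null) {
    System.out.println(record);
  }
}
```

Reading back with GenericRecord avoids regenerating specific record classes; each record's fields can be accessed by name via record.get("fieldName").
通过 GenericRecord 读回可以避免重新生成特定的记录类；每条记录的字段都可以通过 record.get("fieldName") 按名称访问。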