
Note: the content below is taken from StackOverflow and is provided under the CC BY-SA 4.0 license. If you reuse or share it, you must attribute the original authors (not the translator). Original question: http://stackoverflow.com/questions/30565510/


How to read and write Map<String, Object> from/to parquet file in Java or Scala?

Tags: java, scala, avro, parquet

Asked by okigan

Looking for a concise example of how to read and write a Map<String, Object> from/to a Parquet file in Java or Scala?


Here is the expected structure, using com.fasterxml.jackson.databind.ObjectMapper as the serializer in Java (i.e. looking for the equivalent using Parquet):


import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Map;

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;

public static Map<String, Object> read(InputStream inputStream) throws IOException {
    ObjectMapper objectMapper = new ObjectMapper();

    // Deserialize JSON from the stream into a Map<String, Object>.
    return objectMapper.readValue(inputStream, new TypeReference<Map<String, Object>>() {
    });
}

public static void write(OutputStream outputStream, Map<String, Object> map) throws IOException {
    ObjectMapper objectMapper = new ObjectMapper();

    // Serialize the map as JSON to the stream.
    objectMapper.writeValue(outputStream, map);
}

Answered by Sercan Ozdemir

I'm not that familiar with Parquet, but, from here:


Schema schema = new Schema.Parser().parse(Resources.getResource("map.avsc").openStream());

File tmp = File.createTempFile(getClass().getSimpleName(), ".tmp");
tmp.deleteOnExit();
tmp.delete();
Path file = new Path(tmp.getPath());

AvroParquetWriter<GenericRecord> writer =
    new AvroParquetWriter<GenericRecord>(file, schema);

// Write a record with an empty map.
ImmutableMap<String, Integer> emptyMap = new ImmutableMap.Builder<String, Integer>().build();
GenericData.Record record = new GenericRecordBuilder(schema)
    .set("mymap", emptyMap).build();
writer.write(record);
writer.close();

AvroParquetReader<GenericRecord> reader = new AvroParquetReader<GenericRecord>(file);
GenericRecord nextRecord = reader.read();

assertNotNull(nextRecord);
assertEquals(emptyMap, nextRecord.get("mymap"));

In your situation, replace the ImmutableMap (Google Collections) with a regular Map, as below:


Schema schema = new Schema.Parser().parse(Resources.getResource("map.avsc").openStream());

File tmp = File.createTempFile(getClass().getSimpleName(), ".tmp");
tmp.deleteOnExit();
tmp.delete();
Path file = new Path(tmp.getPath());

AvroParquetWriter<GenericRecord> writer = new AvroParquetWriter<GenericRecord>(file, schema);

// This time the map is not empty.
Map<String, Object> map = new HashMap<String, Object>();
map.put("SOMETHING", new SOMETHING()); // SOMETHING stands in for your own value type
GenericData.Record record = new GenericRecordBuilder(schema).set("mymap", map).build();
writer.write(record);
writer.close();

AvroParquetReader<GenericRecord> reader = new AvroParquetReader<GenericRecord>(file);
GenericRecord nextRecord = reader.read();

assertNotNull(nextRecord);
assertEquals(map, nextRecord.get("mymap"));

I didn't test the code, but give it a try.

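Note that both snippets load a map.avsc Avro schema from the classpath, and the answer never shows it. As a minimal sketch, an equivalent schema could be defined inline like this; the record name "MapRecord" and the int value type are assumptions inferred from the code above, not something given in the original answer:

import org.apache.avro.Schema;

// A minimal sketch of what "map.avsc" might describe: a record with a single
// field "mymap" whose type is a map with int values.
String mapSchemaJson =
    "{\"type\": \"record\", \"name\": \"MapRecord\", \"fields\": ["
  + "  {\"name\": \"mymap\", \"type\": {\"type\": \"map\", \"values\": \"int\"}}"
  + "]}";
Schema schema = new Schema.Parser().parse(mapSchemaJson);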

Answered by Dishant Kamble

I doubt there is a readily available solution for this. When you talk about Maps, it's still possible to create an Avro schema out of one, provided the values of the map are of a primitive type, or of a complex type whose fields are in turn primitive.


In your case,


  • If you have a Map<String, Integer> => it will create a schema with the map values being int.
  • If you have a Map<String, CustomObject>,
    • a. CustomObject has fields int, float, char ... (i.e. any primitive type): the schema generation will be valid and can then be used to successfully convert to Parquet.
    • b. CustomObject has fields which are non-primitive: the generated schema will be malformed and the resulting ParquetWriter will fail.

To resolve this issue, you can try converting your object into a JsonObject and then use the Apache Spark libraries to convert it to Parquet.

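For instance, a rough sketch of that JSON-then-Spark route; the paths, class name, and app name below are placeholders, not from the original answer:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JsonToParquet {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("json-to-parquet") // placeholder app name
                .master("local[*]")
                .getOrCreate();

        // Read JSON produced by the Jackson ObjectMapper (one JSON object per line);
        // Spark infers a schema from the data.
        Dataset<Row> df = spark.read().json("/tmp/maps.json"); // placeholder path

        // Write the same data out as Parquet.
        df.write().parquet("/tmp/maps.parquet"); // placeholder path

        // Reading back is symmetric.
        Dataset<Row> restored = spark.read().parquet("/tmp/maps.parquet");
        restored.show();

        spark.stop();
    }
}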

Answered by rahul

Apache Drill is your answer!


Convert to Parquet: You can use the CTAS (Create Table As Select) feature in Drill. By default, Drill creates a folder with Parquet files after executing the query below. You can substitute any query, and Drill writes the output of your query into Parquet files:


create table file_parquet as select * from dfs.`/data/file.json`;

Convert from Parquet: We also use the CTAS feature here; however, we ask Drill to use a different format for writing the output:


alter session set `store.format`='json';
create table file_json as select * from dfs.`/data/file.parquet`;

Refer to http://drill.apache.org/docs/create-table-as-ctas-command/ for more information.
