Create Parquet files in Java

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/39728854/

Tags: java, parquet

Asked by Imbar M.

Is there a way to create Parquet files from Java?

I have data in memory (Java classes) and I want to write it into a Parquet file, to later read it from Apache Drill.

Is there a simple way to do this, like inserting data into a SQL table?

GOT IT

Thanks for the help.

Combining the answers and this link, I was able to create a Parquet file and read it back with Drill.

Accepted answer by MaxNevermind

ParquetWriter's constructors are deprecated (as of 1.8.1), but ParquetWriter itself is not; you can still create a ParquetWriter by extending the abstract Builder subclass nested inside it.

Here is an example from the Parquet creators themselves, ExampleParquetWriter:

  // Excerpt from parquet-mr's ExampleParquetWriter: a concrete ParquetWriter.Builder
  // that supplies a WriteSupport for the example Group data model.
  public static class Builder extends ParquetWriter.Builder<Group, Builder> {
    private MessageType type = null;
    private Map<String, String> extraMetaData = new HashMap<String, String>();

    private Builder(Path file) {
      super(file);
    }

    public Builder withType(MessageType type) {
      this.type = type;
      return this;
    }

    public Builder withExtraMetaData(Map<String, String> extraMetaData) {
      this.extraMetaData = extraMetaData;
      return this;
    }

    @Override
    protected Builder self() {
      return this;
    }

    @Override
    protected WriteSupport<Group> getWriteSupport(Configuration conf) {
      return new GroupWriteSupport(type, extraMetaData);
    }

  }
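
A writer produced by this kind of builder is then used like any other ParquetWriter. Below is a minimal sketch (not part of the original answer) of writing a couple of Group records through parquet-hadoop's bundled ExampleParquetWriter; the schema, output path, and field values are invented for illustration:

import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

// Hypothetical two-column schema, parsed from its textual representation.
MessageType schema = MessageTypeParser.parseMessageType(
        "message example { required int32 id; required binary name (UTF8); }");
SimpleGroupFactory groups = new SimpleGroupFactory(schema);

// ExampleParquetWriter.builder(Path) returns a Builder like the one above.
try (ParquetWriter<Group> writer = ExampleParquetWriter
        .builder(new Path("/tmp/example.parquet"))
        .withType(schema)
        .build()) {
    writer.write(groups.newGroup().append("id", 1).append("name", "alice"));
    writer.write(groups.newGroup().append("id", 2).append("name", "bob"));
}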

If you don't want to use Group and GroupWriteSupport (bundled in Parquet, but intended just as an example of a data-model implementation), you can use the Avro, Protocol Buffers, or Thrift in-memory data models. Here is an example of writing Parquet using Avro:

// Imports for the snippet below:
import org.apache.avro.generic.GenericData;
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

// fileToWrite is an org.apache.hadoop.fs.Path; schema is the records' Avro Schema.
try (ParquetWriter<GenericData.Record> writer = AvroParquetWriter
        .<GenericData.Record>builder(fileToWrite)
        .withSchema(schema)
        .withConf(new Configuration())
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .build()) {
    for (GenericData.Record record : recordsToWrite) {
        writer.write(record);
    }
}
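
The snippet above assumes an Avro schema (schema), the records to write (recordsToWrite), and an output path (fileToWrite) already exist. As a minimal sketch (not from the original answer), they could be set up like this; the record type and fields are invented for illustration:

import java.util.Arrays;
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.hadoop.fs.Path;

// Hypothetical "User" record type with a required name and age.
Schema schema = SchemaBuilder.record("User").fields()
        .requiredString("name")
        .requiredInt("age")
        .endRecord();

GenericData.Record alice = new GenericData.Record(schema);
alice.put("name", "alice");
alice.put("age", 30);

List<GenericData.Record> recordsToWrite = Arrays.asList(alice);
Path fileToWrite = new Path("/tmp/users.parquet");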

You will need these dependencies:

<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-avro</artifactId>
    <version>1.8.1</version>
</dependency>

<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-hadoop</artifactId>
    <version>1.8.1</version>
</dependency>

Full example here.

Answer by Zoltan

A few possible ways to do it:

  • Use the Java Parquet library to write Parquet directly from your code.
  • Connect to Hive or Impala using JDBC and insert the data using SQL. Please note that if you insert rows one by one, each individual record ends up in a separate file, which will totally ruin performance. You should insert lots of rows at once, which is not trivial, so I don't recommend this approach.
  • Save the data to a delimited text file, then do the following steps in either Hive or Impala (a JDBC sketch of these statements follows the list):
    • Define a table over the text file to allow Hive/Impala to read the data. Let's call this table text_table. See Impala's Create Table Statement for details.
    • Create a new table with identical columns but specifying Parquet as its file format. Let's call this table parquet_table.
    • Finally, do an insert into parquet_table select * from text_table to copy all data from the text file to the Parquet table.
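
As a rough sketch of the third approach (not part of the original answer), the same three statements can be issued from Java through Hive's JDBC driver. The connection URL, columns, delimiter, and file location below are assumptions to adapt to your cluster, and the hive-jdbc dependency must be on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class TextToParquet {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint; adjust host, port, and database.
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
             Statement stmt = conn.createStatement()) {
            // Step 1: a table over the delimited text file so Hive/Impala can read it.
            stmt.execute("CREATE EXTERNAL TABLE text_table (id INT, name STRING) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
                    + "LOCATION '/data/text_table'");
            // Step 2: an identical table stored as Parquet.
            stmt.execute("CREATE TABLE parquet_table (id INT, name STRING) STORED AS PARQUET");
            // Step 3: copy all rows from the text table into the Parquet table.
            stmt.execute("INSERT INTO parquet_table SELECT * FROM text_table");
        }
    }
}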