Create parquet files in Java
Original URL: http://stackoverflow.com/questions/39728854/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by Imbar M.
Is there a way to create parquet files from java?
I have data in memory (java classes) and I want to write it into a parquet file, to later read it from apache-drill.
Is there a simple way to do this, like inserting data into a SQL table?
GOT IT
Thanks for the help.
Combining the answers and this link, I was able to create a parquet file and read it back with drill.
Accepted answer by MaxNevermind
ParquetWriter's constructors are deprecated (as of 1.8.1), but ParquetWriter itself is not; you can still create a ParquetWriter by extending the abstract Builder subclass inside it.
Here is an example from the Parquet creators themselves, ExampleParquetWriter:
public static class Builder extends ParquetWriter.Builder<Group, Builder> {
    private MessageType type = null;
    private Map<String, String> extraMetaData = new HashMap<String, String>();

    private Builder(Path file) {
        super(file);
    }

    public Builder withType(MessageType type) {
        this.type = type;
        return this;
    }

    public Builder withExtraMetaData(Map<String, String> extraMetaData) {
        this.extraMetaData = extraMetaData;
        return this;
    }

    @Override
    protected Builder self() {
        return this;
    }

    @Override
    protected WriteSupport<Group> getWriteSupport(Configuration conf) {
        return new GroupWriteSupport(type, extraMetaData);
    }
}
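For context, here is a minimal usage sketch, assuming parquet-hadoop 1.8.x on the classpath (where ExampleParquetWriter exposes a static builder(Path) factory that reaches the private constructor above); the schema, output path, and field names are invented for illustration:

import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class WriteGroupExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical two-column schema, written in Parquet's message-type syntax.
        MessageType schema = MessageTypeParser.parseMessageType(
            "message example { required int32 id; required binary name (UTF8); }");

        try (ParquetWriter<Group> writer =
                 ExampleParquetWriter.builder(new Path("/tmp/example.parquet"))
                     .withType(schema)
                     .build()) {
            // Build one record with the two fields declared above and write it.
            Group group = new SimpleGroupFactory(schema).newGroup()
                .append("id", 1)
                .append("name", "parquet");
            writer.write(group);
        }
    }
}

Drill should then be able to query /tmp/example.parquet directly.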
If you don't want to use Group and GroupWriteSupport (bundled in Parquet, but intended only as an example of a data-model implementation), you can go with the Avro, Protocol Buffers, or Thrift in-memory data models. Here is an example of writing Parquet using Avro:
try (ParquetWriter<GenericData.Record> writer = AvroParquetWriter
        .<GenericData.Record>builder(fileToWrite)
        .withSchema(schema)
        .withConf(new Configuration())
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .build()) {
    for (GenericData.Record record : recordsToWrite) {
        writer.write(record);
    }
}
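For completeness, here is one way the surrounding pieces (schema, fileToWrite, recordsToWrite) could be set up; the Person schema and its sample values are invented for illustration:

import java.util.Arrays;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.hadoop.fs.Path;

public class AvroSetup {
    // Hypothetical Avro record schema with two fields, parsed from its JSON definition.
    static final Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Person\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"int\"},"
        + "{\"name\":\"name\",\"type\":\"string\"}]}");

    public static void main(String[] args) {
        GenericData.Record record = new GenericData.Record(schema);
        record.put("id", 1);
        record.put("name", "alice");

        List<GenericData.Record> recordsToWrite = Arrays.asList(record);
        Path fileToWrite = new Path("/tmp/people.parquet");
        // ...pass schema, recordsToWrite, and fileToWrite to the writer block above.
    }
}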
You will need these dependencies:
<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-avro</artifactId>
    <version>1.8.1</version>
</dependency>
<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-hadoop</artifactId>
    <version>1.8.1</version>
</dependency>
Full example here.
Answered by Zoltan
A few possible ways to do it:
- Use the Java Parquet library to write Parquet directly from your code.
- Connect to Hive or Impala using JDBC and insert the data using SQL. Please note that if you insert rows one by one it will result in separate files for each individual record and will totally ruin the performance. You should insert lots of rows at once, which is not trivial, so I don't recommend this approach (a sketch follows this list).
- Save the data to a delimited text file, then do the following steps in either Hive or Impala:
  - Define a table over the text file to allow Hive/Impala to read the data. Let's call this table text_table. See Impala's Create Table Statement for details.
  - Create a new table with identical columns but specifying Parquet as its file format. Let's call this table parquet_table.
  - Finally do an insert into parquet_table select * from text_table to copy all data from the text file to the parquet table.
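To make the JDBC approach concrete, here is a minimal sketch of a multi-row insert; the jdbc:hive2 URL, table name, and column values are hypothetical, and multi-row VALUES support depends on your Hive/Impala version:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class JdbcBulkInsert {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint; Impala exposes a similar JDBC URL.
        try (Connection conn =
                 DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
             Statement stmt = conn.createStatement()) {
            // Build ONE multi-row INSERT so all rows go in a single statement,
            // avoiding the file-per-record problem described above.
            StringBuilder sql = new StringBuilder("insert into parquet_table values ");
            for (int i = 0; i < 1000; i++) {
                if (i > 0) sql.append(", ");
                sql.append("(").append(i).append(", 'name").append(i).append("')");
            }
            stmt.execute(sql.toString());
        }
    }
}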