Java: How to append data to an existing Parquet file
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/39234391/
How to append data to an existing parquet file
Asked by Krishas
I'm using the following code to create ParquetWriter and to write records to it.
// Create a writer for the given path with Avro write support, Snappy compression,
// and the configured block/page sizes, then write a single Avro record.
ParquetWriter<GenericRecord> parquetWriter = new ParquetWriter<>(path, writeSupport, CompressionCodecName.SNAPPY, BLOCK_SIZE, PAGE_SIZE);
final GenericRecord record = new GenericData.Record(avroSchema);
parquetWriter.write(record);
But this only allows creating new files (at the specified path). Is there a way to append data to an existing Parquet file (at path)? Caching the parquetWriter is not feasible in my case.
Answered by vgunnu
Parquet is a columnar file format; it is optimized for writing all columns together. Any edit requires rewriting the whole file.
From Wikipedia:
A column-oriented database serializes all of the values of a column together, then the values of the next column, and so on. For our example table, the data would be stored in this fashion:
10:001,12:002,11:003,22:004;
Smith:001,Jones:002,Johnson:003,Jones:004;
Joe:001,Mary:002,Cathy:003,Bob:004;
40000:001,50000:002,44000:003,55000:004;
Some links
https://en.wikipedia.org/wiki/Column-oriented_DBMS
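Because the file cannot be appended to in place, "appending" in practice means rewriting: read everything already in the file and write it back out together with the new records. Below is a minimal sketch of that idea using the parquet-avro reader/writer builders; the class name, paths, schema handling, and Snappy codec here are illustrative assumptions, not part of the original answer.

// Sketch only: "append" by copying the existing records plus the new ones into a fresh file.
// Assumes parquet-avro is on the classpath and the Avro schema matches the existing file.
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import java.io.IOException;
import java.util.List;

public class ParquetRewrite {
    public static void rewriteWithNewRecords(Path existing, Path output, Schema schema,
                                             List<GenericRecord> newRecords) throws IOException {
        try (ParquetReader<GenericRecord> reader =
                     AvroParquetReader.<GenericRecord>builder(existing).build();
             ParquetWriter<GenericRecord> writer =
                     AvroParquetWriter.<GenericRecord>builder(output)
                             .withSchema(schema)
                             .withCompressionCodec(CompressionCodecName.SNAPPY)
                             .build()) {
            // Copy every record that is already in the old file.
            GenericRecord record;
            while ((record = reader.read()) != null) {
                writer.write(record);
            }
            // Then write the records being "appended".
            for (GenericRecord r : newRecords) {
                writer.write(r);
            }
        }
        // The caller can then replace the old file with the new one.
    }
}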
Answered by bluszcz
There is a Spark API SaveMode with an Append option: https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/SaveMode.html which I believe solves your problem.
Example of use:
df.write.mode('append').parquet('parquet_data_file')
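The snippet above is the Python/Scala form. Since the question is about Java, a rough Java equivalent using SaveMode.Append might look like the following; it uses the modern SparkSession entry point (the linked docs are for Spark 1.4, but the SaveMode enum is the same), and the session setup, input source, and file names are illustrative assumptions.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class AppendParquetWithSpark {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("append-parquet-example")
                .master("local[*]")
                .getOrCreate();

        // Load (or otherwise build) the rows to be appended.
        Dataset<Row> df = spark.read().json("new_records.json");

        // SaveMode.Append adds new part files under the existing Parquet output
        // instead of overwriting it.
        df.write().mode(SaveMode.Append).parquet("parquet_data_file");

        spark.stop();
    }
}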