Methods for writing Parquet files using Python?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license, link the original, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/32940416/
Asked by octagonC
I'm having trouble finding a library that allows Parquet files to be written using Python. Bonus points if I can use Snappy or a similar compression mechanism in conjunction with it.
Thus far the only method I have found is using Spark with the pyspark.sql.DataFrame Parquet support.
I have some scripts that need to write Parquet files that are not Spark jobs. Is there any approach to writing Parquet files in Python that doesn't involve pyspark.sql?
Accepted answer by rkrzr
Update (March 2017): There are currently 2 libraries capable of writing Parquet files:
- fastparquet
- pyarrow
Both of them still seem to be under heavy development and they come with a number of disclaimers (no support for nested data, for example), so you will have to check whether they support everything you need.
OLD ANSWER:
As of February 2016 there seems to be NO Python-only library capable of writing Parquet files.
If you only need to read Parquet files there is python-parquet.
As a workaround you will have to rely on some other process, e.g. pyspark.sql (which uses Py4J and runs on the JVM, and thus cannot be used directly from your average CPython program).
Answered by Muayyad Alsadi
fastparquet does have write support; here is a snippet to write data to a file:
from fastparquet import write
write('outfile.parq', df)
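To make that snippet self-contained, here is a minimal sketch; the example DataFrame is made up, and the compression keyword (which needs python-snappy for 'SNAPPY') is an assumption based on fastparquet's write options:

import pandas as pd
from fastparquet import write

# a small example DataFrame
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# uncompressed write
write('outfile.parq', df)

# Snappy-compressed write (requires the python-snappy package)
write('outfile.snappy.parq', df, compression='SNAPPY')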
Answered by Grant Shannon
using fastparquet you can write a pandas df to parquet with either snappy or gzip compression as follows:
make sure you have installed the following:
$ conda install python-snappy
$ conda install fastparquet
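If you are not using conda, the equivalent pip installs would be roughly as follows (python-snappy may additionally need the system snappy library to build):

$ pip install python-snappy
$ pip install fastparquet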
do imports
import pandas as pd
import snappy
import fastparquet
assume you have the following pandas df
df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
send df to parquet with snappy compression
df.to_parquet('df.snap.parquet', compression='snappy')
send df to parquet with gzip compression
df.to_parquet('df.gzip.parquet', compression='gzip')
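Note that DataFrame.to_parquet picks an engine automatically (pyarrow if it is installed, otherwise fastparquet); to make sure fastparquet does the writing you can pass the engine explicitly, e.g.:

df.to_parquet('df.snap.parquet', engine='fastparquet', compression='snappy')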
check:
read parquet back into pandas df
pd.read_parquet('df.snap.parquet')
or
pd.read_parquet('df.gzip.parquet')
output:
   col1  col2
0     1     3
1     2     4
Answered by Kushagra Verma
pyspark seems to be the best alternative right now for writing out parquet with Python. It may seem like using a sword in place of a needle, but that's how it is at the moment.
- It supports most compression types like lzo and snappy; Zstd support should come soon.
- It has complete schema support (nested data, structs, etc.)
Simply do pip install pyspark and you are good to go.
https://spark.apache.org/docs/latest/sql-data-sources-parquet.html
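A minimal sketch of writing (and reading back) parquet this way; the example data and file name are only illustrative:

from pyspark.sql import SparkSession

# start a local Spark session
spark = SparkSession.builder.appName("parquet-example").getOrCreate()

# a tiny example DataFrame
df = spark.createDataFrame([(1, 3), (2, 4)], ["col1", "col2"])

# write with an explicit codec (snappy is Spark's default)
df.write.parquet("df.parquet", compression="snappy")

# read it back to verify
spark.read.parquet("df.parquet").show()

spark.stop()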
Answered by DataFramed
A simple method to write a pandas dataframe to parquet.
Assuming df is the pandas dataframe, we need to import the following libraries.
import pyarrow as pa
import pyarrow.parquet as pq
First, convert the dataframe df into a pyarrow table.
# Convert DataFrame to Apache Arrow Table
table = pa.Table.from_pandas(df)
Second, write the table into a parquet file, say file_name.parquet.
# Write the table to a Parquet file (Snappy compression by default)
pq.write_table(table, 'file_name.parquet')
NOTE: parquet files can be compressed with different codecs while writing. The following are the popular compression formats.
- Snappy (default, requires no argument)
- gzip
- brotli
Parquet with Snappy compression
pq.write_table(table, 'file_name.parquet')
Parquet with GZIP compression
pq.write_table(table, 'file_name.parquet', compression='GZIP')
Parquet with Brotli compression
pq.write_table(table, 'file_name.parquet', compression='BROTLI')
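To check the result, a minimal sketch of reading one of the files back with pyarrow and converting it to pandas (file name as above):

# read the file back and convert to a pandas dataframe to verify the round trip
table2 = pq.read_table('file_name.parquet')
df2 = table2.to_pandas()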
A comparison of the different parquet compression formats can be found in the reference below.
Reference: https://tech.jda.com/efficient-dataframe-storage-with-apache-parquet/