Methods for writing Parquet files using Python?

Disclaimer: this content is taken from StackOverflow and is provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/32940416/


python apache-spark apache-spark-sql parquet snappy

Asked by octagonC

I'm having trouble finding a library that allows Parquet files to be written using Python. Bonus points if I can use Snappy or a similar compression mechanism in conjunction with it.

Thus far the only method I have found is using Spark with its pyspark.sql.DataFrame Parquet support.

I have some scripts that need to write Parquet files that are not Spark jobs. Is there any approach to writing Parquet files in Python that doesn't involve pyspark.sql?

Accepted answer by rkrzr

Update (March 2017): There are currently two libraries capable of writing Parquet files:

  1. fastparquet
  2. pyarrow

Both of them still appear to be under heavy development and come with a number of disclaimers (e.g. no support for nested data), so you will have to check whether they support everything you need.

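Both can also be driven through pandas' own DataFrame.to_parquet, which lets you pick the backend explicitly. A minimal sketch, assuming a recent pandas (0.21+) and at least one of the two engines installed; the filenames are illustrative:

import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# pandas dispatches to whichever backend you name (or picks one with engine='auto')
df.to_parquet('out_pyarrow.parquet', engine='pyarrow')
df.to_parquet('out_fastparquet.parquet', engine='fastparquet')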

OLD ANSWER:

As of February 2016 there seems to be NO Python-only library capable of writing Parquet files.

If you only need to read Parquet files there is python-parquet.

As a workaround you will have to rely on some other process, e.g. pyspark.sql (which uses Py4J and runs on the JVM, and thus cannot be used directly from your average CPython program).

Answer by Muayyad Alsadi

fastparquet does have write support; here is a snippet to write a DataFrame to a file:

from fastparquet import write
# df is an existing pandas DataFrame
write('outfile.parq', df)
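fastparquet's write also takes a compression argument; a minimal sketch, assuming the python-snappy package is installed for SNAPPY (the filename is illustrative):

# GZIP also works and needs no extra dependency
write('outfile.snappy.parq', df, compression='SNAPPY')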

Answer by Grant Shannon

Using fastparquet you can write a pandas df to parquet with either snappy or gzip compression, as follows:

make sure you have installed the following:

$ conda install python-snappy
$ conda install fastparquet

do imports

import pandas as pd 
import snappy
import fastparquet

assume you have the following pandas df

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})

send df to parquet with snappy compression

df.to_parquet('df.snap.parquet', compression='snappy')

send df to parquet with gzip compression

df.to_parquet('df.gzip.parquet', compression='gzip')

check:

read parquet back into pandas df

pd.read_parquet('df.snap.parquet')

or

pd.read_parquet('df.gzip.parquet')

output:

   col1  col2
0     1     3
1     2     4

Answer by Kushagra Verma

pyspark seems to be the best alternative right now for writing out parquet with Python. It may seem like using a sword in place of a needle, but that's how it is at the moment.

  • It supports most compression types, like lzo and snappy; zstd support should arrive soon.
  • It has complete schema support (nested types, structs, etc.).

Simply do pip install pyspark and you are good to go.

https://spark.apache.org/docs/latest/sql-data-sources-parquet.html

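A minimal write sketch, assuming pyspark is installed and a local JVM is available (names and paths are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('write-parquet').getOrCreate()
sdf = spark.createDataFrame([(1, 3), (2, 4)], ['col1', 'col2'])
# snappy is the default codec in recent Spark versions
sdf.write.parquet('outdir.parquet', compression='snappy')
spark.stop()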

Answer by DataFramed

Simple method to write pandas dataframe to parquet.

Assuming df is the pandas dataframe, we need to import the following libraries.

import pyarrow as pa
import pyarrow.parquet as pq

First, convert the dataframe df into a pyarrow table.

# Convert DataFrame to Apache Arrow Table
table = pa.Table.from_pandas(df)

Second, write the table into a parquet file, say file_name.parquet.

# Write the table to a parquet file (snappy compression by default)
pq.write_table(table, 'file_name.parquet')

NOTE: parquet files can be compressed while writing. The following are the popular compression codecs.

  • Snappy (default, requires no argument)
  • gzip
  • brotli

Parquet with Snappy compression

pq.write_table(table, 'file_name.parquet')

Parquet with GZIP compression

pq.write_table(table, 'file_name.parquet', compression='GZIP')

Parquet with Brotli compression

pq.write_table(table, 'file_name.parquet', compression='BROTLI')
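To sanity-check any of these outputs you can read the file back with the same pyarrow imports; a short sketch (filename as above):

# Read the parquet file back and convert to a pandas DataFrame
table2 = pq.read_table('file_name.parquet')
df2 = table2.to_pandas()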

Comparison of the results achieved with the different parquet compression formats:

(image: comparison of the different parquet compression formats)

Reference: https://tech.jda.com/efficient-dataframe-storage-with-apache-parquet/
