
Note: this page is an English translation of a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow the CC BY-SA license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/33813815/


How to read a Parquet file into Pandas DataFrame?

Tags: python, pandas, parquet, blaze

Asked by Daniel Mahler

How to read a modestly sized Parquet data-set into an in-memory Pandas DataFrame without setting up a cluster computing infrastructure such as Hadoop or Spark? This is only a moderate amount of data that I would like to read in-memory with a simple Python script on a laptop. The data does not reside on HDFS. It is either on the local file system or possibly in S3. I do not want to spin up and configure other services like Hadoop, Hive or Spark.


I thought Blaze/Odo would have made this possible: the Odo documentation mentions Parquet, but the examples all seem to go through an external Hive runtime.


Accepted answer by chrisaycock

pandas 0.21 introduces new functions for Parquet:


import pandas as pd

pd.read_parquet('example_pa.parquet', engine='pyarrow')

or


pd.read_parquet('example_fp.parquet', engine='fastparquet')

The pandas documentation explains:


These engines are very similar and should read/write nearly identical parquet format files. The libraries differ in their underlying dependencies (fastparquet uses numba, while pyarrow uses a C library).

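For the laptop-plus-S3 setup described in the question, a minimal sketch might look like the following (the file names and bucket are hypothetical, and the S3 read assumes the s3fs package is installed alongside pandas):

import pandas as pd

# Local file system: with engine left at 'auto', pandas uses whichever
# engine is installed (pyarrow by preference, falling back to fastparquet).
df = pd.read_parquet('data/example.parquet')

# S3: pandas can read s3:// URLs directly when s3fs is installed;
# credentials come from the usual AWS environment/configuration.
df_s3 = pd.read_parquet('s3://my-bucket/path/example.parquet')

print(df.head())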

Answered by danielfrg

Update: since I answered this, there has been a lot of work on Apache Arrow for better reading and writing of Parquet. Also see: http://wesmckinney.com/blog/python-parquet-multithreading/


There is a Python Parquet reader that works relatively well: https://github.com/jcrobak/parquet-python


It creates Python objects which you then have to move into a Pandas DataFrame, so the process will be slower than pd.read_csv, for example.

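For illustration, a rough sketch of that route, assuming the csv-style DictReader interface shown in the project's README (the file name is hypothetical):

import parquet  # the parquet-python package
import pandas as pd

# DictReader yields one dict per row; everything is materialized in memory
# before handing it to pandas, which is part of why this path is slow.
with open('example.parquet', 'rb') as fo:
    rows = list(parquet.DictReader(fo))

df = pd.DataFrame(rows)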

Answered by WY Hsu

Aside from pandas, Apache pyarrow also provides a way to transform a Parquet file into a DataFrame.


The code is simple; just type:


import pyarrow.parquet as pq

df = pq.read_table(source=your_file_path).to_pandas()

For more information, see the Apache pyarrow documentation on Reading and Writing Single Files.

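One detail worth knowing: read_table also takes a columns argument, so you can pull just a few columns of a wide file into memory. A small sketch (the path and column names are made up):

import pyarrow.parquet as pq

# Reading only the needed columns cuts both I/O and memory for wide files.
table = pq.read_table('example.parquet', columns=['id', 'value'])
df = table.to_pandas()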