
Note: this page is an English translation of a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow the CC BY-SA license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/33813815/


How to read a Parquet file into Pandas DataFrame?

Tags: python, pandas, parquet, blaze

Asked by Daniel Mahler

How to read a modestly sized Parquet data-set into an in-memory Pandas DataFrame without setting up a cluster computing infrastructure such as Hadoop or Spark? This is only a moderate amount of data that I would like to read in-memory with a simple Python script on a laptop. The data does not reside on HDFS. It is either on the local file system or possibly in S3. I do not want to spin up and configure other services like Hadoop, Hive or Spark.


I thought Blaze/Odo would have made this possible: the Odo documentation mentions Parquet, but the examples all seem to go through an external Hive runtime.


Accepted answer by chrisaycock

pandas 0.21 introduces new functions for Parquet:


import pandas as pd

pd.read_parquet('example_pa.parquet', engine='pyarrow')

or


pd.read_parquet('example_fp.parquet', engine='fastparquet')

The pandas documentation explains:


These engines are very similar and should read/write nearly identical parquet format files. The libraries differ in their underlying dependencies (fastparquet uses numba, while pyarrow uses a C library).

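For the laptop-plus-S3 setup described in the question, a minimal sketch might look like the following (the file names and bucket are hypothetical, and the S3 read assumes the s3fs package is installed alongside pandas):

import pandas as pd

# Local file system: with engine left at 'auto', pandas uses whichever
# engine is installed (pyarrow by preference, falling back to fastparquet).
df = pd.read_parquet('data/example.parquet')

# S3: pandas can read s3:// URLs directly when s3fs is installed;
# credentials come from the usual AWS environment/configuration.
df_s3 = pd.read_parquet('s3://my-bucket/path/example.parquet')

print(df.head())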

Answered by danielfrg

Update: since I answered this, there has been a lot of work on Apache Arrow for better reading and writing of Parquet. Also see: http://wesmckinney.com/blog/python-parquet-multithreading/


There is a Python Parquet reader that works relatively well: https://github.com/jcrobak/parquet-python


It creates Python objects which you then have to move into a Pandas DataFrame, so the process will be slower than pd.read_csv, for example.

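For illustration, a rough sketch of that route, assuming the csv-style DictReader interface shown in the project's README (the file name is hypothetical):

import parquet  # the parquet-python package
import pandas as pd

# DictReader yields one dict per row; everything is materialized in memory
# before handing it to pandas, which is part of why this path is slow.
with open('example.parquet', 'rb') as fo:
    rows = list(parquet.DictReader(fo))

df = pd.DataFrame(rows)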

Answered by WY Hsu

Aside from pandas, Apache pyarrow also provides a way to transform a Parquet file into a DataFrame.


The code is simple; just type:


import pyarrow.parquet as pq

df = pq.read_table(source=your_file_path).to_pandas()

For more information, see the Apache pyarrow documentation on Reading and Writing Single Files.

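One detail worth knowing: read_table also takes a columns argument, so you can pull just a few columns of a wide file into memory. A small sketch (the path and column names are made up):

import pyarrow.parquet as pq

# Reading only the needed columns cuts both I/O and memory for wide files.
table = pq.read_table('example.parquet', columns=['id', 'value'])
df = table.to_pandas()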