Python: How to read a Parquet file into a Pandas DataFrame?
Disclaimer: This page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license, cite the original URL, and attribute it to the original author (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/33813815/
How to read a Parquet file into Pandas DataFrame?
Asked by Daniel Mahler
How to read a modestly sized Parquet data-set into an in-memory Pandas DataFrame without setting up a cluster computing infrastructure such as Hadoop or Spark? This is only a moderate amount of data that I would like to read in-memory with a simple Python script on a laptop. The data does not reside on HDFS. It is either on the local file system or possibly in S3. I do not want to spin up and configure other services like Hadoop, Hive or Spark.
I thought Blaze/Odo would have made this possible: the Odo documentation mentions Parquet, but the examples all seem to go through an external Hive runtime.
Accepted answer by chrisaycock
pandas 0.21 introduces new functions for Parquet:
pd.read_parquet('example_pa.parquet', engine='pyarrow')
or
pd.read_parquet('example_fp.parquet', engine='fastparquet')
The above link explains:
These engines are very similar and should read/write nearly identical Parquet format files. The libraries differ in their underlying dependencies (fastparquet uses Numba, while pyarrow uses a C library).
Answered by danielfrg
Update: since I answered this, there has been a lot of work in this area; look at Apache Arrow for better reading and writing of Parquet. Also: http://wesmckinney.com/blog/python-parquet-multithreading/
There is a Python Parquet reader that works relatively well: https://github.com/jcrobak/parquet-python
It will create Python objects, which you then have to move into a Pandas DataFrame, so the process will be slower than pd.read_csv, for example.
Answered by WY Hsu
Aside from pandas, Apache pyarrow also provides a way to transform a Parquet file into a DataFrame.
The code is simple, just type:
代码很简单,只需输入:
import pyarrow.parquet as pq
df = pq.read_table(source=your_file_path).to_pandas()
For more information, see the Apache pyarrow documentation: Reading and Writing Single Files.

