Pandas cannot read parquet files created in PySpark

Original URL: http://stackoverflow.com/questions/54201799/

Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by Thomas
I am writing a parquet file from a Spark DataFrame the following way:
df.write.parquet("path/myfile.parquet", mode = "overwrite", compression="gzip")
This creates a folder with multiple files in it.
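For reference, the folder's contents can be listed from Python; a typical Spark output holds one part file per partition plus a _SUCCESS marker (exact file names vary by run, and the path here is the one from the write above):

import os

print(os.listdir("path/myfile.parquet"))
# e.g. ['_SUCCESS', 'part-00000-....gz.parquet', 'part-00001-....gz.parquet']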
When I try to read this into pandas, I get the following errors, depending on which parser I use:
import pandas as pd
df = pd.read_parquet("path/myfile.parquet", engine="pyarrow")
PyArrow:
File "pyarrow\error.pxi", line 83, in pyarrow.lib.check_status
ArrowIOError: Invalid parquet file. Corrupt footer.
fastparquet:
File "C:\Program Files\Anaconda3\lib\site-packages\fastparquet\util.py", line 38, in default_open
    return open(f, mode)
PermissionError: [Errno 13] Permission denied: 'path/myfile.parquet'
I am using the following versions:
- Spark 2.4.0
- Pandas 0.23.4
- pyarrow 0.10.0
- fastparquet 0.2.1
I tried gzip as well as snappy compression. Neither works. I of course made sure that I have the file in a location where Python has permissions to read/write.
It would already help if somebody was able to reproduce this error.
Accepted answer by Thomas
Since this still seems to be an issue even with newer pandas versions, I wrote some functions to circumvent this as part of a larger pyspark helpers library:
import datetime
import os

import pandas as pd


def read_parquet_folder_as_pandas(path, verbosity=1):
    """Read all parquet part files in a folder and concatenate them into one DataFrame."""
    files = [f for f in os.listdir(path) if f.endswith("parquet")]
    if verbosity > 0:
        print("{} parquet files found. Beginning reading...".format(len(files)), end="")
        start = datetime.datetime.now()

    # Read each part file individually, then stack everything into a single DataFrame.
    df_list = [pd.read_parquet(os.path.join(path, f)) for f in files]
    df = pd.concat(df_list, ignore_index=True)

    if verbosity > 0:
        end = datetime.datetime.now()
        print(" Finished. Took {}".format(end - start))
    return df


def read_parquet_as_pandas(path, verbosity=1):
    """Workaround for pandas not being able to read folder-style parquet files."""
    if os.path.isdir(path):
        if verbosity > 1:
            print("Parquet file is actually folder.")
        return read_parquet_folder_as_pandas(path, verbosity)
    else:
        return pd.read_parquet(path)
This assumes that the relevant files in the parquet "file", which is actually a folder, end with ".parquet". This works for parquet files exported by databricks and might work with others as well (untested, happy about feedback in the comments).
The function read_parquet_as_pandas() can be used if it is not known beforehand whether the path is a folder or not.
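For example, a minimal usage sketch against the folder written in the question (the path is the hypothetical one used above):

df = read_parquet_as_pandas("path/myfile.parquet")  # works for a folder or a single parquet file
print(df.shape)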
Answered by martinarroyo
The problem is that Spark partitions the data due to its distributed nature (each executor writes its own part file inside the directory, and it is the directory that gets the given filename). This is not something supported by Pandas, which expects a single file rather than a directory.
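To make the partitioning concrete, here is a minimal sketch (assuming an active SparkSession named spark; the path is hypothetical): the number of part files in the output folder follows the number of partitions of the DataFrame.

# Repartitioning before writing changes how many part files end up in the folder
df4 = spark.range(100).repartition(4)
df4.write.parquet("path/four_parts.parquet", mode="overwrite")
# The folder now holds four part-*.parquet files plus Spark's metadata files,
# which is why pandas cannot treat it as a single parquet file.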
You can circumvent this issue in different ways:
Reading the file with an alternative utility, such as pyarrow.parquet.ParquetDataset, and then converting that to Pandas (I did not test this code):

import pyarrow.parquet

arrow_df = pyarrow.parquet.ParquetDataset('path/myfile.parquet')
pandas_df = arrow_df.read().to_pandas()  # read() yields a pyarrow Table, which converts to pandas
Another way is to read the separate fragments separately and then concatenate them, as this answer suggests: Read multiple parquet files in a folder and write to single csv file using python
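A rough sketch of that fragment-by-fragment approach (untested, and assuming the part files end in .parquet and sit directly inside the output folder from the question):

import glob
import os

import pandas as pd

# Collect the individual part files Spark wrote into the folder
fragments = glob.glob(os.path.join("path/myfile.parquet", "*.parquet"))
# Read each fragment and concatenate everything into one pandas DataFrame
df = pd.concat((pd.read_parquet(f) for f in fragments), ignore_index=True)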