How to write a partitioned Parquet file using Pandas

Note: this content is from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors. Original question: http://stackoverflow.com/questions/52934265/

python pandas parquet pyarrow

Asked by Ivan

I'm trying to write a Pandas dataframe to a partitioned file:

df.to_parquet('output.parquet', engine='pyarrow', partition_cols=['partone', 'parttwo'])

TypeError: __cinit__() got an unexpected keyword argument 'partition_cols'

From the documentation, I expected that partition_cols would be passed as a keyword argument to the pyarrow library. How can a partitioned file be written to local disk using pandas?

Answered by ostrokach

Pandas DataFrame.to_parquet is a thin wrapper over table = pa.Table.from_pandas(...) and pq.write_table(table, ...) (see pandas.parquet.py#L120), and pq.write_table does not support writing partitioned datasets. You should use pq.write_to_dataset instead.

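import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# yourData is a placeholder for your own data
df = pd.DataFrame(yourData)
# Convert the DataFrame to an Arrow table
table = pa.Table.from_pandas(df)

# Write the table as a partitioned dataset under root_path
pq.write_to_dataset(
    table,
    root_path='output.parquet',
    partition_cols=['partone', 'parttwo'],
)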

For more information, see the pyarrow documentation.

In general, I would always use the PyArrow API directly when reading / writing parquet files, since the Pandas wrapper is rather limited in what it can do.

Answered by sharadlahoti

You need to update to Pandas version 0.24 or above. Support for partition_cols was added in that version.