How do you append to a Parquet file using pyarrow with pandas?

Question

Asked by Merlin

How do you append to or update a Parquet file with pyarrow?

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


table2 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})
table3 = pd.DataFrame({'six': [-1, np.nan, 2.5], 'nine': ['foo', 'bar', 'baz'], 'ten': [True, False, True]})


# write_table expects a pyarrow Table, so convert the DataFrame first
pq.write_table(pa.Table.from_pandas(table2), './dataNew/pqTest2.parquet')
# append to pqTest2 here?

I found nothing in the docs about appending to Parquet files. Also, can you use pyarrow with multiprocessing to insert/update the data?

Answer 1

Accepted answer by Ibraheem Ibraheem

I ran into the same issue and I think I was able to solve it using the following:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


chunksize=10000 # this is the number of lines

pqwriter = None
for i, df in enumerate(pd.read_csv('sample.csv', chunksize=chunksize)):
    table = pa.Table.from_pandas(df)
    # for the first chunk of records
    if i == 0:
        # create a parquet write object giving it an output file
        pqwriter = pq.ParquetWriter('sample.parquet', table.schema)            
    pqwriter.write_table(table)

# close the parquet writer
if pqwriter:
    pqwriter.close()

Answer 2

Answered by yardstick17

In your case the column names are not consistent. I made the column names consistent across three sample dataframes, and the following code worked for me.

# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def append_to_parquet_table(dataframe, filepath=None, writer=None):
    """Write or append a DataFrame in Parquet format.

    This function writes a pandas DataFrame as a pyarrow Table in Parquet format. If it is invoked
    with a writer, it appends the DataFrame to the file that writer already has open.

    :param dataframe: pd.DataFrame to be written in Parquet format.
    :param filepath: target file location for the Parquet file.
    :param writer: ParquetWriter object used to write pyarrow Tables in Parquet format.
    :return: ParquetWriter object. Pass it to subsequent calls to append further DataFrames
        to the same file.
    """
    table = pa.Table.from_pandas(dataframe)
    if writer is None:
        writer = pq.ParquetWriter(filepath, table.schema)
    writer.write_table(table=table)
    return writer


if __name__ == '__main__':

    table1 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})
    table2 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})
    table3 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})
    writer = None
    filepath = '/tmp/verify_pyarrow_append.parquet'
    table_list = [table1, table2, table3]

    for table in table_list:
        writer = append_to_parquet_table(table, filepath, writer)

    if writer:
        writer.close()

    df = pd.read_parquet(filepath)
    print(df)

Output:

   one  three  two
0 -1.0   True  foo
1  NaN  False  bar
2  2.5   True  baz
0 -1.0   True  foo
1  NaN  False  bar
2  2.5   True  baz
0 -1.0   True  foo
1  NaN  False  bar
2  2.5   True  baz

Answer 3

Answered by Wes McKinney

Generally speaking, Parquet datasets consist of multiple files, so you append by writing an additional file into the same directory the data belongs to. It would be useful to be able to concatenate multiple files easily. I opened https://issues.apache.org/jira/browse/PARQUET-1154 to make this easy to do in C++ (and therefore Python).
