pandas: read multiple parquet files in a folder and write to a single csv file using python
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/51696655/
Read multiple parquet files in a folder and write to single csv file using python
Asked by Pri31
I am new to python and I have a scenario where there are multiple parquet files with file names in order, e.g. par_file1, par_file2, par_file3 and so on, up to 100 files in a folder.
I need to read these parquet files starting from file1 in order and write them to a single csv file. After writing the contents of file1, the contents of file2 should be appended to the same csv without the header. Note that all files have the same column names and only the data is split into multiple files.
I learned to convert a single parquet file to a csv file using pyarrow with the following code:
import pandas as pd
df = pd.read_parquet('par_file.parquet')
df.to_csv('csv_file.csv')
But I couldn't extend this to loop over multiple parquet files and append to a single csv. Is there a method in pandas to do this? Any other way to do this would also be of great help. Thank you.
Accepted answer by Yo_Chris
If you are going to copy the files over to your local machine and run your code, you could do something like this. The code below assumes that you are running it in the same directory as the parquet files. It also assumes the file naming you provided above: "par_file1, par_file2, par_file3 and so on up to 100 files in a folder." If you need to search for your files instead, you will need to get the file names using glob and explicitly provide the path where you want to save the csv: open(r'this\is\your\path\to\csv_file.csv', 'a'). A glob-based sketch is included after the code below.
Hope this helps.
import pandas as pd

# Create the csv file and write the first parquet file with headers
with open('csv_file.csv', 'w') as csv_file:
    print('Reading par_file1.parquet')
    df = pd.read_parquet('par_file1.parquet')
    df.to_csv(csv_file, index=False)
    print('par_file1.parquet appended to csv_file.csv\n')

# create the remaining file names and append them to an empty list
# to look for in the current directory
files = []
for i in range(2, 101):
    files.append(f'par_file{i}.parquet')

# open csv_file.csv in append mode and add each parquet file without headers
for f in files:
    print(f'Reading {f}')
    df = pd.read_parquet(f)
    with open('csv_file.csv', 'a') as file:
        df.to_csv(file, header=False, index=False)
    print(f'{f} appended to csv_file.csv\n')
You can remove the print statements if you want.
Tested in python 3.6 using pandas 0.23.3.
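For completeness, here is a minimal sketch of the glob-based variant mentioned above. It assumes the par_file<N>.parquet naming from the question and that numeric order is what matters; it has not been tested, so adjust the pattern and output path as needed.

import glob
import re
import pandas as pd

# collect par_file*.parquet and sort numerically so par_file10 comes after par_file9
files = sorted(glob.glob('par_file*.parquet'),
               key=lambda name: int(re.search(r'\d+', name).group()))

for i, f in enumerate(files):
    df = pd.read_parquet(f)
    # write the header only for the first file, then append without headers
    df.to_csv('csv_file.csv', mode='w' if i == 0 else 'a',
              header=(i == 0), index=False)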
Answered by PMende
I ran into this question while looking to see whether pandas can natively read partitioned parquet datasets. I have to say that the accepted answer is unnecessarily verbose (making it difficult to parse). I also imagine that constantly opening and closing file handles, then scanning to the end of the file for every append, is not particularly efficient, depending on the file size.
A better alternative would be to read all the parquet files into a single DataFrame, and write it once:
from pathlib import Path
import pandas as pd

data_dir = Path('dir/to/parquet/files')
full_df = pd.concat(
    pd.read_parquet(parquet_file)
    for parquet_file in data_dir.glob('*.parquet')
)
full_df.to_csv('csv_file.csv')
Alternatively, if you really want to just append to the file:
data_dir = Path('dir/to/parquet/files')
for i, parquet_path in enumerate(data_dir.glob('*.parquet')):
    df = pd.read_parquet(parquet_path)
    write_header = i == 0  # write header only on the 0th file
    write_mode = 'w' if i == 0 else 'a'  # 'write' mode for 0th file, 'append' otherwise
    df.to_csv('csv_file.csv', mode=write_mode, header=write_header)
A final alternative for appending each file is to open the target CSV file in "a+" mode at the onset, keeping the file handle positioned at the end of the file for each write/append (I believe this works, but haven't actually tested it):
data_dir = Path('dir/to/parquet/files')
with open('csv_file.csv', "a+") as csv_handle:
    for i, parquet_path in enumerate(data_dir.glob('*.parquet')):
        df = pd.read_parquet(parquet_path)
        write_header = i == 0  # write header only on the 0th file
        df.to_csv(csv_handle, header=write_header)
Answered by Puttur Kamath
This helped me load all the parquet files into one data frame:
import glob
import pandas as pd

files = glob.glob("*.snappy.parquet")
data = [pd.read_parquet(f, engine='fastparquet') for f in files]
merged_data = pd.concat(data, ignore_index=True)
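To complete the original task, the merged frame could then be written out once; a one-line sketch, assuming the same csv_file.csv target as in the question:

merged_data.to_csv('csv_file.csv', index=False)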
Answered by loknar
A small change for those trying to read remote files, which helps to read them faster (a direct read_parquet on the remote files was much slower for me):
import io
import pandas as pd

merged = []
# remote_reader = ...  <- init some remote reader, for example AzureDLFileSystem()
# files = ...          <- iterable of remote parquet file paths (assumed to exist already)
for f in files:
    with remote_reader.open(f, 'rb') as f_reader:
        # read the raw parquet bytes from the open remote handle
        merged.append(f_reader.read())
merged = pd.concat((pd.read_parquet(io.BytesIO(file_bytes)) for file_bytes in merged))
Adds a little temporary memory overhead though.