pandas: read multiple parquet files in a folder and write to a single csv file using python
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/51696655/
Read multiple parquet files in a folder and write to single csv file using python
Asked by Pri31
I am new to python and I have a scenario where there are multiple parquet files with file names in order, e.g. par_file1, par_file2, par_file3 and so on, up to 100 files in a folder.
I need to read these parquet files starting from file1 in order and write them to a single csv file. After writing the contents of file1, the contents of file2 should be appended to the same csv without the header. Note that all files have the same column names and only the data is split into multiple files.
I learned to convert a single parquet file to a csv file using pyarrow with the following code:
import pandas as pd
df = pd.read_parquet('par_file.parquet')
df.to_csv('csv_file.csv')
But I couldn't extend this to loop over multiple parquet files and append to a single csv. Is there a method in pandas to do this? Any other way to do this would also be of great help. Thank you.
Accepted answer by Yo_Chris
If you are going to copy the files over to your local machine and run your code, you could do something like this. The code below assumes that you are running it in the same directory as the parquet files. It also assumes the file naming you provided above: "par_file1, par_file2, par_file3 and so on up to 100 files in a folder." If you need to search for your files instead, you will need to get the file names using glob and explicitly provide the path where you want to save the csv: open(r'this\is\your\path\to\csv_file.csv', 'a'). A glob-based sketch is included after the code below.
Hope this helps.
import pandas as pd

# Create the csv file and write the first parquet file with headers
with open('csv_file.csv', 'w') as csv_file:
    print('Reading par_file1.parquet')
    df = pd.read_parquet('par_file1.parquet')
    df.to_csv(csv_file, index=False)
    print('par_file1.parquet appended to csv_file.csv\n')

# create the remaining file names and append them to an empty list
# to look for in the current directory
files = []
for i in range(2, 101):
    files.append(f'par_file{i}.parquet')

# open csv_file.csv in append mode and add each parquet file without headers
for f in files:
    print(f'Reading {f}')
    df = pd.read_parquet(f)
    with open('csv_file.csv', 'a') as file:
        df.to_csv(file, header=False, index=False)
    print(f'{f} appended to csv_file.csv\n')
You can remove the print statements if you want.
Tested in python 3.6 using pandas 0.23.3.
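For completeness, here is a minimal sketch of the glob-based variant mentioned above. It assumes the par_file<N>.parquet naming from the question and that numeric order is what matters; it has not been tested, so adjust the pattern and output path as needed.

import glob
import re
import pandas as pd

# collect par_file*.parquet and sort numerically so par_file10 comes after par_file9
files = sorted(glob.glob('par_file*.parquet'),
               key=lambda name: int(re.search(r'\d+', name).group()))

for i, f in enumerate(files):
    df = pd.read_parquet(f)
    # write the header only for the first file, then append without headers
    df.to_csv('csv_file.csv', mode='w' if i == 0 else 'a',
              header=(i == 0), index=False)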
Answered by PMende
I ran into this question while looking to see whether pandas can natively read partitioned parquet datasets. I have to say that the accepted answer is unnecessarily verbose (making it difficult to parse). I also imagine that constantly opening and closing file handles, then scanning to the end of the file for every append, is not particularly efficient, depending on the file size.
A better alternative would be to read all the parquet files into a single DataFrame, and write it once:
from pathlib import Path
import pandas as pd

data_dir = Path('dir/to/parquet/files')
full_df = pd.concat(
    pd.read_parquet(parquet_file)
    for parquet_file in data_dir.glob('*.parquet')
)
full_df.to_csv('csv_file.csv')
Alternatively, if you really want to just append to the file:
data_dir = Path('dir/to/parquet/files')
for i, parquet_path in enumerate(data_dir.glob('*.parquet')):
    df = pd.read_parquet(parquet_path)
    write_header = i == 0  # write header only on the 0th file
    write_mode = 'w' if i == 0 else 'a'  # 'write' mode for 0th file, 'append' otherwise
    df.to_csv('csv_file.csv', mode=write_mode, header=write_header)
A final alternative for appending each file is to open the target CSV file in "a+" mode at the onset, keeping the file handle positioned at the end of the file for each write/append (I believe this works, but haven't actually tested it):
data_dir = Path('dir/to/parquet/files')
with open('csv_file.csv', "a+") as csv_handle:
    for i, parquet_path in enumerate(data_dir.glob('*.parquet')):
        df = pd.read_parquet(parquet_path)
        write_header = i == 0  # write header only on the 0th file
        df.to_csv(csv_handle, header=write_header)
Answered by Puttur Kamath
This helped me load all the parquet files into one data frame:
import glob
import pandas as pd

files = glob.glob("*.snappy.parquet")
data = [pd.read_parquet(f, engine='fastparquet') for f in files]
merged_data = pd.concat(data, ignore_index=True)
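To complete the original task, the merged frame could then be written out once; a one-line sketch, assuming the same csv_file.csv target as in the question:

merged_data.to_csv('csv_file.csv', index=False)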
Answered by loknar
A small change for those trying to read remote files, which helps to read them faster (a direct read_parquet on the remote files was much slower for me):
import io
import pandas as pd

merged = []
# remote_reader = ...  <- init some remote reader, for example AzureDLFileSystem()
# files = ...          <- iterable of remote parquet file paths (assumed to exist already)
for f in files:
    with remote_reader.open(f, 'rb') as f_reader:
        # read the raw parquet bytes from the open remote handle
        merged.append(f_reader.read())
merged = pd.concat((pd.read_parquet(io.BytesIO(file_bytes)) for file_bytes in merged))
Adds a little temporary memory overhead though.