Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/36587211/
Easiest way to read csv files with multiprocessing in Pandas
Asked by Han Zhengzu
Here is my question.
I have a bunch of .csv files (or other files). Pandas is an easy way to read them and save them in DataFrame format. But when the number of files is huge, I want to read the files with multiprocessing to save some time.
My early attempt
I manually divided the files into different paths and read each batch separately:
os.chdir("./task_1)
files = os.listdir('.')
files.sort()
for file in files:
filename,extname = os.path.splitext(file)
if extname == '.csv':
f = pd.read_csv(file)
df = (f.VALUE.as_matrix()).reshape(75,90)
And then combine them.
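For reference, one way the per-file 75x90 arrays could be combined is by stacking them into a single 3-D array. This is a minimal sketch; the arrays list and the np.stack call are illustrative, not part of the original question:

import os
import numpy as np
import pandas as pd

# build one 75x90 array per csv file, then stack along a new first axis
arrays = []
for file in sorted(os.listdir('.')):
    if os.path.splitext(file)[1] == '.csv':
        f = pd.read_csv(file)
        arrays.append(f.VALUE.to_numpy().reshape(75, 90))

combined = np.stack(arrays)  # shape: (n_files, 75, 90)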
How can I run them with a pool to solve my problem?
Any advice would be appreciated!
Answered by zemekeneng
Using Pool:
import os
import pandas as pd
from multiprocessing import Pool

# wrap your csv importer in a function that can be mapped
def read_csv(filename):
    'converts a filename to a pandas dataframe'
    return pd.read_csv(filename)

def main():
    # get a list of file names; endswith is robust to dots elsewhere in the name
    files = os.listdir('.')
    file_list = [filename for filename in files if filename.endswith('.csv')]

    # set up your pool
    with Pool(processes=8) as pool:  # or whatever your hardware can support
        # have your pool map the file names to dataframes
        df_list = pool.map(read_csv, file_list)

        # reduce the list of dataframes to a single dataframe
        combined_df = pd.concat(df_list, ignore_index=True)

if __name__ == '__main__':
    main()
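A small follow-up note: pool.map returns the dataframes in the same order as file_list, and when there are many small files the optional chunksize argument of pool.map can batch the work sent to each worker and cut inter-process overhead. The value below is illustrative:

# batch 10 file names per worker task; tune to the workload
df_list = pool.map(read_csv, file_list, chunksize=10)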
Answered by famaral42
I could not get map/map_async to work, but managed to get apply_async working.
Two possible ways (I have no idea which one is better):
- A) Concatenate at the end
- B) Concatenate during processing
I find glob easy for listing and filtering files from a directory:
from glob import glob
import pandas as pd
from multiprocessing import Pool

folder = "./task_1/"  # note the "/" at the end
file_list = glob(folder + '*.csv')  # the files are read with read_csv, so match .csv

def my_read(filename):
    f = pd.read_csv(filename)
    # reshape the VALUE column and wrap it in a DataFrame so pd.concat can combine results
    return pd.DataFrame(f.VALUE.to_numpy().reshape(75, 90))

#DF_LIST = []         # A) end
DF = pd.DataFrame()   # B) during

def DF_LIST_append(result):
    #DF_LIST.append(result)  # A) end
    global DF                # B) during
    DF = pd.concat([DF, result], ignore_index=True)  # B) during

pool = Pool(processes=8)
for file in file_list:
    pool.apply_async(my_read, args=(file,), callback=DF_LIST_append)
pool.close()
pool.join()

#DF = pd.concat(DF_LIST, ignore_index=True)  # A) end
print(DF.shape)
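For readers who prefer option A, here it is written out on its own. This is a minimal sketch under the same assumptions as the code above (the hard-coded 75x90 reshape and the 8-process pool), not a separate answer from the original page:

from glob import glob
import pandas as pd
from multiprocessing import Pool

def my_read(filename):
    f = pd.read_csv(filename)
    return pd.DataFrame(f.VALUE.to_numpy().reshape(75, 90))

if __name__ == '__main__':
    file_list = glob("./task_1/*.csv")
    results = []  # each callback runs in the parent process and appends here
    pool = Pool(processes=8)
    for file in file_list:
        pool.apply_async(my_read, args=(file,), callback=results.append)
    pool.close()
    pool.join()
    DF = pd.concat(results, ignore_index=True)  # single concat at the end
    print(DF.shape)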

