Easiest way to read csv files with multiprocessing in Pandas

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/36587211/

Date: 2020-09-14 01:02:10  Source: igfitidea

Easiest way to read csv files with multiprocessing in Pandas

python, csv, pandas, multiprocessing

Asked by Han Zhengzu

Here is my question.
I have a bunch of .csv files (or other files). Pandas provides an easy way to read them and save them in DataFrame format. But when the number of files is huge, I want to read the files with multiprocessing to save some time.

My early attempt

I manually divided the files into different paths and processed each directory separately:

os.chdir("./task_1)
files = os.listdir('.')
files.sort()
for file in files:
    filename,extname = os.path.splitext(file)
    if extname == '.csv':
        f = pd.read_csv(file)
        df = (f.VALUE.as_matrix()).reshape(75,90)   

And then combine them.

How can I run them with Pool to solve my problem?
Any advice would be appreciated!

Answered by zemekeneng

Using Pool:

import os
import pandas as pd 
from multiprocessing import Pool

# wrap your csv importer in a function that can be mapped
def read_csv(filename):
    'converts a filename to a pandas dataframe'
    return pd.read_csv(filename)


def main():

    # get a list of file names
    files = os.listdir('.')
    file_list = [filename for filename in files if filename.endswith('.csv')]

    # set up your pool
    with Pool(processes=8) as pool: # or whatever your hardware can support

        # have your pool map the file names to dataframes
        df_list = pool.map(read_csv, file_list)

        # reduce the list of dataframes to a single dataframe
        combined_df = pd.concat(df_list, ignore_index=True)

if __name__ == '__main__':
    main()

Answered by Boud

The dask library is designed to address, among other things, exactly this kind of issue.
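
For reference, a minimal sketch of what that might look like, assuming the same ./task_1 directory of csv files as in the question (dask.dataframe.read_csv accepts a glob pattern and reads the files in parallel):

import dask.dataframe as dd

# build a lazy dataframe over every csv in the directory (files are read in parallel)
ddf = dd.read_csv('./task_1/*.csv')

# compute() materializes the result as an ordinary pandas DataFrame
df = ddf.compute()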

Answered by elefun

If you aren't against using another library, you could use GraphLab's SFrame. It creates an object similar to a DataFrame and reads data very quickly, which helps when performance is a big issue.
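
A rough sketch of that approach, assuming the standalone sframe package (its successor, turicreate, exposes the same SFrame interface); the exact import depends on which GraphLab-era distribution you have installed, and the file name here is just a placeholder:

import sframe  # or, in newer setups: import turicreate as sframe

# hypothetical file name; SFrame's csv reader is parallelized internally
sf = sframe.SFrame.read_csv('big_file.csv')

# convert back to a pandas DataFrame if the rest of the pipeline needs one
df = sf.to_dataframe()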

Answered by famaral42

I could not get map/map_async to work, but I managed to get it working with apply_async.

Two possible ways (I have no idea which one is better):

  • A) Concat at the end
  • B) Concat during

I find glob easy for listing and filtering files in a directory:

from glob import glob
import pandas as pd
from multiprocessing import Pool

folder = "./task_1/" # note the "/" at the end
file_list = glob(folder + '*.csv')

def my_read(filename):
    f = pd.read_csv(filename)
    # wrap the reshaped values in a DataFrame so pd.concat in the callback can combine them
    return pd.DataFrame(f.VALUE.as_matrix().reshape(75, 90))

#DF_LIST = [] # A) end
DF = pd.DataFrame() # B) during

def DF_LIST_append(result):
    #DF_LIST.append(result) # A) end
    global DF # B) during
    DF = pd.concat([DF,result], ignore_index=True) # B) during

pool = Pool(processes=8)

for file in file_list:
    pool.apply_async(my_read, args = (file,), callback = DF_LIST_append)

pool.close()
pool.join()

#DF = pd.concat(DF_LIST, ignore_index=True) # A) end

print(DF.shape)
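
For completeness, a minimal sketch of variant A under the same assumptions (it reuses the my_read function and file_list from the code above), collecting the per-file results and concatenating once at the end:

from multiprocessing import Pool
import pandas as pd

DF_LIST = []

pool = Pool(processes=8)
for file in file_list:
    # each worker returns one DataFrame; the callback only collects it
    pool.apply_async(my_read, args=(file,), callback=DF_LIST.append)

pool.close()
pool.join()

# a single concat at the end avoids re-copying the growing DataFrame on every callback
DF = pd.concat(DF_LIST, ignore_index=True)
print(DF.shape)

Since each pd.concat in the variant-B callback copies the DataFrame accumulated so far, concatenating once at the end is usually the cheaper of the two.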