Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/36587211/
Easiest way to read csv files with multiprocessing in Pandas
Asked by Han Zhengzu
Here is my question.
I have a bunch of .csv files (or other files). Pandas is an easy way to read them and save them in DataFrame format. But when the number of files is huge, I want to read the files with multiprocessing to save some time.
My early attempt
I manually divided the files into different paths and read each batch separately:
os.chdir("./task_1)
files = os.listdir('.')
files.sort()
for file in files:
filename,extname = os.path.splitext(file)
if extname == '.csv':
f = pd.read_csv(file)
df = (f.VALUE.as_matrix()).reshape(75,90)
And then combine them.
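For reference, one way the per-file 75x90 arrays could be combined is by stacking them into a single 3-D array. This is a minimal sketch; the arrays list and the np.stack call are illustrative, not part of the original question:

import os
import numpy as np
import pandas as pd

# build one 75x90 array per csv file, then stack along a new first axis
arrays = []
for file in sorted(os.listdir('.')):
    if os.path.splitext(file)[1] == '.csv':
        f = pd.read_csv(file)
        arrays.append(f.VALUE.to_numpy().reshape(75, 90))

combined = np.stack(arrays)  # shape: (n_files, 75, 90)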
How can I run them with a pool to solve my problem?
Any advice would be appreciated!
Answered by zemekeneng
Using Pool:
import os
import pandas as pd
from multiprocessing import Pool

# wrap your csv importer in a function that can be mapped
def read_csv(filename):
    'converts a filename to a pandas dataframe'
    return pd.read_csv(filename)

def main():
    # get a list of file names; endswith is robust to dots elsewhere in the name
    files = os.listdir('.')
    file_list = [filename for filename in files if filename.endswith('.csv')]

    # set up your pool
    with Pool(processes=8) as pool:  # or whatever your hardware can support
        # have your pool map the file names to dataframes
        df_list = pool.map(read_csv, file_list)

        # reduce the list of dataframes to a single dataframe
        combined_df = pd.concat(df_list, ignore_index=True)

if __name__ == '__main__':
    main()
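A small follow-up note: pool.map returns the dataframes in the same order as file_list, and when there are many small files the optional chunksize argument of pool.map can batch the work sent to each worker and cut inter-process overhead. The value below is illustrative:

# batch 10 file names per worker task; tune to the workload
df_list = pool.map(read_csv, file_list, chunksize=10)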
Answered by famaral42
I could not get map/map_async to work, but managed to get apply_async working.
Two possible ways (I have no idea which one is better):
- A) Concatenate at the end
- B) Concatenate during processing
I find glob easy for listing and filtering files from a directory:
from glob import glob
import pandas as pd
from multiprocessing import Pool

folder = "./task_1/"  # note the "/" at the end
file_list = glob(folder + '*.csv')  # the files are read with read_csv, so match .csv

def my_read(filename):
    f = pd.read_csv(filename)
    # reshape the VALUE column and wrap it in a DataFrame so pd.concat can combine results
    return pd.DataFrame(f.VALUE.to_numpy().reshape(75, 90))

#DF_LIST = []         # A) end
DF = pd.DataFrame()   # B) during

def DF_LIST_append(result):
    #DF_LIST.append(result)  # A) end
    global DF                # B) during
    DF = pd.concat([DF, result], ignore_index=True)  # B) during

pool = Pool(processes=8)
for file in file_list:
    pool.apply_async(my_read, args=(file,), callback=DF_LIST_append)
pool.close()
pool.join()

#DF = pd.concat(DF_LIST, ignore_index=True)  # A) end
print(DF.shape)
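For readers who prefer option A, here it is written out on its own. This is a minimal sketch under the same assumptions as the code above (the hard-coded 75x90 reshape and the 8-process pool), not a separate answer from the original page:

from glob import glob
import pandas as pd
from multiprocessing import Pool

def my_read(filename):
    f = pd.read_csv(filename)
    return pd.DataFrame(f.VALUE.to_numpy().reshape(75, 90))

if __name__ == '__main__':
    file_list = glob("./task_1/*.csv")
    results = []  # each callback runs in the parent process and appends here
    pool = Pool(processes=8)
    for file in file_list:
        pool.apply_async(my_read, args=(file,), callback=results.append)
    pool.close()
    pool.join()
    DF = pd.concat(results, ignore_index=True)  # single concat at the end
    print(DF.shape)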

