Parallel processing in pandas python

Disclaimer: This page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/36054321/


parallel processing in pandas python

Tags: python, pandas

Asked by surya

I have 5,000,000 rows in my dataframe. In my code, I am using iterrows(), which is taking too much time. To get the required output, I have to iterate through all the rows. So I wanted to know whether I can parallelize the code in pandas.

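For context, a row-by-row loop of the kind being described looks something like this (the DataFrame and the per-row computation here are hypothetical, not taken from the original question):

import pandas as pd

df = pd.DataFrame({'a': range(5), 'b': range(5)})

total = 0
for idx, row in df.iterrows():
    # iterrows() yields an (index, Series) pair for every row; building a
    # Series per row is what makes it slow on millions of rows
    total += row['a'] * row['b']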

Answered by jtitusj

Here's a webpage I found that might help: http://gouthamanbalaraman.com/blog/distributed-processing-pandas.html

And here's the multiprocessing code found on that page:

import pandas as pd
import multiprocessing as mp

LARGE_FILE = r"D:\my_large_file.txt"  # raw string so the backslash is not treated as an escape
CHUNKSIZE = 100000  # process 100,000 rows at a time

def process_frame(df):
    # process a single chunk; here we simply count its rows
    return len(df)

if __name__ == '__main__':
    reader = pd.read_table(LARGE_FILE, chunksize=CHUNKSIZE)
    pool = mp.Pool(4)  # use 4 worker processes

    funclist = []
    for df in reader:
        # submit each chunk to the pool without blocking
        f = pool.apply_async(process_frame, [df])
        funclist.append(f)

    result = 0
    for f in funclist:
        result += f.get(timeout=10)  # wait up to 10 seconds per chunk

    pool.close()
    pool.join()

    print("There are %d rows of data" % result)