python - Using a multiprocessing pool of workers
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA terms, credit the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/1586754/
Using multiprocessing pool of workers
Asked by Gökhan Sever
I have written the following code to put my lazy second CPU core to work. What the code does, basically, is first find the desired "sea" files in the directory hierarchy and then execute a set of external scripts to process these binary "sea" files to produce 50 to 100 text and binary files. As the title of the question suggests, it does this in a parallel fashion to increase the processing speed.
This question originates from the long discussion we have been having on the IPython users list, titled "Cannot start ipcluster", which began with my experimentation with IPython's parallel processing functionality.
The issue is that I can't get this code running correctly. If the folders that contain the "sea" files house only "sea" files, the script finishes its execution without fully performing the external script runs. (Say I have 30-50 external scripts to run, but my multiprocessing-enabled script exhausts itself after executing only the first script in this external script chain.) Interestingly, if I run this script on an already processed folder (that is, one whose "sea" files were processed beforehand and whose output files are already in it), then it runs, but this time I get speed-ups of about 2.4 to 2.7x with respect to the linear processing timings. That is not really expected, since I only have a Core 2 Duo 2.5 GHz CPU in my laptop. Although I have a CUDA-powered GPU, it has nothing to do with my current parallel computing struggle :)
What do you think might be the source of this issue?
Thank you for all comments and suggestions.
#!/usr/bin/env python
from multiprocessing import Pool
from subprocess import call
import os


def find_sea_files():
    # Walk the directory tree and collect every *.sea file name together
    # with the absolute path of the directory it lives in.
    file_list, path_list = [], []
    init = os.getcwd()
    for root, dirs, files in os.walk('.'):
        dirs.sort()
        for file in files:
            if file.endswith('.sea'):
                file_list.append(file)
                os.chdir(root)
                path_list.append(os.getcwd())
                os.chdir(init)
    return file_list, path_list


def process_all(pf):
    # pf is a [path, filename] pair; run the external post-processing
    # script from inside the directory that holds the .sea file.
    os.chdir(pf[0])
    call(['postprocessing_saudi', pf[1]])


if __name__ == '__main__':
    pool = Pool(processes=2)  # start 2 worker processes
    files, paths = find_sea_files()
    pathfile = [[paths[i], files[i]] for i in range(len(files))]
    pool.map(process_all, pathfile)
Accepted answer by Eric Lubow
I would start with getting a better feeling for what is going on with the worker process. The multiprocessing module comes with logging for its subprocesses if you need it. Since you have simplified the code to narrow down the problem, I would just debug with a few print statements, like so (or you can pretty-print the pf array):
def process_all(pf):
    print "PID: ", os.getpid()
    print "Script Dir: ", pf[0]
    print "Script: ", pf[1]
    os.chdir(pf[0])
    call(['postprocessing_saudi', pf[1]])


if __name__ == '__main__':
    pool = Pool(processes=2)
    files, paths = find_sea_files()
    pathfile = [[paths[i], files[i]] for i in range(len(files))]
    pool.map(process_all, pathfile, 1)  # ensure the chunk size is 1
    pool.close()
    pool.join()
The version of Python that I accomplished this with is 2.6.4.
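For the multiprocessing logging mentioned above, a minimal sketch (assuming the same find_sea_files and process_all as in the question) could look like the following; multiprocessing.log_to_stderr() returns the module's own logger, so its internal messages end up on stderr:

import logging
import multiprocessing

if __name__ == '__main__':
    # Route multiprocessing's internal messages (worker startup/exit,
    # pool shutdown) to stderr at DEBUG verbosity.
    logger = multiprocessing.log_to_stderr()
    logger.setLevel(logging.DEBUG)

    pool = multiprocessing.Pool(processes=2)
    files, paths = find_sea_files()
    pathfile = [[paths[i], files[i]] for i in range(len(files))]
    pool.map(process_all, pathfile, 1)
    pool.close()
    pool.join()

If the pool really does stop handing out tasks after the first external script finishes, the worker and shutdown messages in that output should make it visible.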
Answered by UsAaR33
There are several things I can think of:
1) Have you printed out the pathfiles? Are you sure that they are all properly generated?
a) I ask because your os.walk is a bit interesting; the dirs.sort() should be OK, but seems quite unnecessary. os.chdir() in general shouldn't be used; the restoration should be alright, but in general you should just be appending root to init (a sketch of this is given at the end of this answer).
2) I've seen multiprocessing on python2.6 have problems spawning subprocesses from pools. (I specifically had a script use multiprocessing to spawn subprocesses. Those subprocesses then could not correctly use multiprocessing (the pool locked up).) Try python2.5 with the multiprocessing backport.
3) Try picloud's cloud.mp module (which wraps multiprocessing, but handles pools a tad differently) and see if that works.
You would do
cloud.mp.join(cloud.mp.map(process_all, pathfile))
(Disclaimer: I am one of the developers of PiCloud)
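Regarding point 1a) above, one possible rewrite of find_sea_files that avoids os.chdir() altogether is sketched below; it appends root to init with os.path instead of changing directories, and process_all lets subprocess switch the working directory for the child only via the cwd argument. (The cwd keyword and the path handling are an illustration of the point, not part of the original answer.)

import os
from subprocess import call


def find_sea_files():
    # Build (file name, absolute directory) lists without ever touching
    # the interpreter's working directory.
    file_list, path_list = [], []
    init = os.getcwd()
    for root, dirs, files in os.walk('.'):
        dirs.sort()
        for name in files:
            if name.endswith('.sea'):
                file_list.append(name)
                path_list.append(os.path.abspath(os.path.join(init, root)))
    return file_list, path_list


def process_all(pf):
    # Run the external script with the child's working directory set to
    # the folder that contains the .sea file.
    call(['postprocessing_saudi', pf[1]], cwd=pf[0])

Keeping the parent's working directory untouched makes the path bookkeeping easier to reason about; whether it fixes the early-exit problem still depends on what the external scripts themselves do.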