Multiprocess or threading in Python?
Original question: http://stackoverflow.com/questions/1226584/
Warning: these answers are provided under the CC BY-SA 4.0 license. You are free to use and share them, but you must attribute them to the original authors (not me): Stack Overflow.
multiprocess or threading in python?
Asked by Ryan
I have a python application that grabs a collection of data and, for each piece of data in that collection, performs a task. The task takes some time to complete because there is a delay involved. Because of this delay, I don't want the pieces of data to perform the task sequentially, one after another; I want them all to happen in parallel. Should I be using multiprocessing or threading for this operation?
I attempted to use threading but had some trouble; often some of the tasks would never actually fire.
Answered by Christopher
If you are truly compute bound, using the multiprocessing module is probably the lightest-weight solution (in terms of both memory consumption and implementation difficulty).
If you are I/O bound, using the threading module will usually give you good results. Make sure that you use thread-safe storage (like the Queue) to hand data to your threads. Or else hand them a single piece of data that is unique to them when they are spawned.
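For example, a minimal sketch of that worker-pool pattern (Python 3; the module is named Queue in Python 2, and do_task here is just a stand-in for the real delay-bound task):

import threading
import queue
import time

def do_task(item):
    # Stand-in for the real task; the sleep simulates the delay.
    time.sleep(1)
    print('done:', item)

def worker(q):
    while True:
        item = q.get()
        if item is None:      # sentinel: no more work for this thread
            break
        do_task(item)
        q.task_done()

q = queue.Queue()
threads = [threading.Thread(target=worker, args=(q,)) for _ in range(4)]
for t in threads:
    t.start()
for item in ['a', 'b', 'c', 'd', 'e']:   # the collection of data
    q.put(item)
for _ in threads:
    q.put(None)               # one sentinel per thread, to shut it down
for t in threads:
    t.join()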
PyPy is focused on performance. It has a number of features that can help with compute-bound processing. They also have support for Software Transactional Memory, although that is not yet production quality. The promise is that you can use simpler parallel or concurrent mechanisms than multiprocessing (which has some awkward requirements).
Stackless Python is also a nice idea. Stackless has portability issues, as indicated in the answer below. Unladen Swallow was promising, but is now defunct. Pyston is another (unfinished) Python implementation focusing on speed. It takes an approach different from PyPy's, which may yield better (or just different) speedups.
Answered by Davide Muzzarelli
Tasks run sequentially, but they give you the illusion of running in parallel. Tasks are good for file or connection I/O because they are lightweight.
Multiprocessing with a Pool may be the right solution for you, because processes run in parallel. That makes them very good for intensive computing, since each process runs on one CPU (or core).
Setting up multiprocessing can be very easy:
from multiprocessing import Pool

def worker(input_item):
    output = do_some_work(input_item)  # do_some_work is your per-item task
    return output

if __name__ == '__main__':  # guard is required so child processes can import this module safely
    pool = Pool()  # makes one process for each CPU (or core) of your PC; use Pool(4) to force 4 processes, for example
    list_of_results = pool.map(worker, input_list)  # launches all the work automatically
Answered by S.Lott
For small collections of data, simply create subprocesses with subprocess.Popen.
Each subprocess can simply get its piece of data from stdin or from command-line arguments, do its processing, and write the result to an output file.
When the subprocesses have all finished (or timed out), you simply merge the output files.
Very simple.
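A rough sketch of that approach (worker.py, the items, and the file names are all hypothetical; the real worker script would do the actual processing and print its result to stdout):

import subprocess

items = ['a', 'b', 'c']   # hypothetical: the small collection of data
procs = []
for i, item in enumerate(items):
    out = open('result_%d.txt' % i, 'w')
    # One subprocess per piece of data, reading its datum from argv.
    p = subprocess.Popen(['python', 'worker.py', item], stdout=out)
    procs.append((p, out))

for p, out in procs:
    p.wait()          # or poll() in a loop if you need a timeout
    out.close()

# Merge the per-process output files.
with open('merged.txt', 'w') as merged:
    for i in range(len(items)):
        with open('result_%d.txt' % i) as f:
            merged.write(f.read())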
Answered by Mark Rushakoff
You might consider looking into Stackless Python. If you have control over the function that takes a long time, you can just throw some stackless.schedule()s in there (saying yield to the next coroutine), or else you can set Stackless to preemptive multitasking.
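For illustration only, a tiny cooperative sketch of that first approach (it requires a Stackless Python interpreter rather than stock CPython; long_task is a made-up placeholder):

import stackless

def long_task(name):
    for step in range(3):
        # ... do one slice of the long-running work here ...
        stackless.schedule()    # cooperatively yield to the next tasklet
    print(name, 'finished')

stackless.tasklet(long_task)('task-a')   # create and schedule two tasklets
stackless.tasklet(long_task)('task-b')
stackless.run()                          # run the scheduler until all tasklets complete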
In Stackless, you don't have threads, but tasklets or greenlets, which are essentially very lightweight threads. It works great in the sense that there's a pretty good framework with very little setup to get multitasking going.
However, Stackless hinders portability because you have to replace a few of the standard Python libraries -- Stackless removes reliance on the C stack. It's very portable if the next user also has Stackless installed, but that will rarely be the case.
Answered by ire_and_curses
Using CPython's threading model will not give you any performance improvement, because the threads are not actually executed in parallel: the global interpreter lock (GIL), which protects CPython's memory management, lets only one thread execute Python bytecode at a time. Multiprocessing would allow parallel execution. Obviously, in this case you have to have multiple cores available to farm your parallel jobs out to.
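A quick way to see this on your own machine (timings vary; on a multi-core box the Process pass should finish noticeably faster than the Thread pass for this CPU-bound loop):

import time
from threading import Thread
from multiprocessing import Process

def burn():
    # CPU-bound busy work; no I/O, so threads gain nothing under the GIL.
    n = 0
    for _ in range(10**7):
        n += 1

if __name__ == '__main__':
    for kind in (Thread, Process):
        start = time.time()
        workers = [kind(target=burn) for _ in range(4)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        print(kind.__name__, round(time.time() - start, 2))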
There is much more information available in this related question.
Answered by nos
If you can easily partition and separate the data you have, it sounds like you should just do that partitioning externally and feed the partitions to several processes of your program (i.e. several processes instead of threads).
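One way to approximate that with the standard library (the data and the per-item work here are illustrative placeholders):

from multiprocessing import Process

def do_task(item):
    # Stand-in for the real per-item work.
    return item * item

def process_partition(chunk):
    for item in chunk:
        do_task(item)

if __name__ == '__main__':
    data = list(range(100))                   # the full collection (illustrative)
    n = 4                                     # number of worker processes
    chunks = [data[i::n] for i in range(n)]   # partition the data up front
    procs = [Process(target=process_partition, args=(c,)) for c in chunks]
    for p in procs:
        p.start()
    for p in procs:
        p.join()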
Answered by Eloff
IronPython has real multithreading, unlike CPython and its GIL. So depending on what you're doing, it may be worth looking at. But it sounds like your use case is better suited to the multiprocessing module.
To the guy who recommends Stackless Python: I'm not an expert on it, but it seems to me that he's talking about software "multithreading", which is actually not parallel at all (it still runs in one physical thread, so it cannot scale to multiple cores). It's merely an alternative way to structure an asynchronous (but still single-threaded, non-parallel) application.