python - How do I limit the number of active threads in Python?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/1787397/
How do I limit the number of active threads in python?
Asked by thornomad
I'm new to Python and making some headway with threading - I'm doing some music file conversion and want to be able to utilize the multiple cores on my machine (one active conversion thread per core).
import subprocess
import threading

class EncodeThread(threading.Thread):
    # this is hacked together a bit, but should give you an idea
    def run(self):
        decode = subprocess.Popen(["flac", "--decode", "--stdout", self.src],
                                  stdout=subprocess.PIPE)
        encode = subprocess.Popen(["lame", "--quiet", "-", self.dest],
                                  stdin=decode.stdout)
        encode.communicate()

# some other code puts these threads with various src/dest pairs in a list

for proc in threads:  # `threads` is my list of `threading.Thread` objects
    proc.start()
Everything works, all the files get encoded, bravo! ... however, all the processes spawn immediately, yet I only want to run two at a time (one for each core). As soon as one is finished, I want it to move on to the next on the list until it is finished, then continue with the program.
How do I do this?
(I've looked at the thread pool and queue functions but I can't find a simple answer.)
Edit: maybe I should add that each of my threads is using subprocess.Popen to run a separate command-line decoder (flac) piped to stdout, which is fed into a command-line encoder (lame/mp3).
Answered by Andre Holzner
If you want to limit the number of parallel threads, use a semaphore:
threadLimiter = threading.BoundedSemaphore(maximumNumberOfThreads)

class EncodeThread(threading.Thread):
    def run(self):
        threadLimiter.acquire()
        try:
            <your code here>
        finally:
            threadLimiter.release()
Start all threads at once. All but maximumNumberOfThreads will wait in threadLimiter.acquire(), and a waiting thread will only continue once another thread goes through threadLimiter.release().
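Putting the semaphore together with the original EncodeThread, a complete minimal sketch might look like the following. It assumes flac and lame are on the PATH, and the `pairs` list of (src, dest) tuples is hypothetical - the original snippet doesn't show how the threads are constructed:

import subprocess
import threading

MAX_ACTIVE = 2  # one active conversion per core on a dual-core machine
thread_limiter = threading.BoundedSemaphore(MAX_ACTIVE)

class EncodeThread(threading.Thread):
    def __init__(self, src, dest):
        threading.Thread.__init__(self)
        self.src = src
        self.dest = dest

    def run(self):
        thread_limiter.acquire()      # blocks while MAX_ACTIVE conversions are running
        try:
            decode = subprocess.Popen(["flac", "--decode", "--stdout", self.src],
                                      stdout=subprocess.PIPE)
            encode = subprocess.Popen(["lame", "--quiet", "-", self.dest],
                                      stdin=decode.stdout)
            decode.stdout.close()     # so lame sees EOF when flac exits
            encode.communicate()
        finally:
            thread_limiter.release()  # lets the next waiting thread proceed

# start everything at once; the semaphore keeps only MAX_ACTIVE busy
threads = [EncodeThread(src, dest) for src, dest in pairs]  # `pairs` is hypothetical
for t in threads:
    t.start()
for t in threads:
    t.join()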
Answered by S.Lott
"Each of my threads is using subprocess.Popen
to run a separate command line [process]".
Why have a bunch of threads manage a bunch of processes? That's exactly what an OS does for you. Why micro-manage what the OS already manages?
Rather than fool around with threads overseeing processes, just fork off processes. Your process table probably can't handle 2000 processes, but it can handle a few dozen (maybe a few hundred) pretty easily.
You want to have more work queued up than your CPUs can possibly handle. The real question is one of memory -- not processes or threads. If the sum of all the active data for all the processes exceeds physical memory, then data has to be swapped, and that will slow you down.
If your processes have a fairly small memory footprint, you can have lots and lots running. If your processes have a large memory footprint, you can't have very many running.
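As a rough sketch of that idea (no threads at all, just a list of Popen objects capped at an arbitrary limit; flac/lame on the PATH and the `pairs` list of (src, dest) tuples are assumptions, not part of the answer):

import subprocess

MAX_RUNNING = 24  # comfortably more work in flight than the CPUs can chew through

def encode_all(pairs):
    # pairs is a hypothetical list of (src, dest) tuples
    running = []
    for src, dest in pairs:
        if len(running) >= MAX_RUNNING:
            running.pop(0).wait()     # block until the oldest encoder finishes
        decode = subprocess.Popen(["flac", "--decode", "--stdout", src],
                                  stdout=subprocess.PIPE)
        encode = subprocess.Popen(["lame", "--quiet", "-", dest],
                                  stdin=decode.stdout)
        decode.stdout.close()
        running.append(encode)
    for proc in running:              # wait for the stragglers
        proc.wait()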
Answered by Jim Dennis
Short answer: don't use threads.
For a working example, you can look at something I've recently tossed together at work. It's a little wrapper around ssh which runs a configurable number of Popen() subprocesses. I've posted it at: Bitbucket: classh (Cluster Admin's ssh Wrapper).
As noted, I don't use threads; I just spawn off the children, loop over them calling their .poll() methods and checking for timeouts (also configurable), and replenish the pool as I gather the results. I've played with different sleep() values, and in the past I've written a version (before the subprocess module was added to Python) which used the signal module (SIGCHLD and SIGALRM) and the os.fork() and os.execve() functions --- and which did my own pipe and file descriptor plumbing, etc.
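The general shape of that poll-and-replenish loop, with a per-job timeout, might look something like the sketch below. It is not the actual classh code; the command list, the concurrency cap, and the timeout value are made up for illustration:

import subprocess
import time

MAX_CHILDREN = 100  # roughly the concurrency level mentioned in this answer
TIMEOUT = 300       # hypothetical per-job limit, in seconds

def run_jobs(commands):
    # commands is a hypothetical list of argv lists, e.g. [["ssh", host, "uptime"], ...]
    pending = list(commands)
    active = []        # (Popen, start_time) pairs
    results = []
    while pending or active:
        # replenish the pool
        while pending and len(active) < MAX_CHILDREN:
            active.append((subprocess.Popen(pending.pop(0)), time.time()))
        still_running = []
        for proc, started in active:
            if proc.poll() is not None:
                results.append(proc.returncode)   # finished normally
            elif time.time() - started > TIMEOUT:
                proc.kill()                       # exceeded the timeout
                proc.wait()
                results.append(None)
            else:
                still_running.append((proc, started))
        active = still_running
        time.sleep(0.5)  # the sleep() value worth tuning, as mentioned above
    return results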
In my case I'm incrementally printing results as I gather them ... and remembering all of them to summarize at the end (when all the jobs have completed or been killed for exceeding the timeout).
I ran that, as posted, on a list of 25,000 internal hosts (many of which are down, retired, located internationally, not accessible to my test account etc). It completed the job in just over two hours and had no issues. (There were about 60 of them that were timeouts due to systems in degenerate/thrashing states -- proving that my timeout handling works correctly).
So I know this model works reliably. Running 100 concurrent ssh processes with this code doesn't seem to cause any noticeable impact. (It's a moderately old FreeBSD box.) I used to run the old (pre-subprocess) version with 100 concurrent processes on my old 512MB laptop without problems, too.
(BTW: I plan to clean this up and add features to it; feel free to contribute or to clone off your own branch of it; that's what Bitbucket.org is for).
Answered by Edmund
If you're using the default "cpython" version then this won't help you, because only one thread can execute at a time; look up Global Interpreter Lock. Instead, I'd suggest looking at the multiprocessing module in Python 2.6 -- it makes parallel programming a cinch. You can create a Pool object with 2*num_threads processes, and give it a bunch of tasks to do. It will execute up to 2*num_threads tasks at a time, until all are done.
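A minimal sketch of that Pool approach, adapted to the flac-to-mp3 job from the question (the file names in the work list are hypothetical, and flac/lame must be installed):

import multiprocessing
import subprocess

def encode(job):
    # runs in a separate worker process; job is a (src, dest) tuple
    src, dest = job
    decode = subprocess.Popen(["flac", "--decode", "--stdout", src],
                              stdout=subprocess.PIPE)
    encoder = subprocess.Popen(["lame", "--quiet", "-", dest],
                               stdin=decode.stdout)
    decode.stdout.close()
    encoder.communicate()
    return dest

if __name__ == "__main__":
    jobs = [("a.flac", "a.mp3"), ("b.flac", "b.mp3")]  # hypothetical work list
    pool = multiprocessing.Pool(processes=2 * multiprocessing.cpu_count())
    for finished in pool.imap_unordered(encode, jobs):
        print(finished)
    pool.close()
    pool.join()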
At work I have recently migrated a bunch of Python XML tools (a differ, xpath grepper, and bulk xslt transformer) to use this, and have had very nice results with two processes per processor.
Answered by jkp
It looks to me that what you want is a pool of some sort, and in that pool you would like to have n threads where n == the number of processors on your system. You would then have another thread whose only job was to feed jobs into a queue which the worker threads could pick up and process as they became free (so for a dual-core machine, you'd have three threads but the main thread would be doing very little).
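That pattern could be sketched roughly as follows, with a fixed number of worker threads pulling (src, dest) pairs off a queue. The src_dest_pairs list is hypothetical, and the module is called Queue on Python 2 and queue on Python 3:

import queue          # `Queue` on Python 2
import subprocess
import threading

NUM_WORKERS = 2       # n == the number of processors on your system

def worker(jobs):
    while True:
        item = jobs.get()
        if item is None:              # sentinel: no more work, shut down
            break
        src, dest = item
        decode = subprocess.Popen(["flac", "--decode", "--stdout", src],
                                  stdout=subprocess.PIPE)
        encode = subprocess.Popen(["lame", "--quiet", "-", dest],
                                  stdin=decode.stdout)
        decode.stdout.close()
        encode.communicate()

jobs = queue.Queue()
workers = [threading.Thread(target=worker, args=(jobs,)) for _ in range(NUM_WORKERS)]
for w in workers:
    w.start()

for pair in src_dest_pairs:           # hypothetical list of (src, dest) tuples
    jobs.put(pair)

for _ in workers:                     # one sentinel per worker
    jobs.put(None)
for w in workers:
    w.join()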
As you are new to Python though, I'll assume you don't know about the GIL and its side-effects with regard to threading. If you read the article I linked you will soon understand why traditional multithreading solutions are not always the best in the Python world. Instead you should consider using the multiprocessing module (new in Python 2.6; in 2.5 you can use this backport) to achieve the same effect. It side-steps the issue of the GIL by using multiple processes as if they were threads within the same application. There are some restrictions about how you share data (you are working in different memory spaces) but actually this is no bad thing: they just encourage good practice such as minimising the contact points between threads (or processes in this case).
In your case you are probably interested in using a pool as specified here.
Answered by inspectorG4dget
I am not an expert in this, but I have read something about "Lock"s. This article might help you out.
Hope this helps