Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license and attribute it to the original authors (not me). Original source: http://stackoverflow.com/questions/18176178/

Date: 2020-08-19 10:04:30  Source: igfitidea

Python Multiprocessing Process or Pool for what I am doing?

Tags: python, multithreading, asynchronous, multiprocessing

Asked by dman

I'm new to multiprocessing in Python and trying to figure out whether I should use Pool or Process for calling two functions asynchronously. The two functions I have make curl calls and parse the information into 2 separate lists. Depending on the internet connection, each function could take about 4 seconds. I realize that the bottleneck is the ISP connection and multiprocessing won't speed it up much, but it would be nice to have them both kick off asynchronously. Plus, this is a great learning experience for me to get into Python's multiprocessing because I will be using it more later.

I have read Python multiprocessing.Pool: when to use apply, apply_async or map? and it was useful, but I still had my own questions.

So one way I could do it is:

from multiprocessing import Process

def foo():
    pass

def bar():
    pass

p1 = Process(target=foo, args=())
p2 = Process(target=bar, args=())

p1.start()
p2.start()
p1.join()
p2.join()

Questions I have for this implementation are: 1) Since join blocks until the calling process is completed... does this mean the p1 process has to finish before the p2 process is kicked off? I always understood .join() to be the same as pool.apply() and pool.apply_async().get(), where the parent process can not launch another process (task) until the current one running is completed.

The other alternative would be something like:

from multiprocessing import Pool

def foo():
    pass

def bar():
    pass

pool = Pool(processes=2)
p1 = pool.apply_async(foo)
p2 = pool.apply_async(bar)

Questions I have for this implementation are: 1) Do I need a pool.close() and pool.join()? 2) Would pool.map() make them all complete before I could get results? And if so, are they still run asynchronously? 3) How would pool.apply_async() differ from doing each process with pool.apply()? 4) How would this differ from the previous implementation with Process?

Accepted answer by lmjohns3

The two scenarios you listed accomplish the same thing but in slightly different ways.

The first scenario starts two separate processes (call them P1 and P2), starts P1 running foo and P2 running bar, and then waits until both processes have finished their respective tasks.

The second scenario starts two processes (call them Q1 and Q2) and first starts foo on either Q1 or Q2, and then starts bar on either Q1 or Q2. Then the code waits until both function calls have returned.

So the net result is actually the same, but in the first case you're guaranteed to run foo and bar on different processes.

As for the specific questions you had about concurrency, the .join() method on a Process does indeed block until the process has finished, but because you called .start() on both P1 and P2 (in your first scenario) before joining, both processes will run asynchronously. The interpreter will, however, wait until P1 finishes before attempting to wait for P2 to finish.
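A minimal sketch of this behavior, with sleep calls standing in for the slow curl work, shows that the two processes overlap even though join() blocks the parent:

```python
import time
from multiprocessing import Process

def fake_curl_call(name):
    # Stand-in for a slow network call taking ~0.3s.
    time.sleep(0.3)

def run_both():
    p1 = Process(target=fake_curl_call, args=("foo",))
    p2 = Process(target=fake_curl_call, args=("bar",))
    start = time.time()
    p1.start()
    p2.start()   # p2 is launched immediately; nothing waits for p1 here
    p1.join()    # blocks only the parent process, not p2
    p2.join()
    return time.time() - start

if __name__ == "__main__":
    # Total elapsed time is ~0.3s (parallel), not ~0.6s (sequential).
    print("elapsed: %.2fs" % run_both())
```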

For your questions about the pool scenario, you should technically use pool.close(), but it kind of depends on what you might need the pool for afterwards (if it just goes out of scope then you don't need to close it necessarily). pool.map() is a completely different kind of animal, because it distributes a bunch of arguments to the same function (asynchronously) across the pool processes, and then waits until all function calls have completed before returning the list of results.
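For example, pool.map() farms one function out over a list of arguments and blocks until every result is back:

```python
from multiprocessing import Pool

def square(x):
    return x * x

def run_map_demo():
    # map() distributes the arguments across the workers,
    # but does not return until every call has completed.
    with Pool(processes=2) as pool:
        return pool.map(square, [1, 2, 3, 4])

if __name__ == "__main__":
    print(run_map_demo())  # [1, 4, 9, 16]
```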

Answer by Maciej Gol

Since you're fetching data with curl calls, you are IO-bound. In such a case grequests might come in handy. These are really neither processes nor threads but coroutines - lightweight threads. This would allow you to send HTTP requests asynchronously, and then use multiprocessing.Pool to speed up the CPU-bound part.
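With grequests the fetch looks roughly like `grequests.map(grequests.get(u) for u in urls)`. If you would rather stay in the standard library, threads give a similar effect for IO-bound work; a sketch, where `fetch` is a hypothetical stand-in for the curl call:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Hypothetical stand-in for an IO-bound curl/HTTP call.
    time.sleep(0.1)
    return "payload from " + url

def fetch_all(urls):
    # Threads work well here because the task is IO-bound:
    # the GIL is released while each thread waits on the network.
    with ThreadPoolExecutor(max_workers=len(urls)) as executor:
        return list(executor.map(fetch, urls))

if __name__ == "__main__":
    print(fetch_all(["http://a.example", "http://b.example"]))
```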

1) Since join blocks until calling process is completed...does this mean p1 process has to finish before p2 process is kicked off?

Yes, p2.join() is called after p1.join() has returned, meaning p1 has finished.

1) Do I need a pool.close() and pool.join()?

You could end up with orphaned processes without calling close() and join() (if the processes serve indefinitely).
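A sketch of the tidy shutdown sequence: get the results you need, then close() to refuse further tasks and join() to wait for the workers to exit:

```python
from multiprocessing import Pool

def work(x):
    return x + 1

def run_with_cleanup():
    pool = Pool(processes=2)
    result = pool.apply_async(work, (41,))
    value = result.get()   # wait for this task's result
    pool.close()           # no more tasks will be submitted
    pool.join()            # wait for all workers to terminate cleanly
    return value

if __name__ == "__main__":
    print(run_with_cleanup())  # 42
```

Since Python 3.3 you can also use the pool as a context manager (`with Pool(2) as pool:`), though note that on exit it calls terminate() rather than close()/join().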

2) Would pool.map() make them all complete before I could get results? And if so, are they still run async?

They are run asynchronously, but map() blocks until all tasks are done.

3) How would pool.apply_async() differ from doing each process with pool.apply()?

pool.apply() is blocking, so basically you would do the processing synchronously.

4) How would this differ from the previous implementation with Process?

Chances are a worker is done with foo before you apply bar, so you might end up with a single worker doing all the work. Also, if one of your workers dies, Pool automatically spawns a new one (you'd need to reapply the task).

To sum up: I would rather go with Pool - it's perfect for producer-consumer cases and takes care of all the task-distributing logic.
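Put together for the question's original case, a Pool version might look like this, where foo and bar are stand-ins for the two curl-and-parse functions:

```python
from multiprocessing import Pool

def foo():
    # Stand-in for the first curl-and-parse function.
    return ["foo-item-1", "foo-item-2"]

def bar():
    # Stand-in for the second curl-and-parse function.
    return ["bar-item-1"]

def fetch_both():
    with Pool(processes=2) as pool:
        r1 = pool.apply_async(foo)   # both kicked off without blocking
        r2 = pool.apply_async(bar)
        return r1.get(), r2.get()    # block until both lists are ready

if __name__ == "__main__":
    list1, list2 = fetch_both()
    print(list1, list2)
```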