Joblib Parallel multiple CPUs slower than single

Note: this page is an English-Chinese translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original: http://stackoverflow.com/questions/21027477/
Asked by mhabiger
I've just started using the Joblib module and I'm trying to understand how the Parallel function works. Below is an example where parallelizing leads to a longer runtime, but I don't understand why. My runtime was 51 sec on 1 CPU vs. 217 sec on 2 CPUs.
My assumption was that running the loop in parallel would copy lists a and b to each processor, then dispatch item_n to one CPU and item_n+1 to the other, execute the function, and write the results back to a list (in order), then grab the next 2 items, and so on. I'm obviously missing something.
Is this a poor example or use of joblib? Did I simply structure the code wrong?
Here is the example:
import numpy as np
from matplotlib.path import Path
from joblib import Parallel, delayed

## Create pairs of points for line segments
a = zip(np.random.rand(5000,2), np.random.rand(5000,2))
b = zip(np.random.rand(300,2), np.random.rand(300,2))

## Check if one line segment contains another.
def check_paths(path, paths):
    for other_path in paths:
        res = 'no cross'
        chck = Path(other_path)
        if chck.contains_path(path) == 1:
            res = 'cross'
            break
    return res

res = Parallel(n_jobs=2)(delayed(check_paths)(Path(points), a) for points in b)
Accepted answer by Nabla
In short: I cannot reproduce your problem. If you are on Windows you should use a guard around your main loop (see the documentation of joblib.Parallel). The only problem I see is a lot of data-copying overhead, but your numbers seem too extreme to be caused by that alone.
In detail, here are my timings with your code:
On my i7 3770k (4 cores, 8 threads) I get the following results for different n_jobs:
For-loop: Finished in 33.8521318436 sec
n_jobs=1: Finished in 33.5527760983 sec
n_jobs=2: Finished in 18.9543449879 sec
n_jobs=3: Finished in 13.4856410027 sec
n_jobs=4: Finished in 15.0832719803 sec
n_jobs=5: Finished in 14.7227740288 sec
n_jobs=6: Finished in 15.6106669903 sec
So there is a gain in using multiple processes. However, although I have four cores, the gain already saturates at three processes. So I guess the execution time is actually limited by memory access rather than processor time.
You should notice that the arguments for each single loop entry are copied to the process executing it. This means you copy a for each element in b, which is inefficient. Instead, access the global a. (Parallel will fork the process, copying all global variables to the newly spawned processes, so a is accessible.) This gives me the following code (with timing and a main-loop guard, as the documentation of joblib recommends):
import numpy as np
from matplotlib.path import Path
from joblib import Parallel, delayed
import time
import sys

## Check if one line segment contains another.
def check_paths(path):
    for other_path in a:
        res = 'no cross'
        chck = Path(other_path)
        if chck.contains_path(path) == 1:
            res = 'cross'
            break
    return res

if __name__ == '__main__':
    ## Create pairs of points for line segments
    a = zip(np.random.rand(5000,2), np.random.rand(5000,2))
    b = zip(np.random.rand(300,2), np.random.rand(300,2))

    now = time.time()
    if len(sys.argv) >= 2:
        res = Parallel(n_jobs=int(sys.argv[1]))(delayed(check_paths)(Path(points)) for points in b)
    else:
        res = [check_paths(Path(points)) for points in b]
    print "Finished in", time.time() - now, "sec"
Timing results:
n_jobs=1: Finished in 34.2845709324 sec
n_jobs=2: Finished in 16.6254048347 sec
n_jobs=3: Finished in 11.219119072 sec
n_jobs=4: Finished in 8.61683392525 sec
n_jobs=5: Finished in 8.51907801628 sec
n_jobs=6: Finished in 8.21842098236 sec
n_jobs=7: Finished in 8.21816396713 sec
n_jobs=8: Finished in 7.81841087341 sec
The saturation point has now moved slightly, to n_jobs=4, which is the value to be expected.
check_paths does several redundant calculations that can easily be eliminated. Firstly, the line Path(...) is executed for every element of other_paths=a on every call; precalculate those paths. Secondly, the string res='no cross' is assigned on every loop iteration, although it can change at most once (followed by a break and return); move that line in front of the loop. The code then looks like this:
import numpy as np
from matplotlib.path import Path
from joblib import Parallel, delayed
import time
import sys

## Check if one line segment contains another.
def check_paths(path):
    res = 'no cross'
    for other_path in a:
        if other_path.contains_path(path) == 1:
            res = 'cross'
            break
    return res

if __name__ == '__main__':
    ## Create pairs of points for line segments
    a = zip(np.random.rand(5000,2), np.random.rand(5000,2))
    a = [Path(x) for x in a]
    b = zip(np.random.rand(300,2), np.random.rand(300,2))

    now = time.time()
    if len(sys.argv) >= 2:
        res = Parallel(n_jobs=int(sys.argv[1]))(delayed(check_paths)(Path(points)) for points in b)
    else:
        res = [check_paths(Path(points)) for points in b]
    print "Finished in", time.time() - now, "sec"
with timings:
n_jobs=1: Finished in 5.33742594719 sec
n_jobs=2: Finished in 2.70858597755 sec
n_jobs=3: Finished in 1.80810618401 sec
n_jobs=4: Finished in 1.40814709663 sec
n_jobs=5: Finished in 1.50854086876 sec
n_jobs=6: Finished in 1.50901818275 sec
n_jobs=7: Finished in 1.51030707359 sec
n_jobs=8: Finished in 1.51062297821 sec
A side note on your code (although I haven't really followed its purpose, as it is unrelated to your question): contains_path will only return True "if this path completely contains the given path" (see the documentation). Therefore your function will basically always return no cross given the random input.
Answered by Gael Varoquaux
In addition to the above answer, and for future reference, there are two aspects to this question, and joblib's recent evolutions help with both.
Parallel pool creation overhead: the problem here is that creating a parallel pool is costly. It was especially costly here, as the code not protected by the "main" guard was run in each job at the creation of the Parallel object. In the latest joblib (still beta), Parallel can be used as a context manager to limit the number of times a pool is created, and thus the impact of this overhead.
Dispatching overhead: it is important to keep in mind that dispatching an item of the for loop has an overhead (much bigger than iterating a for loop without parallelism). Thus, if the individual computation items are very fast, this overhead will dominate the computation. In the latest joblib, joblib traces the execution time of each job and starts bunching jobs together if they are very fast. This strongly limits the impact of the dispatch overhead in most cases (see the PR for benchmarks and discussion).
Disclaimer: I am the original author of joblib (just saying to warn against potential conflicts of interest in my answer, although here I think that it is irrelevant).

