Note: this page is a Chinese-English side-by-side translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/39186698/

Date: 2020-09-18 15:05:10  Source: igfitidea

What does the --ntasks or -n option do in SLURM?

Tags: bash, slurm

Asked by Charlie Parker

I was using SLURM to run jobs on a computing cluster and came across the --ntasks (or -n) option. I have read the documentation for it (http://slurm.schedmd.com/sbatch.html):

sbatch does not launch tasks, it requests an allocation of resources and submits a batch script. This option advises the Slurm controller that job steps run within the allocation will launch a maximum of number tasks and to provide for sufficient resources. The default is one task per node, but note that the --cpus-per-task option will change this default.

The specific part I do not understand is:

run within the allocation will launch a maximum of number tasks and to provide for sufficient resources.

I have a few questions:

  1. I guess my first question is what the word "task" means and how it differs from the word "job" in the SLURM context. I usually think of a job as running the bash script under sbatch, as in sbatch my_batch_job.sh. I am not sure what a task is.
  2. If I equate the word task with job, then I would have expected the same bash script to run multiple times, according to the argument to -n, --ntasks=<number>. However, I tested it on the cluster: I ran an echo hello script with --ntasks=9 and expected sbatch to echo hello 9 times to stdout (which is collected in slurm-job_id.out), but to my surprise there was a single execution of my echo hello script. So what does this option even do? It seems to do nothing, or at least I can't see what it is supposed to be doing.

I do know that the -a, --array=<indexes> option exists for multiple jobs. That is a different topic. I simply want to know what --ntasks is supposed to do, ideally with an example so that I can test it out on the cluster.

Answered by Alexis Lucattini

The --ntasks parameter is useful if you have commands that you want to run in parallel within the same batch script. These may be two separate commands separated by an &, or two commands joined by a bash pipe (|).

For example

Using the default ntasks=1

#!/bin/bash

#SBATCH --ntasks=1

srun sleep 10 & 
srun sleep 12 &
wait

This will produce the warning:

Job step creation temporarily disabled, retrying

The number of tasks was set to one (the default), so the second task cannot start until the first has finished. This job will finish in around 22 seconds. To break this down:

sacct -j515058 --format=JobID,Start,End,Elapsed,NCPUS

        JobID               Start                 End    Elapsed      NCPUS
------------ ------------------- ------------------- ---------- ----------
515058       2018-12-13T20:51:44 2018-12-13T20:52:06   00:00:22          1
515058.batch 2018-12-13T20:51:44 2018-12-13T20:52:06   00:00:22          1
515058.0     2018-12-13T20:51:44 2018-12-13T20:51:56   00:00:12          1
515058.1     2018-12-13T20:51:56 2018-12-13T20:52:06   00:00:10          1

Here task 0 (step 515058.0) started and finished in 12 seconds, followed by task 1 (step 515058.1) in 10 seconds, giving a total elapsed time of 22 seconds.

To run both of these commands simultaneously:

#!/bin/bash

#SBATCH --ntasks=2

srun --ntasks=1 sleep 10 & 
srun --ntasks=1 sleep 12 &
wait

Running the same sacct command as above gives:

sacct -j 515064 --format=JobID,Start,End,Elapsed,NCPUS

       JobID               Start                 End    Elapsed      NCPUS
------------ ------------------- ------------------- ---------- ----------
515064       2018-12-13T21:34:08 2018-12-13T21:34:20   00:00:12          2
515064.batch 2018-12-13T21:34:08 2018-12-13T21:34:20   00:00:12          2
515064.0     2018-12-13T21:34:08 2018-12-13T21:34:20   00:00:12          1
515064.1     2018-12-13T21:34:08 2018-12-13T21:34:18   00:00:10          1

Here the whole job takes 12 seconds. There is no risk of commands waiting for resources, because the number of tasks has been specified in the batch script and the job therefore has the resources to run this many commands at once.
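
The difference between the two timings can be reproduced without a cluster. The following is only a rough, Slurm-free sketch in plain bash, with the sleeps scaled down to 1 and 2 seconds; it mimics the waiting behaviour and says nothing about how Slurm allocates resources. The variable names are made up for illustration.

```shell
#!/bin/bash
# Scaled-down sketch: 1 s and 2 s stand in for the 10 s and 12 s sleeps.

# One command after the other, like the job that only had one task available.
t0=$SECONDS
sleep 1
sleep 2
sequential=$((SECONDS - t0))

# Both commands in the background, like the job with --ntasks=2.
t0=$SECONDS
sleep 1 &
sleep 2 &
wait    # block until both background commands have finished
concurrent=$((SECONDS - t0))

echo "sequential: ${sequential}s, concurrent: ${concurrent}s"
```

The sequential part takes roughly the sum of the two sleeps, while the concurrent part takes roughly as long as the longer one, which matches the 22 second and 12 second jobs shown by sacct.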

Each task inherits the parameters specified for the batch script. This is why --ntasks=1 needs to be specified for each srun command; otherwise each srun uses --ntasks=2, and so the second command will not run until the first has finished.

Another caveat of tasks inheriting the batch parameters arises if --export=NONE is specified as a batch parameter. In this case --export=ALL should be specified for each srun command; otherwise environment variables set within the sbatch script are not inherited by the srun command.

Additional notes:
When using bash pipes, it may be necessary to specify --nodes=1 to prevent the commands on either side of the pipe from running on separate nodes.
When using & to run commands simultaneously, the wait is vital. In this case, without the wait command, task 0 would be cancelled once task 1 had completed successfully.
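
As a small extension of the note on wait, bash can also wait on individual background commands and collect their exit statuses. This is a plain-bash sketch that runs without Slurm; run_step is a hypothetical stand-in for an "srun --ntasks=1 <command> &" line, and the durations and exit codes are arbitrary.

```shell
#!/bin/bash
# run_step is a hypothetical stand-in for "srun --ntasks=1 <command>":
# it sleeps for $1 seconds and then exits with status $2.
run_step() {
    sleep "$1"
    return "$2"
}

run_step 1 0 &
pid_a=$!            # process ID of the first background command
run_step 2 3 &
pid_b=$!            # process ID of the second background command

# "wait PID" blocks until that command finishes and returns its exit status.
status_a=0
wait "$pid_a" || status_a=$?
status_b=0
wait "$pid_b" || status_b=$?

echo "step a exited with ${status_a}, step b exited with ${status_b}"
```

Waiting on each process ID in turn keeps the script alive until every background command is done, and lets the script react if one of them failed.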

Answered by mel

The "--ntasks" option specifies how many instances of your command are executed. For a common cluster setup, and if you start your command with "srun", this corresponds to the number of MPI ranks.

In contrast, the option "--cpus-per-task" specifies how many CPUs each task can use.
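
To make the relationship between the two options concrete, here is a minimal sketch that could be placed inside a batch script. It assumes the usual Slurm environment variables SLURM_NTASKS and SLURM_CPUS_PER_TASK, which are only set when the corresponding options were given; the fallback values of 1 are an assumption made here so that the snippet also runs outside Slurm.

```shell
#!/bin/bash
# Read the task count and CPUs per task from the job environment.
# The ":-1" defaults are an assumption for running outside Slurm.
ntasks="${SLURM_NTASKS:-1}"
cpus_per_task="${SLURM_CPUS_PER_TASK:-1}"

# Number of CPUs implied by the two options together.
total_cpus=$((ntasks * cpus_per_task))

echo "${ntasks} task(s) x ${cpus_per_task} CPU(s) per task = ${total_cpus} CPU(s)"
```

For example, a job submitted with --ntasks=8 and --cpus-per-task=2 would be expected to report 16 CPUs here.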

Your output surprises me as well. Have you launched your command directly in the script or via srun? Does your script look like:

#!/bin/bash
#SBATCH --ntasks=8
## more options
echo hello

This should always output only a single line, because the batch script itself is executed only once (Slurm runs it on the first node of the allocation), not once per task.

If your script looks like:

#!/bin/bash
#SBATCH --ntasks=8
## more options
srun echo hello

srun causes the script to run your command on the worker nodes and as a result you should get 8 lines of hello.
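
To see which task printed which line, each task can report its own rank. The sketch below is a small task script meant to be launched with srun from the batch script; the file name hello_task.sh is hypothetical. It relies on SLURM_PROCID and SLURM_NTASKS, which srun sets for each task; the defaults are an assumption so that the script also runs outside Slurm.

```shell
#!/bin/bash
# hello_task.sh (hypothetical name): launch inside a batch script with
#   srun ./hello_task.sh
# SLURM_PROCID is the rank of this task and SLURM_NTASKS the total task count.
# The defaults below only apply when the script is run outside Slurm.
rank="${SLURM_PROCID:-0}"
ntasks="${SLURM_NTASKS:-1}"

message="hello from task ${rank} of ${ntasks}"
echo "$message"
```

With #SBATCH --ntasks=8, one line per task would be expected, each with a different rank; the order of the lines is not guaranteed.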
