Disclaimer: this page is an English rendering of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original source: http://stackoverflow.com/questions/38160/

Parallelize Bash script with maximum number of processes

bash

Asked by thelsdj

Let's say I have a loop in Bash:

for foo in `some-command`
do
   do-something $foo
done

do-something is CPU-bound and I have a nice shiny 4-core processor. I'd like to be able to run up to 4 do-somethings at once.

The naive approach seems to be:

for foo in `some-command`
do
   do-something $foo &
done

This will run all do-somethings at once, but there are a couple of downsides. Mainly, do-something may also have some significant I/O, and performing it all at once might slow things down. The other problem is that this code block returns immediately, so there's no way to do other work once all the do-somethings are finished.

How would you write this loop so there are always X do-somethings running at once?

Accepted answer by Fritz G. Mehner

Depending on what you want to do, xargs can also help (here: converting documents with pdf2ps):

cpus=$( ls -d /sys/devices/system/cpu/cpu[[:digit:]]* | wc -w )   # count CPU cores via sysfs

find . -name \*.pdf | xargs --max-args=1 --max-procs=$cpus  pdf2ps

From the docs:

--max-procs=max-procs
-P max-procs
       Run up to max-procs processes at a time; the default is 1.
       If max-procs is 0, xargs will run as many processes as  possible  at  a
       time.  Use the -n option with -P; otherwise chances are that only one
       exec will be done.
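
(As a side note: on systems with GNU coreutils, nproc reports the same core count more directly. A minimal sketch of the equivalent pipeline using xargs' short options, assuming GNU xargs and coreutils are installed:)

find . -name \*.pdf | xargs -n 1 -P "$(nproc)" pdf2ps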

Answer by Ole Tange

With GNU Parallel (http://www.gnu.org/software/parallel/) you can write:

some-command | parallel do-something

GNU Parallel also supports running jobs on remote computers. This will run one job per CPU core on the remote computers, even if they have different numbers of cores:

some-command | parallel -S server1,server2 do-something

A more advanced example: here we have a list of files that we want my_script to run on. The files have an extension (maybe .jpeg). We want the output of my_script to be put next to the files as basename.out (e.g. foo.jpeg -> foo.out). We want to run my_script once for each core the computer has, and we want to run it on the local computer, too. For the remote computers we want the file to be processed transferred to the given computer. When my_script finishes, we want foo.out transferred back, and we then want foo.jpeg and foo.out removed from the remote computer:

cat list_of_files | \
parallel --trc {.}.out -S server1,server2,: \
"my_script {} > {.}.out"

GNU Parallel makes sure the output from each job does not mix, so you can use the output as input for another program:

some-command | parallel do-something | postprocess

See the videos for more examples: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Answer by bstark

maxjobs=4
parallelize () {
        while [ $# -gt 0 ] ; do
                jobcnt=($(jobs -p))                      # PIDs of currently running background jobs
                if [ ${#jobcnt[@]} -lt $maxjobs ] ; then
                        do-something "$1" &              # launch the job for the current argument
                        shift
                else
                        sleep 1
                fi
        done
        wait
}

parallelize arg1 arg2 "5 args to third job" arg4 ...

Answer by skolima

Instead of plain bash, use a Makefile, then specify the number of simultaneous jobs with make -jX, where X is the number of jobs to run at once.

Or you can use wait ("man wait"): launch several child processes, then call wait; it will return when the child processes finish.

maxjobs=10
jobsrunning=0

job () {
        # ... the real work for one line goes here ...
        :
}

while read -r line; do
        job "$line" &
        jobsrunning=$((jobsrunning + 1))
        if [ "$jobsrunning" -ge "$maxjobs" ]; then
                wait                    # wait for the whole batch to finish
                jobsrunning=0
        fi
done < file.txt
wait                                    # catch any jobs left in the final batch

If you need to store the jobs' results, assign each result to a variable. After wait, you just check what the variable contains.

Answer by Grumbel

Here is an alternative solution that can be inserted into .bashrc and used as an everyday one-liner:

function pwait() {
    while [ $(jobs -p | wc -l) -ge "$1" ]; do
        sleep 1
    done
}

To use it, all one has to do is put & after the jobs and a pwait call; the parameter gives the number of parallel processes:

for i in *; do
    do_something "$i" &
    pwait 10
done

It would be nicer to use wait instead of busy-waiting on the output of jobs -p, but there doesn't seem to be an obvious way to wait until any one of the given jobs is finished instead of all of them.

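(Side note: since this answer was written, bash 4.3 added wait -n, which blocks until any single background job finishes. A minimal sketch of pwait without the busy-wait, assuming bash 4.3 or newer:)

function pwait() {
    # block while at least $1 jobs are running; wait -n returns
    # as soon as any one of them exits (bash 4.3+)
    while [ $(jobs -p | wc -l) -ge "$1" ]; do
        wait -n
    done
}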

Answer by tessein

Maybe try a parallelizing utility instead of rewriting the loop? I'm a big fan of xjobs. I use xjobs all the time to mass-copy files across our network, usually when setting up a new database server. http://www.maier-komor.de/xjobs.html

Answer by lhunath

While doing this right in bash is probably impossible, you can do it semi-right fairly easily. bstark gave a fair approximation of right, but his approach has the following flaws:

  • Word splitting: You can't pass any jobs to it that use any of the following characters in their arguments: spaces, tabs, newlines, stars, question marks. If you do, things will break, possibly unexpectedly.
  • It relies on the rest of your script not to background anything. If you do, or if you later add something to the script that gets sent to the background (having forgotten that backgrounded jobs aren't allowed because of his snippet), things will break.

Another approximation which doesn't have these flaws is the following:

scheduleAll() {
    local job i=0 max=4 pids=()

    for job; do
        (( ++i % max == 0 )) && {
            wait "${pids[@]}"
            pids=()
        }

        bash -c "$job" & pids+=("$!")
    done

    wait "${pids[@]}"
}

Note that this one is easily adaptable to also check the exit code of each job as it ends, so you can warn the user if a job fails, or set an exit code for scheduleAll according to the number of jobs that failed, or something.

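A hedged sketch of that adaptation, waiting on each PID individually so each job's exit status can be inspected (the failed counter and the per-PID loops are illustrative additions, not part of the original answer):

scheduleAll() {
    local job i=0 max=4 failed=0 pid pids=()

    for job; do
        (( ++i % max == 0 )) && {
            for pid in "${pids[@]}"; do
                wait "$pid" || (( ++failed ))    # collect this job's exit status
            done
            pids=()
        }

        bash -c "$job" & pids+=("$!")
    done

    for pid in "${pids[@]}"; do
        wait "$pid" || (( ++failed ))
    done

    return "$failed"    # 0 only if every job succeeded (the shell caps this at 255)
}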

The problem with this code is just that:

  • It schedules four (in this case) jobs at a time and then waits for all four to end. Some might be done sooner than others, which will cause the next batch of four jobs to wait until the longest-running job of the previous batch is done.

A solution that takes care of this last issue would have to use kill -0 to poll whether any of the processes have disappeared, instead of wait, and schedule the next job as soon as one does. However, that introduces a small new problem: there is a race condition between a job ending and kill -0 checking whether it has ended. If the job ends and another process on your system starts up at the same time, taking a random PID which happens to be that of the job that just finished, kill -0 won't notice your job having finished and things will break again.

A perfect solution isn't possible in bash.

Answer by Idelic

If you're familiar with the make command, most of the time you can express the list of commands you want to run as a makefile. For example, if you need to run $SOME_COMMAND on files *.input, each of which produces *.output, you can use the makefile

INPUT  = a.input b.input
OUTPUT = $(INPUT:.input=.output)

# note: the recipe line below must be indented with a real tab character
%.output : %.input
	$(SOME_COMMAND) $< $@

all: $(OUTPUT)

and then just run

make -j<NUMBER>

to run at most NUMBER commands in parallel.

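For instance, with a hypothetical command substituted on the command line (make lets you override makefile variables this way):

make -j4 SOME_COMMAND=pdf2ps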

Answer by ilnar

A function for bash:

parallel ()
{
    awk "BEGIN{print \"all: ALL_TARGETS\n\"}{print \"TARGET_\"NR\":\n\t@-\"$0\"\n\"}END{printf \"ALL_TARGETS:\";for(i=1;i<=NR;i++){printf \" TARGET_%d\",i};print\"\n\"}" | make $@ -f - all
}
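
For reference, the awk program builds a Makefile on the fly: each input line becomes a numbered target whose recipe is the command itself (@- suppresses echoing and ignores failures), and make then runs the targets with whatever flags were passed (e.g. -j 4). Given two input commands (first-command and second-command stand for whatever lines were piped in), the generated Makefile would look roughly like:

all: ALL_TARGETS

TARGET_1:
	@-first-command

TARGET_2:
	@-second-command

ALL_TARGETS: TARGET_1 TARGET_2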

using:

cat my_commands | parallel -j 4

Answer by Jon Ericson

The project I work on uses the wait command to control parallel shell (ksh, actually) processes. To address your concerns about I/O: on a modern OS, it's possible that parallel execution will actually increase efficiency. If all processes are reading the same blocks on disk, only the first process has to hit the physical hardware. The other processes will often be able to retrieve the block from the OS's disk cache in memory. Obviously, reading from memory is several orders of magnitude quicker than reading from disk. Also, the benefit requires no coding changes.
