如何在 bash 限制进程数中并行化 for 循环

Disclaimer: this page is an English/Chinese side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/38774355/

Date: 2020-09-18 14:59:02  Source: igfitidea

How to parallelize for-loop in bash limiting number of processes

Tags: bash, for-loop, parallel-processing

Asked by strathallan

I have a bash script similar to:


NUM_PROCS=
NUM_ITERS=

for ((i=0; i<$NUM_ITERS; i++)); do
    python foo.py $i arg2 &
done

What's the most straightforward way to limit the number of parallel processes to NUM_PROCS? I'm looking for a solution that doesn't require packages/installations/modules (like GNU Parallel) if possible.


When I tried Charles Duffy's latest approach, I got the following error from bash -x:


+ python run.py args 1
+ python run.py ... 3
+ python run.py ... 4
+ python run.py ... 2
+ read -r line
+ python run.py ... 1
+ read -r line
+ python run.py ... 4
+ read -r line
+ python run.py ... 2
+ read -r line
+ python run.py ... 3
+ read -r line
+ python run.py ... 0
+ read -r line

... continuing with other numbers between 0 and 5, until too many processes were started for the system to handle and the bash script was shut down.


Answered by chepner

bash 4.4 will have an interesting new type of parameter expansion that simplifies Charles Duffy's answer.


#!/bin/bash

num_procs=
num_iters=
num_jobs="\j"  # The prompt escape for number of jobs currently running
for ((i=0; i<num_iters; i++)); do
  while (( ${num_jobs@P} >= num_procs )); do
    wait -n
  done
  python foo.py "$i" arg2 &
done
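As a standalone illustration of the ${parameter@P} transformation used above (my own sketch, not part of the answer; assumes bash >= 4.4, and the sleeps are stand-in jobs), the prompt escape \j expands to the number of jobs the shell is currently managing:

```shell
#!/usr/bin/env bash
# Sketch (assumes bash >= 4.4): ${parameter@P} expands a value as if it
# were a prompt string, so "\j" becomes the current job count.
sleep 2 & sleep 2 &      # two stand-in background jobs
num_jobs="\j"
echo "jobs running: ${num_jobs@P}"
wait                     # reap both sleeps before exiting
```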

Answered by Charles Duffy

As a very simple implementation, depending on a version of bash new enough to have wait -n (to wait until just the next job exits, as opposed to waiting for all jobs):


#!/bin/bash
#      ^^^^ - NOT /bin/sh!

num_procs=
num_iters=

declare -A pids=( )

for ((i=0; i<num_iters; i++)); do
  while (( ${#pids[@]} >= num_procs )); do
    wait -n
    for pid in "${!pids[@]}"; do
      kill -0 "$pid" &>/dev/null || unset "pids[$pid]"
    done
  done
  python foo.py "$i" arg2 & pids["$!"]=1
done

If running on a shell without wait -n, one can (very inefficiently) replace it with a command such as sleep 0.2, to poll every 1/5th of a second.

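A hedged sketch of that polling variant (not from the answer; the num_procs/num_iters values and the sleep/echo worker are placeholders for the real python call):

```shell
#!/usr/bin/env bash
# Polling fallback for shells without wait -n: count running jobs
# with "jobs -pr" and sleep briefly whenever the limit is reached.
num_procs=4    # example limit
num_iters=10   # example iteration count
for ((i=0; i<num_iters; i++)); do
  while (( $(jobs -pr | wc -l) >= num_procs )); do
    sleep 0.2  # poll every 1/5th of a second
  done
  { sleep 0.1; echo "finished $i"; } &   # stand-in worker
done
wait           # wait for the remaining workers
```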



Since you're actually reading input from a file, another approach is to start N subprocesses, each of which processes only the lines where (linenum % N == threadnum):


num_procs=
infile=
for ((i=0; i<num_procs; i++)); do
  (
    while read -r line; do
      echo "Thread $i: processing $line"
    done < <(awk -v num_procs="$num_procs" -v i="$i" \
                 'NR % num_procs == i { print }' <"$infile")
  ) &
done
wait # wait for all the $num_procs subprocesses to finish
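The awk selector can be tried on its own to see which lines a given stripe receives (a toy input, my own example rather than part of the answer):

```shell
# With num_procs=3, stripe i=1 gets input lines 1, 4, 7, ...
printf '%s\n' a b c d e f |
  awk -v num_procs=3 -v i=1 'NR % num_procs == i { print }'
# prints "a" and "d", one per line
```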

Answered by that other guy

GNU, macOS/OSX, FreeBSD and NetBSD can all do this with xargs -P, with no particular bash version or package installs required. Here's an example running 4 processes at a time:


printf "%s\0" {1..10} | xargs -0 -I @ -P 4 python foo.py @ arg2
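The same idea also works with newline-delimited input from seq; here echo stands in for the python command (my own variant, not from the answer):

```shell
# Run at most 4 processes at a time over the numbers 0..9.
seq 0 9 | xargs -I @ -P 4 echo "processing @"
```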

Answered by Ole Tange

Are you aware that if you are allowed to write and run your own scripts, then you can also use GNU Parallel? In essence it is a Perl script in one single file.


From the README:


= Minimal installation =

If you just need parallel and do not have 'make' installed (maybe the system is old or Microsoft Windows):

wget http://git.savannah.gnu.org/cgit/parallel.git/plain/src/parallel
chmod 755 parallel
cp parallel sem
mv parallel sem dir-in-your-$PATH/bin/


seq  | parallel -j python foo.py {} arg2
parallel --embed >newscript

parallel --embed (available since 20180322) even makes it possible to distribute GNU Parallel as part of a shell script (i.e. no extra files needed):



Then edit the end of newscript.


Answered by rtx13

A relatively simple way to accomplish this requires only two additional lines of code. The explanation is inline.


NUM_PROCS=
NUM_ITERS=

for ((i=0; i<$NUM_ITERS; i++)); do
    python foo.py $i arg2 &
    let 'i>=NUM_PROCS' && wait -n # wait for one process at a time once we've spawned $NUM_PROCS workers
done
wait # wait for all remaining workers