Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/6593531/

Date: 2020-09-18 00:18:24 | Source: igfitidea

Running a limited number of child processes in parallel in bash?

bash, parallel-processing

Asked by Niels Basjes

I have a large set of files for which some heavy processing needs to be done. This processing is single threaded, uses a few hundred MiB of RAM (on the machine used to start the job) and takes a few minutes to run. My current use case is to start a hadoop job on the input data, but I've had this same problem in other cases before.

In order to fully utilize the available CPU power I want to be able to run several of those tasks in parallel.

However, a very simple example shell script like this will trash the system performance due to excessive load and swapping:

find . -type f | while read -r name
do
   some_heavy_processing_command "${name}" &
done

So what I want is essentially similar to what "gmake -j4" does.

I know bash supports the "wait" command, but that only waits until all child processes have completed. In the past I've created scripts that run a 'ps' command and then grep the child processes out by name (yes, I know ... ugly).

What is the simplest/cleanest/best solution to do what I want?


Edit: Thanks to Frederik: yes, indeed this is a duplicate of "How to limit number of threads/sub-processes used in a function in bash". The "xargs --max-procs=4" works like a charm. (So I voted to close my own question.)

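For reference, a minimal sketch of that xargs approach. The workload function and the scratch files are stand-ins added only to make the example self-contained; in real use the find would run over your actual data and call your actual command:

```shell
#!/usr/bin/env bash

# Scratch input files so the sketch is self-contained.
workdir=$(mktemp -d)
touch "$workdir/a.dat" "$workdir/b.dat" "$workdir/c.dat" "$workdir/d.dat"
export OUT=$(mktemp)

# Stand-in for the real workload; replace with your command.
some_heavy_processing_command() { echo "processing: $1" >>"$OUT"; }
export -f some_heavy_processing_command

# Run at most 4 processes at a time; -print0/-0 keeps filenames with spaces safe.
find "$workdir" -type f -print0 |
    xargs -0 --max-procs=4 -n 1 bash -c 'some_heavy_processing_command "$1"' _
```

The `bash -c '… "$1"' _` wrapper is only needed here because the workload is a shell function; with a real executable you would pass it to xargs directly.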
Accepted answer by Dunes

#! /usr/bin/env bash

set -o monitor
# means: enable job control, so the shell is notified when background jobs finish
trap add_next_job CHLD
# execute add_next_job when we receive a child-complete signal

todo_array=($(find . -type f)) # places output into an array

index=0
max_jobs=2

function add_next_job {
    # if there are still jobs to do, add one
    # (${#todo_array[*]} is bash's way of getting the array's length)
    if [[ $index -lt ${#todo_array[*]} ]]
    then
        echo adding job "${todo_array[$index]}"
        do_job "${todo_array[$index]}" &
        # replace the line above with the command you want
        index=$(($index+1))
    fi
}

function do_job {
    echo "starting job "
    sleep 2
}

# add initial set of jobs
while [[ $index -lt $max_jobs ]]
do
    add_next_job
done

# wait for all jobs to complete
wait
echo "done"

Having said that, Fredrik makes the excellent point that xargs does exactly what you want...

Answered by BruceH

I know I'm late to the party with this answer, but I thought I would post an alternative that, IMHO, makes the body of the script cleaner and simpler. (Clearly you can change the values 2 and 5 to be appropriate for your scenario.)

function max2 {
   while [ "$(jobs | wc -l)" -ge 2 ]
   do
      sleep 5
   done
}

find . -type f | while read -r name
do
   max2; some_heavy_processing_command "${name}" &
done
wait
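As a side note, on bash 4.3 and newer, `wait -n` returns as soon as any single background job finishes, which avoids the polling sleep entirely. A sketch in the same spirit; the workload function and the fixed name list are stand-ins to keep it self-contained (in real use the names would come from find):

```shell
#!/usr/bin/env bash

out=$(mktemp)
# Stand-in for the real workload; replace with your command.
some_heavy_processing_command() { sleep 0.1; echo "done: $1" >>"$out"; }

max_jobs=2
for name in one two three four five; do  # stand-in for the find output
    # If max_jobs jobs are already running, block until one of them exits.
    while [ "$(jobs -rp | wc -l)" -ge "$max_jobs" ]; do
        wait -n  # bash 4.3+
    done
    some_heavy_processing_command "$name" &
done
wait  # wait for the remaining jobs
```

A plain for loop (rather than piping find into while) also sidesteps the subshell issue, where jobs started inside a pipeline are not visible to the final wait.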

Answered by Ole Tange

With GNU Parallel it becomes simpler:

find . -type f | parallel  some_heavy_processing_command {}

Learn more: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Answered by TrueY

I think I found a more handy solution using make:

#!/usr/bin/make -f

THIS := $(lastword $(MAKEFILE_LIST))
TARGETS := $(shell find . -name '*.sh' -type f)

.PHONY: all $(TARGETS)

all: $(TARGETS)

$(TARGETS):
        some_heavy_processing_command $@

$(THIS): ; # Avoid trying to remake this makefile

Call it e.g. 'test.mak', and add execute rights. If you call ./test.mak it will run some_heavy_processing_command one-by-one. But if you call ./test.mak -j 4, it will run four subprocesses at once. You can also use it in a more sophisticated way: run ./test.mak -j 5 -l 1.5, and it will run at most 5 sub-processes while the system load is under 1.5, but it will limit the number of processes if the system load exceeds 1.5.

It is more flexible than xargs, and make is part of the standard distribution, unlike parallel.

Answered by masseo

This code worked quite well for me.

I noticed one issue where the script couldn't end: if max_jobs is greater than the number of elements in the array, the script will never quit.

To prevent the above scenario, I've added the following right after the "max_jobs" declaration.

if [ $max_jobs -gt ${#todo_array[*]} ]
then
    # more elements in the array than max jobs; cap max_jobs at the array size
    max_jobs=${#todo_array[*]}
fi

Answered by Jeff Kaufman

Another option:

PARALLEL_MAX=...
function start_job() {
  while [ "$(ps --no-headers -o pid --ppid=$$ | wc -l)" -gt "$PARALLEL_MAX" ]; do
    sleep .1  # Wait for background tasks to complete.
  done
  "$@" &
}
start_job some_big_command1
start_job some_big_command2
start_job some_big_command3
start_job some_big_command4
...

Answered by user2709129

Here is a very good function I used to control the maximum number of jobs from bash or ksh. NOTE: the "- 1" in the pgrep pipeline subtracts the wc -l subprocess.

function jobmax
{
    typeset -i MAXJOBS=$1
    sleep .1
    while (( ($(pgrep -P $$ | wc -l) - 1) >= $MAXJOBS ))
    do
        sleep .1
    done
}

nproc=5
for i in {1..100}
do
    sleep 1 &
    jobmax $nproc
done
wait # Wait for the rest