Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/6593531/

Date: 2020-09-18 00:18:24 | Source: igfitidea

Running a limited number of child processes in parallel in bash?

bash, parallel-processing

Asked by Niels Basjes

I have a large set of files for which some heavy processing needs to be done. This processing is single threaded, uses a few hundred MiB of RAM (on the machine used to start the job) and takes a few minutes to run. My current use case is to start a hadoop job on the input data, but I've had this same problem in other cases before.

In order to fully utilize the available CPU power I want to be able to run several of those tasks in parallel.

However, a very simple example shell script like this will trash the system performance due to excessive load and swapping:

find . -type f | while read -r name
do
   some_heavy_processing_command "${name}" &
done

So what I want is essentially similar to what "gmake -j4" does.

I know bash supports the "wait" command, but that only waits until all child processes have completed. In the past I've created scripts that run a 'ps' command and then grep the child processes out by name (yes, I know ... ugly).

What is the simplest/cleanest/best solution to do what I want?


Edit: Thanks to Frederik: yes, indeed this is a duplicate of "How to limit number of threads/sub-processes used in a function in bash". The "xargs --max-procs=4" works like a charm. (So I voted to close my own question.)

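For reference, a minimal sketch of that xargs approach. The workload function and the scratch files are stand-ins added only to make the example self-contained; in real use the find would run over your actual data and call your actual command:

```shell
#!/usr/bin/env bash

# Scratch input files so the sketch is self-contained.
workdir=$(mktemp -d)
touch "$workdir/a.dat" "$workdir/b.dat" "$workdir/c.dat" "$workdir/d.dat"
export OUT=$(mktemp)

# Stand-in for the real workload; replace with your command.
some_heavy_processing_command() { echo "processing: $1" >>"$OUT"; }
export -f some_heavy_processing_command

# Run at most 4 processes at a time; -print0/-0 keeps filenames with spaces safe.
find "$workdir" -type f -print0 |
    xargs -0 --max-procs=4 -n 1 bash -c 'some_heavy_processing_command "$1"' _
```

The `bash -c '… "$1"' _` wrapper is only needed here because the workload is a shell function; with a real executable you would pass it to xargs directly.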
Accepted answer by Dunes

#! /usr/bin/env bash

set -o monitor
# means: enable job control, so the shell is notified when background jobs finish
trap add_next_job CHLD
# execute add_next_job when we receive a child-complete signal

todo_array=($(find . -type f)) # places output into an array

index=0
max_jobs=2

function add_next_job {
    # if there are still jobs to do, add one
    # (${#todo_array[*]} is bash's way of getting the array's length)
    if [[ $index -lt ${#todo_array[*]} ]]
    then
        echo adding job "${todo_array[$index]}"
        do_job "${todo_array[$index]}" &
        # replace the line above with the command you want
        index=$(($index+1))
    fi
}

function do_job {
    echo "starting job "
    sleep 2
}

# add initial set of jobs
while [[ $index -lt $max_jobs ]]
do
    add_next_job
done

# wait for all jobs to complete
wait
echo "done"

Having said that, Fredrik makes the excellent point that xargs does exactly what you want...

Answered by BruceH

I know I'm late to the party with this answer, but I thought I would post an alternative that, IMHO, makes the body of the script cleaner and simpler. (Clearly you can change the values 2 and 5 to be appropriate for your scenario.)

function max2 {
   while [ "$(jobs | wc -l)" -ge 2 ]
   do
      sleep 5
   done
}

find . -type f | while read -r name
do
   max2; some_heavy_processing_command "${name}" &
done
wait
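As a side note, on bash 4.3 and newer, `wait -n` returns as soon as any single background job finishes, which avoids the polling sleep entirely. A sketch in the same spirit; the workload function and the fixed name list are stand-ins to keep it self-contained (in real use the names would come from find):

```shell
#!/usr/bin/env bash

out=$(mktemp)
# Stand-in for the real workload; replace with your command.
some_heavy_processing_command() { sleep 0.1; echo "done: $1" >>"$out"; }

max_jobs=2
for name in one two three four five; do  # stand-in for the find output
    # If max_jobs jobs are already running, block until one of them exits.
    while [ "$(jobs -rp | wc -l)" -ge "$max_jobs" ]; do
        wait -n  # bash 4.3+
    done
    some_heavy_processing_command "$name" &
done
wait  # wait for the remaining jobs
```

A plain for loop (rather than piping find into while) also sidesteps the subshell issue, where jobs started inside a pipeline are not visible to the final wait.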

Answered by Ole Tange

With GNU Parallel it becomes simpler:

find . -type f | parallel  some_heavy_processing_command {}

Learn more: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Answered by TrueY

I think I found a more handy solution using make:

#!/usr/bin/make -f

THIS := $(lastword $(MAKEFILE_LIST))
TARGETS := $(shell find . -name '*.sh' -type f)

.PHONY: all $(TARGETS)

all: $(TARGETS)

$(TARGETS):
        some_heavy_processing_command $@

$(THIS): ; # Avoid trying to remake this makefile

Call it e.g. 'test.mak', and add execute rights. If you call ./test.mak it will run some_heavy_processing_command one-by-one. But if you call ./test.mak -j 4, it will run four subprocesses at once. You can also use it in a more sophisticated way: run ./test.mak -j 5 -l 1.5, and it will run at most 5 sub-processes while the system load is under 1.5, but it will limit the number of processes if the system load exceeds 1.5.

It is more flexible than xargs, and make is part of the standard distribution, unlike parallel.

Answered by masseo

This code worked quite well for me.

I noticed one issue where the script couldn't end: if max_jobs is greater than the number of elements in the array, the script will never quit.

To prevent the above scenario, I've added the following right after the "max_jobs" declaration.

if [ $max_jobs -gt ${#todo_array[*]} ]
then
    # more elements in the array than max jobs; cap max_jobs at the array size
    max_jobs=${#todo_array[*]}
fi

Answered by Jeff Kaufman

Another option:

PARALLEL_MAX=...
function start_job() {
  while [ "$(ps --no-headers -o pid --ppid=$$ | wc -l)" -gt "$PARALLEL_MAX" ]; do
    sleep .1  # Wait for background tasks to complete.
  done
  "$@" &
}
start_job some_big_command1
start_job some_big_command2
start_job some_big_command3
start_job some_big_command4
...

Answered by user2709129

Here is a very good function I used to control the maximum number of jobs from bash or ksh. NOTE: the "- 1" in the pgrep pipeline subtracts the wc -l subprocess.

function jobmax
{
    typeset -i MAXJOBS=$1
    sleep .1
    while (( ($(pgrep -P $$ | wc -l) - 1) >= $MAXJOBS ))
    do
        sleep .1
    done
}

nproc=5
for i in {1..100}
do
    sleep 1 &
    jobmax $nproc
done
wait # Wait for the rest