bash: Parallelize a while loop with arrays read from a file in bash

Question

Asked by Einar

I have a while loop in Bash handled like this:

while IFS=$'\t' read -r -a line
do
    myprogram "${line[0]}" "${line[1]}" "${line[0]}_vs_${line[1]}.result"
done < fileinput

It reads from a file with this structure, for reference:

foo   bar
baz   foobar

and so on (tab-delimited).

I would like to parallelize this loop with GNU parallel, since there are many entries and processing each one can be slow. However, the examples do not make clear how to assign the fields of each line to an array, as I do here.

What would be a possible solution (alternatives to GNU parallel work as well)?

Answer 1

Accepted answer by Ole Tange

From https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Use-a-table-as-input:

"""
Content of table_file.tsv:

foo<TAB>bar
baz <TAB> quux

To run:

cmd -o bar -i foo
cmd -o quux -i baz

you can run:

parallel -a table_file.tsv --colsep '\t' cmd -o {2} -i {1}

"""

So in your case it will be:

cat fileinput | parallel --colsep '\t' myprogram {1} {2} {1}_vs_{2}.result

Answer 2

Answered by Hubbitus

I like @chepner's hack. It also doesn't seem that tricky to get similar behaviour while limiting the number of parallel executions:

while IFS=$'\t' read -r f1 f2;
do
    myprogram "$f1" "$f2" "${f1}_vs_${f2}.result" &

    # Once as many jobs as there are CPU cores have been started, wait for all of them
    [ $( jobs | wc -l ) -ge $( nproc ) ] && wait
done < fileinput

wait

This caps the number of concurrent jobs at the number of CPU cores on the system. You can easily change that by replacing $( nproc ) with the desired number.

Note, however, that this is not a fair distribution of work: it does not start a new job as soon as one finishes. Instead, once the maximum number of jobs has been started, it waits for all of them to finish before starting more. So overall throughput may be somewhat lower than with parallel, especially if the run time of your program varies widely. If each invocation takes about the same time, the total time should be roughly equivalent.
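
If the batch-at-a-time behaviour matters, one possible refinement is to refill a slot as soon as any single job exits, using `wait -n`. The following is only a sketch, assuming bash 4.3 or later (where `wait -n` is available) and `nproc` from GNU coreutils. The `myprogram` function, the scratch directory, and the sample `fileinput` are hypothetical stand-ins so that the example runs on its own.

```shell
#!/usr/bin/env bash
# Sketch: keep at most max_jobs running and start the next job as soon as
# any one of them exits. Assumes bash 4.3+ for `wait -n`.

# Hypothetical stand-in for the real myprogram (an assumption for this demo).
myprogram() {
    printf '%s %s\n' "$1" "$2" > "$3"
}

# Work in a scratch directory with sample tab-delimited input.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'foo\tbar\nbaz\tfoobar\n' > fileinput

max_jobs=$(nproc)

while IFS=$'\t' read -r f1 f2; do
    # While every slot is busy, block until any one background job finishes.
    while (( $(jobs -rp | wc -l) >= max_jobs )); do
        wait -n
    done
    myprogram "$f1" "$f2" "${f1}_vs_${f2}.result" &
done < fileinput

# Wait for whatever is still running.
wait
```

Compared with the loop above, the only change is that the inner loop waits for one job rather than for all of them, so a single slow invocation no longer holds up the whole batch.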

Answer 3

Answered by chepner

parallel isn't strictly necessary here; just start all the processes in the background, then wait for them to complete. The array is also unnecessary, as you can give read more than one variable to populate:

while IFS=$'\t' read -r f1 f2;
do
    myprogram "$f1" "$f2" "${f1}_vs_${f2}.result" &
done < fileinput
wait

This does start a single job for every item in your list, all at once, whereas parallel can limit the number of jobs running at once. You can accomplish the same in bash, but it's tricky.
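
As one alternative that caps concurrency without GNU parallel, `xargs` can do the scheduling. This is a sketch under stated assumptions, not a drop-in replacement: `-P` is not part of POSIX (GNU and BSD `xargs` both provide it), and `xargs` splits on any whitespace and interprets quotes and backslashes, so it only suits input whose fields contain none of those. The `myprogram` function, the limit of 4, and the sample `fileinput` are hypothetical stand-ins.

```shell
#!/usr/bin/env bash
# Sketch: let xargs run at most 4 invocations at a time.
# Assumes fields contain no spaces, quotes, or backslashes.

# Hypothetical stand-in for the real myprogram (an assumption for this demo).
myprogram() {
    printf '%s %s\n' "$1" "$2" > "$3"
}
# Export the function so the child bash started by xargs can see it.
# A real executable on PATH would not need this step.
export -f myprogram

# Work in a scratch directory with sample tab-delimited input.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'foo\tbar\nbaz\tfoobar\n' > fileinput

# -n 2: hand two fields (one input line) to each invocation.
# -P 4: run up to four invocations concurrently.
# The "_" fills $0, so the two fields arrive as $1 and $2.
xargs -n 2 -P 4 bash -c 'myprogram "$1" "$2" "${1}_vs_${2}.result"' _ < fileinput
```

Unlike the plain background loop, `xargs` starts a new invocation whenever one of the running ones exits, and it returns only after all of them have finished, so no explicit `wait` is needed.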
