Original question: http://stackoverflow.com/questions/16591290/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverflow
Parallelizing a while loop with arrays read from a file in bash
Asked by Einar
I have a while loop in Bash handled like this:
while IFS=$'\t' read -r -a line; do
    myprogram "${line[0]}" "${line[1]}" "${line[0]}_vs_${line[1]}.result"
done < fileinput
It reads from a file with this structure, for reference:
foo bar
baz foobar
and so on (tab-delimited).
I would like to parallelize this loop with GNU parallel, since there are many entries and processing can be slow. However, the examples do not make clear how I would assign each line to an array, as I do here.
What would be a possible solution (alternatives to GNU parallel work as well)?
Accepted answer by Ole Tange
From https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Use-a-table-as-input:
Content of table_file.tsv:
foo<TAB>bar
baz<TAB>quux
To run:
cmd -o bar -i foo
cmd -o quux -i baz
you can run:
parallel -a table_file.tsv --colsep '\t' cmd -o {2} -i {1}
So in your case it will be:
cat fileinput | parallel --colsep '\t' myprogram {1} {2} {1}_vs_{2}.result
Answer by Hubbitus
I like @chepner's hack. And it does not seem too tricky to get similar behaviour while limiting the number of parallel executions:
while IFS=$'\t' read -r f1 f2; do
    myprogram "$f1" "$f2" "${f1}_vs_${f2}.result" &
    # Once as many jobs as there are CPU cores have been started, wait for all of them
    [ $( jobs | wc -l ) -ge $( nproc ) ] && wait
done < fileinput
wait
This limits execution to at most the number of CPU cores present on the system. You can easily change that by replacing $( nproc ) with the desired number.
Keep in mind that this is not a fair distribution of work. It does not start a new job as soon as one finishes. Instead, after starting the maximum number of jobs, it waits for all of them to finish. So overall throughput may be somewhat lower than with parallel, especially if the run time of your program varies widely. If each invocation takes roughly the same time, the total time should be roughly equivalent.
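The batch behaviour described above can be avoided with bash's `wait -n`, which returns as soon as any one background job exits. The following is a minimal sketch, assuming bash 4.3 or newer; `run_limited` is a hypothetical helper name introduced for illustration, and the job limit is assumed to be at least 1.

```shell
#!/usr/bin/env bash
# Sketch only: run_limited is a hypothetical helper, not part of the answers.
# Usage: run_limited MAX_JOBS INPUT_FILE COMMAND [ARGS...]
# Assumes bash >= 4.3 (for `wait -n`) and MAX_JOBS >= 1.
run_limited() {
    local max_jobs=$1 input=$2
    shift 2
    local f1 f2
    while IFS=$'\t' read -r f1 f2; do
        # While every slot is busy, block until any single job exits
        while (( $(jobs -rp | wc -l) >= max_jobs )); do
            wait -n
        done
        # Start the next job in the background with the two fields
        "$@" "$f1" "$f2" "${f1}_vs_${f2}.result" &
    done < "$input"
    # Wait for the jobs that are still running
    wait
}
```

For the loop in the question this might be called as `run_limited "$(nproc)" fileinput myprogram`. Unlike the batch approach, a slot is refilled as soon as it frees up, which should help when run times vary widely.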
Answer by chepner
parallel isn't strictly necessary here; just start all the processes in the background, then wait for them to complete. The array is also unnecessary, as you can give read more than one variable to populate:
while IFS=$'\t' read -r f1 f2;
do
myprogram "$f1" "$f2" "${f1}_vs_${f2}.result" &
done < fileinput
wait
This does start a single job for every item in your list, whereas parallel can limit the number of jobs running at once. You can accomplish the same in bash, but it's tricky.
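To illustrate the point about read accepting several variables, here is a small sketch that splits one tab-delimited line both ways. The sample line and the variable names `fields`, `f1` and `f2` are chosen for illustration only.

```shell
#!/usr/bin/env bash
# Sketch: two ways to split a tab-delimited line with read.
line=$'foo\tbar'

# Array form, as in the question: fields[0] and fields[1]
IFS=$'\t' read -r -a fields <<< "$line"

# Named-variable form, as in this answer: f1 and f2
IFS=$'\t' read -r f1 f2 <<< "$line"

# Both forms build the same output file name
printf '%s\n' "${fields[0]}_vs_${fields[1]}.result"
printf '%s\n' "${f1}_vs_${f2}.result"
```

Both printf calls should print the same name, so the named variables can replace the array without changing the arguments passed to the program.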

