bash: Using GNU Parallel and Split

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/15144655/

Date: 2020-09-18 04:42:09  Source: igfitidea

Using GNU Parallel With Split

Tags: bash, split, gnu-parallel

Asked by Topo

I'm loading a pretty gigantic file into a PostgreSQL database. To do this, I first use split on the file to get smaller files (30 GB each), and then I load each smaller file into the database using GNU Parallel and psql's COPY.


The problem is that it takes about 7 hours to split the file, and only then does it start loading one file per core. What I need is a way to tell split to print each file name to standard output as soon as it finishes writing that file, so I can pipe the names to Parallel and have it start loading each file the moment split finishes writing it. Something like this:


split -l 50000000 2011.psv carga/2011_ | parallel ./carga_postgres.sh {}

I have read the split man pages and I can't find anything. Is there a way to do this with split or any other tool?


Answered by Thor

You could let parallel do the splitting:


<2011.psv parallel --pipe -N 50000000 ./carga_postgres.sh

Note that the man page recommends using --block over -N; this will still split the input at record separators (\n by default), e.g.:


<2011.psv parallel --pipe --block 250M ./carga_postgres.sh

Testing --pipe and -N


Here's a test that splits a sequence of 100 numbers into 5 files:


seq 100 | parallel --pipe -N23 'cat > /tmp/parallel_test_{#}'

Check result:


wc -l /tmp/parallel_test_[1-5]

Output:


 23 /tmp/parallel_test_1
 23 /tmp/parallel_test_2
 23 /tmp/parallel_test_3
 23 /tmp/parallel_test_4
  8 /tmp/parallel_test_5
100 total

Answered by Olaf Dietsche

If you use GNU split, you can do this with the --filter option:


'--filter=command'
With this option, rather than simply writing to each output file, write through a pipe to the specified shell command for each output file. command should use the $FILE environment variable, which is set to a different output file name for each invocation of the command.

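Scaled down to toy input, the $FILE mechanism can be seen directly (the 3-line chunk size, the part_ prefix, and discarding the chunk data are all illustrative choices for this sketch):

```shell
#!/bin/sh
# Sketch of split --filter (GNU coreutils): split 10 input lines into
# 3-line chunks. For each chunk, split runs the filter command with
# $FILE set to that chunk's output name and pipes the chunk's data to
# the command's stdin; here the data is discarded and only the name
# is printed, one per finished chunk.
seq 10 | split -l 3 --filter='cat >/dev/null; echo "$FILE"' - part_
# prints part_aa, part_ab, part_ac, part_ad (default suffixes)
```

This is exactly the hook the question asks for: the names appear on standard output one by one, in the order the chunks are completed.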

You can create a shell script which writes the file and then, at the end, starts carga_postgres.sh in the background:


#! /bin/sh
# Invoked by split once per chunk: split pipes the chunk to stdin
# and sets $FILE to that chunk's output file name.
cat >"$FILE"
./carga_postgres.sh "$FILE" &

and use that script as the filter:

并使用该脚本作为过滤器

split -l 50000000 --filter=./filter.sh 2011.psv
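A small dry run of the same idea (the 4-line chunks, the temp directory, and wc -l standing in for carga_postgres.sh are assumptions for the demo; nothing here touches a real database):

```shell
#!/bin/sh
# Dry run of the --filter approach on toy data: each 4-line chunk is
# written to its own file, then a per-chunk command (wc -l here, in
# place of carga_postgres.sh) runs as soon as that chunk is complete.
tmp=$(mktemp -d)
seq 10 | split -l 4 --filter='cat >"$FILE"; wc -l <"$FILE"' - "$tmp/part_"
rm -r "$tmp"
# prints 4, 4, 2 (GNU wc): two full chunks and one 2-line remainder
```

Because the per-chunk command runs the moment a chunk is finished, the loads overlap with the remainder of the split instead of waiting for it; the trailing & in the filter script above is what lets several loads run concurrently.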