bash: Using GNU Parallel With Split
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
原文地址: http://stackoverflow.com/questions/15144655/
Using GNU Parallel With Split
Asked by Topo
I'm loading a pretty gigantic file into a PostgreSQL database. To do this, I first use split on the file to get smaller files (30 GB each), and then I load each smaller file into the database using GNU parallel and psql copy.
The problem is that it takes about 7 hours to split the file, and only then does it start loading one file per core. What I need is a way to tell split to print each file name to standard output as soon as it finishes writing that file, so I can pipe it to parallel and have it start loading files as split finishes writing them. Something like this:
split -l 50000000 2011.psv carga/2011_ | parallel ./carga_postgres.sh {}
I have read the split man pages and I can't find anything. Is there a way to do this with split or any other tool?
Answered by Thor
You could let parallel do the splitting:
<2011.psv parallel --pipe -N 50000000 ./carga_postgres.sh
Note that the man page recommends using --block over -N; this will still split the input at record separators (\n by default), e.g.:
<2011.psv parallel --pipe --block 250M ./carga_postgres.sh
Testing --pipe and -N
Here's a test that splits a sequence of 100 numbers into 5 files:
seq 100 | parallel --pipe -N23 'cat > /tmp/parallel_test_{#}'
Check result:
wc -l /tmp/parallel_test_[1-5]
Output:
23 /tmp/parallel_test_1
23 /tmp/parallel_test_2
23 /tmp/parallel_test_3
23 /tmp/parallel_test_4
8 /tmp/parallel_test_5
100 total
Answered by Olaf Dietsche
If you use GNU split, you can do this with the --filter option:
'--filter=command'
With this option, rather than simply writing to each output file, write through a pipe to the specified shell command for each output file. command should use the $FILE environment variable, which is set to a different output file name for each invocation of the command.
You can create a shell script that writes each chunk to a file and then starts carga_postgres.sh in the background:
#! /bin/sh
# filter.sh: run by split once per chunk, with $FILE set to the chunk's
# output name. Write the chunk to disk, then start the loader in the
# background so split can go on writing the next chunk.
cat >"$FILE"
./carga_postgres.sh "$FILE" &
and use that script as the filter
split -l 50000000 --filter=./filter.sh 2011.psv
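A quick way to see --filter in action without the database step (the paths and the gzip stand-in below are illustrative): each chunk is streamed through the filter command as split produces it, with $FILE holding the name the chunk would have had. Note the single quotes, which keep $FILE from being expanded by your interactive shell instead of by split:

```shell
# Split 6 lines into 2-line chunks and compress each one on the fly;
# split runs the filter once per chunk with $FILE set (chunk_aa, chunk_ab, ...).
mkdir -p /tmp/filter_demo
seq 6 | split -l 2 --filter='gzip > $FILE.gz' - /tmp/filter_demo/chunk_
ls /tmp/filter_demo
# lists chunk_aa.gz, chunk_ab.gz and chunk_ac.gz
```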

