bash: Using GNU Parallel and Split

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/15144655/

Date: 2020-09-18 04:42:09  Source: igfitidea

Using GNU Parallel With Split

Tags: bash, split, gnu-parallel

Asked by Topo

I'm loading a pretty gigantic file into a PostgreSQL database. To do this, I first use split on the file to get smaller files (30 GB each), and then I load each smaller file into the database using GNU Parallel and psql's COPY.


The problem is that it takes about 7 hours to split the file, and only then does it start loading one file per core. What I need is a way to tell split to print each file name to standard output as soon as it finishes writing that file, so I can pipe the names to Parallel and have it start loading each file the moment split finishes writing it. Something like this:


split -l 50000000 2011.psv carga/2011_ | parallel ./carga_postgres.sh {}

I have read the split man pages and I can't find anything. Is there a way to do this with split or any other tool?


Answered by Thor

You could let parallel do the splitting:


<2011.psv parallel --pipe -N 50000000 ./carga_postgres.sh

Note that the man page recommends using --block over -N; this will still split the input at record separators (\n by default), e.g.:


<2011.psv parallel --pipe --block 250M ./carga_postgres.sh

Testing --pipe and -N


Here's a test that splits a sequence of 100 numbers into 5 files:


seq 100 | parallel --pipe -N23 'cat > /tmp/parallel_test_{#}'

Check result:


wc -l /tmp/parallel_test_[1-5]

Output:


 23 /tmp/parallel_test_1
 23 /tmp/parallel_test_2
 23 /tmp/parallel_test_3
 23 /tmp/parallel_test_4
  8 /tmp/parallel_test_5
100 total

Answered by Olaf Dietsche

If you use GNU split, you can do this with the --filter option:


'--filter=command'
With this option, rather than simply writing to each output file, write through a pipe to the specified shell command for each output file. command should use the $FILE environment variable, which is set to a different output file name for each invocation of the command.

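Scaled down to toy input, the $FILE mechanism can be seen directly (the 3-line chunk size, the part_ prefix, and discarding the chunk data are all illustrative choices for this sketch):

```shell
#!/bin/sh
# Sketch of split --filter (GNU coreutils): split 10 input lines into
# 3-line chunks. For each chunk, split runs the filter command with
# $FILE set to that chunk's output name and pipes the chunk's data to
# the command's stdin; here the data is discarded and only the name
# is printed, one per finished chunk.
seq 10 | split -l 3 --filter='cat >/dev/null; echo "$FILE"' - part_
# prints part_aa, part_ab, part_ac, part_ad (default suffixes)
```

This is exactly the hook the question asks for: the names appear on standard output one by one, in the order the chunks are completed.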

You can create a shell script which writes the file and then, at the end, starts carga_postgres.sh in the background:


#! /bin/sh
# Invoked by split once per chunk: split pipes the chunk to stdin
# and sets $FILE to that chunk's output file name.
cat >"$FILE"
./carga_postgres.sh "$FILE" &

and use that script as the filter:

并使用该脚本作为过滤器

split -l 50000000 --filter=./filter.sh 2011.psv
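A small dry run of the same idea (the 4-line chunks, the temp directory, and wc -l standing in for carga_postgres.sh are assumptions for the demo; nothing here touches a real database):

```shell
#!/bin/sh
# Dry run of the --filter approach on toy data: each 4-line chunk is
# written to its own file, then a per-chunk command (wc -l here, in
# place of carga_postgres.sh) runs as soon as that chunk is complete.
tmp=$(mktemp -d)
seq 10 | split -l 4 --filter='cat >"$FILE"; wc -l <"$FILE"' - "$tmp/part_"
rm -r "$tmp"
# prints 4, 4, 2 (GNU wc): two full chunks and one 2-line remainder
```

Because the per-chunk command runs the moment a chunk is finished, the loads overlap with the remainder of the split instead of waiting for it; the trailing & in the filter script above is what lets several loads run concurrently.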