Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/5204482/

Using COPY FROM stdin to load tables, reading input file only once

Tags: database, bash, postgresql, psycopg2

Asked by Stew

I've got a large (~60 million row) fixed width source file with ~1800 records per row.

I need to load this file into 5 different tables on an instance of Postgres 8.3.9.

My dilemma is that, because the file is so large, I'd like to read it only once.

This is straightforward enough using INSERT or COPY as normal, but I'm trying to get a load speed boost by including my COPY FROM statements in a transaction that includes a TRUNCATE--avoiding WAL logging, which is supposed to give a considerable speed boost (according to http://www.cirrusql.com/node/3). As I understand it, you can disable logging in Postgres 9.x--but I don't have that option on 8.3.9.

The script below has me reading the input file twice, which I want to avoid... any ideas on how I could accomplish this by reading the input file only once? Doesn't have to be bash--I also tried using psycopg2, but couldn't figure out how to stream file output into the COPY statement as I'm doing below. I can't COPY FROM file because I need to parse it on the fly.

#!/bin/bash

table1="copytest1"
table2="copytest2"

#note: $1 refers to the first argument used when invoking this script
#which should be the location of the file one wishes to have python
#parse and stream out into psql to be copied into the data tables

( echo 'BEGIN;'
  echo 'TRUNCATE TABLE ' ${table1} ';'
  echo 'COPY ' ${table1} ' FROM STDIN'
  echo "WITH NULL AS '';"
  cat "$1" | python2.5 ~/parse_${table1}.py
  echo '\.'
  echo 'TRUNCATE TABLE ' ${table2} ';'
  echo 'COPY ' ${table2} ' FROM STDIN'
  echo "WITH NULL AS '';"
  cat "$1" | python2.5 ~/parse_${table2}.py
  echo '\.'
  echo 'COMMIT;'
) | psql -U postgres -h chewy.somehost.com -p 5473 -d db_name

exit 0
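
For what it's worth, psycopg2 can stream parsed output into a COPY statement via cursor.copy_expert(), which accepts any file-like object that implements read(). Here is a minimal sketch for a single table; the parse_line() helper and its column slices are hypothetical stand-ins for the real parser:

import psycopg2

def parse_line(line):
    # hypothetical parser: turn one fixed-width line into a tab-separated row
    return '\t'.join([line[0:10].strip(), line[10:20].strip()])

class ParsedInput(object):
    """File-like wrapper that parses the source on the fly for copy_expert()."""
    def __init__(self, source):
        self.source = source
        self.buf = ''

    def read(self, size=-1):
        # refill the buffer with parsed rows until the request can be satisfied
        while size < 0 or len(self.buf) < size:
            line = self.source.readline()
            if not line:
                break
            self.buf += parse_line(line) + '\n'
        if size < 0:
            data, self.buf = self.buf, ''
        else:
            data, self.buf = self.buf[:size], self.buf[size:]
        return data

    readline = read  # COPY FROM STDIN only ever calls read()

conn = psycopg2.connect("dbname=db_name user=postgres host=chewy.somehost.com port=5473")
cur = conn.cursor()
cur.execute("TRUNCATE TABLE copytest1")  # same transaction as the COPY below
cur.copy_expert("COPY copytest1 FROM STDIN WITH NULL AS ''",
                ParsedInput(open('bigfile.txt', 'r')))
conn.commit()

This streams the source through the parser and into COPY in one pass, but only for one table; feeding several tables from a single read still needs something like the named-pipe approach in the accepted answer below.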

Thanks!

Accepted answer by johnbaum

You could use named pipes instead of your anonymous pipe. With this concept, your Python script could fill the tables through different psql processes with the corresponding data.

Create pipes:

mkfifo fifo_table1
mkfifo fifo_table2

Run psql instances:

psql db_name < fifo_table1 &
psql db_name < fifo_table2 &

Your Python script would then look something like this (pseudocode):

SQL_BEGIN = """
BEGIN;
TRUNCATE TABLE %s;
COPY %s FROM STDIN WITH NULL AS '';
"""

fifo1 = open('fifo_table1', 'w')
fifo2 = open('fifo_table2', 'w')

bigfile = open('mybigfile', 'r')

# ugly; with Python 2.6 you could use the str.format() syntax
print >> fifo1, SQL_BEGIN % ('table1', 'table1')
print >> fifo2, SQL_BEGIN % ('table2', 'table2')

for line in bigfile:
    data = parse(line)           # your code, which decides where the data belongs
    if belongs_to_table1(data):  # placeholder predicate
        print >> fifo1, data
    else:
        print >> fifo2, data

print >> fifo1, '\\.'            # terminate each COPY stream before committing
print >> fifo2, '\\.'
print >> fifo1, 'COMMIT;'
print >> fifo2, 'COMMIT;'

fifo1.close()
fifo2.close()

Maybe this is not the most elegant solution, but it should work.

Answer by a_horse_with_no_name

Why use COPY for the second table? I would assume that doing a:

INSERT INTO table2 (...)
SELECT ...
FROM table1;

would be faster than using COPY.

Edit
If you need to import different rows into different tables from the same source file, it may be faster to insert everything into a staging table first and then insert the rows from there into the target tables:

Import the whole text file into one staging table:

COPY staging_table FROM STDIN ...;

After that step, the whole input file is in staging_table.

Then copy the rows from the staging table to the individual target tables by selecting only those that qualify for the corresponding table:

INSERT INTO table_1 (...)
SELECT ...
FROM staging_table
WHERE (conditions for table_1);

INSERT INTO table_2 (...)
SELECT ...
FROM staging_table
WHERE (conditions for table_2);

This is of course only feasible if you have enough space in your database to keep the staging table around.

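A minimal psycopg2 sketch of this staging approach, assuming hypothetical column names and a record_type column that decides where each row belongs:

import psycopg2

conn = psycopg2.connect("dbname=db_name user=postgres host=chewy.somehost.com port=5473")
cur = conn.cursor()

# single pass over the source file: load everything into the staging table
cur.copy_expert("COPY staging_table FROM STDIN WITH NULL AS ''",
                open('mybigfile', 'r'))

# then route the rows into the target tables inside the same transaction
cur.execute("""INSERT INTO table_1 (col_a, col_b)
               SELECT col_a, col_b FROM staging_table
               WHERE record_type = '1'""")
cur.execute("""INSERT INTO table_2 (col_a, col_c)
               SELECT col_a, col_c FROM staging_table
               WHERE record_type = '2'""")

cur.execute("TRUNCATE TABLE staging_table")  # reclaim the space afterwards
conn.commit()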