postgresql 批量插入 Postgres 的最快方法是什么?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/758945/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me):
StackOverFlow
What's the fastest way to do a bulk insert into Postgres?
提问by Ash
I need to programmatically insert 10's of millions of records into a postgres database. Presently I am executing 1000's of insert statements in a single "query".
我需要以编程方式将数千万条记录插入 postgres 数据库。目前,我正在单个“查询”中执行数千条插入语句。
Is there a better way to do this, some bulk insert statement I don't know about?
有没有更好的方法来做到这一点,一些我不知道的批量插入语句?
采纳答案by Dan Lew
PostgreSQL has a guide on how to best populate a database initially, and they suggest using the COPY command for bulk loading rows. The guide has some other good tips on how to speed up the process, like removing indexes and foreign keys before loading the data (and adding them back afterwards).
PostgreSQL 有一份关于如何最佳地进行数据库初始填充的指南,其中建议使用 COPY 命令来批量加载行。该指南还有其他一些加快这一过程的实用技巧,例如在加载数据之前删除索引和外键(之后再把它们加回来)。
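As a minimal sketch of that suggestion (reusing the films table from the example below, with a placeholder file path), a server-side bulk load with COPY might look like this; psql's \copy variant does the same thing with a client-side file:
-- bulk load rows from a CSV file; the path is a placeholder
COPY films (code, title, did, date_prod, kind)
FROM '/path/to/films.csv'
WITH (FORMAT csv, HEADER true);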
回答by Ben Harper
There is an alternative to using COPY, which is the multirow values syntax that Postgres supports. From the documentation:
除了使用 COPY 之外,还有一种替代方法,它是 Postgres 支持的多行值语法。从文档:
INSERT INTO films (code, title, did, date_prod, kind) VALUES
('B6717', 'Tampopo', 110, '1985-02-10', 'Comedy'),
('HG120', 'The Dinner Game', 140, DEFAULT, 'Comedy');
The above code inserts two rows, but you can extend it arbitrarily, until you hit the maximum number of prepared statement tokens (it might be $999, but I'm not 100% sure about that). Sometimes one cannot use COPY, and this is a worthy replacement for those situations.
上面的代码插入了两行,但您可以任意扩展它,直到达到预处理语句参数标记的最大数量(可能是 $999,但我不是 100% 确定)。有时无法使用 COPY,在这些情况下,这是一个值得考虑的替代方案。
回答by Dana the Sane
One way to speed things up is to explicitly perform multiple inserts or copies within a transaction (say 1000). Postgres's default behavior is to commit after each statement, so by batching the commits, you can avoid some overhead. As the guide in Daniel's answer says, you may have to disable autocommit for this to work. Also note the comment at the bottom that suggests increasing the size of wal_buffers to 16 MB may also help.
加快速度的一种方法是在事务中显式执行多个插入或复制(比如 1000)。Postgres 的默认行为是在每个语句之后提交,因此通过批量提交,您可以避免一些开销。正如丹尼尔回答中的指南所说,您可能必须禁用自动提交才能使其正常工作。另请注意底部的评论建议将 wal_buffers 的大小增加到 16 MB 也可能有所帮助。
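A minimal sketch of that batching idea, reusing the films table from the example above: wrap the whole batch in a single transaction so Postgres commits once rather than after every statement.
BEGIN;
INSERT INTO films (code, title, did, date_prod, kind)
VALUES ('B6717', 'Tampopo', 110, '1985-02-10', 'Comedy');
-- ... the rest of the batch, e.g. a thousand more INSERT statements ...
INSERT INTO films (code, title, did, date_prod, kind)
VALUES ('HG120', 'The Dinner Game', 140, DEFAULT, 'Comedy');
COMMIT;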
回答by ndpu
The UNNEST function with arrays can be used along with multirow VALUES syntax. I think this method is slower than using COPY, but it is useful to me when working with psycopg and Python (a Python list passed to cursor.execute becomes a pg ARRAY):
UNNEST 函数搭配数组可以与多行 VALUES 语法一起使用。我认为这种方法比使用 COPY 慢,但在配合 psycopg 和 Python 使用时对我很有用(传递给 cursor.execute 的 Python list 会变成 pg ARRAY):
INSERT INTO tablename (fieldname1, fieldname2, fieldname3)
VALUES (
UNNEST(ARRAY[1, 2, 3]),
UNNEST(ARRAY[100, 200, 300]),
UNNEST(ARRAY['a', 'b', 'c'])
);
without VALUES, using a subselect with an additional existence check:
不用 VALUES,而是使用带有额外存在性检查的子查询:
INSERT INTO tablename (fieldname1, fieldname2, fieldname3)
SELECT * FROM (
SELECT UNNEST(ARRAY[1, 2, 3]) AS fieldname1,
UNNEST(ARRAY[100, 200, 300]) AS fieldname2,
UNNEST(ARRAY['a', 'b', 'c']) AS fieldname3
) AS temptable
WHERE NOT EXISTS (
SELECT 1 FROM tablename tt
WHERE tt.fieldname1=temptable.fieldname1
);
the same syntax works for bulk updates:
同样的语法也适用于批量更新:
UPDATE tablename
SET fieldname1=temptable.data
FROM (
SELECT UNNEST(ARRAY[1,2]) AS id,
UNNEST(ARRAY['a', 'b']) AS data
) AS temptable
WHERE tablename.id=temptable.id;
回答by Mike T
You can use COPY table FROM ... WITH BINARY, which is "somewhat faster than the text and CSV formats." Only do this if you have millions of rows to insert, and if you are comfortable with binary data.
您可以使用 COPY table FROM ... WITH BINARY,它“比 text 和 CSV 格式快一些”。仅当您有数百万行要插入,并且您能熟练处理二进制数据时才这样做。
Here is an example recipe in Python, using psycopg2 with binary input.
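For reference, a rough sketch of the SQL side (the table name and file path are placeholders, and the file must already be in PostgreSQL's binary COPY format, e.g. written earlier by COPY ... TO ... WITH (FORMAT binary)):
-- load a file that was written in PostgreSQL's binary COPY format
COPY films FROM '/path/to/films.bin' WITH (FORMAT binary);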
回答by wildplasser
It mostly depends on the (other) activity in the database. Operations like this effectively freeze the entire database for other sessions. Another consideration is the data model and the presence of constraints, triggers, etc.
它主要取决于数据库中的(其他)活动。像这样的操作有效地冻结了其他会话的整个数据库。另一个考虑因素是数据模型以及约束、触发器等的存在。
My first approach is always: create a (temp) table with a structure similar to the target table (create table tmp AS select * from target where 1=0), and start by reading the file into the temp table. Then I check what can be checked: duplicates, keys that already exist in the target, etc.
我的第一种方法总是:创建一个结构类似于目标表的(临时)表(create table tmp AS select * from target where 1=0),然后首先将文件读入临时表。然后我检查可以检查的内容:重复项、目标中已经存在的键等。
Then I just do a "do insert into target select * from tmp" or similar.
然后我只需执行“insert into target select * from tmp”之类的语句。
If this fails, or takes too long, I abort it and consider other methods (temporarily dropping indexes/constraints, etc)
如果失败,或花费太长时间,我会中止它并考虑其他方法(暂时删除索引/约束等)
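A rough sketch of that staging-table workflow, with target standing in for the real table and the file path as a placeholder:
-- 1. staging table with the same structure as the target, but no rows
CREATE TEMP TABLE tmp AS SELECT * FROM target WHERE 1=0;
-- 2. read the file into the staging table
COPY tmp FROM '/path/to/datafile.csv' WITH (FORMAT csv);
-- 3. check whatever can be checked: duplicates, keys that already exist in target, ...
-- 4. move the validated rows into the real table
INSERT INTO target SELECT * FROM tmp;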
回答by Elyor
I implemented a very fast PostgreSQL data loader using native libpq methods. Try my package https://www.nuget.org/packages/NpgsqlBulkCopy/
我使用原生 libpq 方法实现了一个非常快的 PostgreSQL 数据加载器。试试我的包 https://www.nuget.org/packages/NpgsqlBulkCopy/
回答by Sarah Frostenson
I just encountered this issue and would recommend csvsql for bulk imports to Postgres. To perform a bulk insert you'd simply createdb and then use csvsql, which connects to your database and creates individual tables for an entire folder of CSVs.
我刚刚遇到这个问题,推荐使用 csvsql 将数据批量导入 Postgres。要执行批量插入,您只需先 createdb,然后使用 csvsql,它会连接到您的数据库,并为整个文件夹中的 CSV 文件分别创建表。
$ createdb test
$ csvsql --db postgresql:///test --insert examples/*.csv