PostgreSQL: bulk insert of hundreds of millions of records
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/2842409/
Bulk insert of hundreds of millions of records
Asked by Dave Jarvis
What is the fastest way to insert 237 million records into a table that has rules (for distributing data across child tables)?
I have tried or considered:
- Insert statements.
- Transactional inserts (BEGIN and COMMIT).
- The COPY FROM command.
- http://pgbulkload.projects.postgresql.org/
Inserts are too slow (four days) and COPY FROM ignores rules (and has other issues).
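To make the failure mode concrete (the path is a placeholder): a plain COPY into the rule-bearing parent never fires the INSERT rules, so every row lands in climate.measurement itself rather than in the child partitions.

COPY climate.measurement (station_id, taken, amount, category_id, flag)
    FROM '/data/measurements.csv' WITH CSV;
-- COPY does not invoke rules, so nothing is routed to the climate.measurement_MM_CCC children.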
Example data:
station_id,taken,amount,category_id,flag
1,'1984-07-1',0,4,
1,'1984-07-2',0,4,
1,'1984-07-3',0,4,
1,'1984-07-4',0,4,T
Table structure (with one rule included):
CREATE TABLE climate.measurement
(
id bigserial NOT NULL,
station_id integer NOT NULL,
taken date NOT NULL,
amount numeric(8,2) NOT NULL,
category_id smallint NOT NULL,
flag character varying(1) NOT NULL DEFAULT ' '::character varying
)
WITH (
OIDS=FALSE
);
ALTER TABLE climate.measurement OWNER TO postgres;
CREATE OR REPLACE RULE i_measurement_01_001 AS
ON INSERT TO climate.measurement
WHERE date_part('month'::text, new.taken)::integer = 1 AND new.category_id = 1 DO INSTEAD INSERT INTO climate.measurement_01_001 (id, station_id, taken, amount, category_id, flag)
VALUES (new.id, new.station_id, new.taken, new.amount, new.category_id, new.flag);
The data was originally in MySQL, but must be switched to PostgreSQL for performance reasons (and to leverage the PL/R extension).
Thank you!
Answered by araqnid
Split your input into separate files outside the database and upload each one using COPY, rather than relying on the rule to distribute them. If the rule you give is representative, that's a trivial text transformation to apply. Also, splitting up front will let you load the split files in parallel, if your disk system is up to it.
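A minimal sketch of the per-partition load, assuming the input has already been split into one CSV per child table, that the paths below are placeholders, and that the child tables carry the same id default as the parent (otherwise the files would have to supply id values too):

COPY climate.measurement_01_001 (station_id, taken, amount, category_id, flag)
    FROM '/data/split/measurement_01_001.csv' WITH CSV;
COPY climate.measurement_01_002 (station_id, taken, amount, category_id, flag)
    FROM '/data/split/measurement_01_002.csv' WITH CSV;
-- ...one COPY per child table; separate sessions can run these in parallel.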
Seriously, don't rely on the rule to do this distribution for a bulk load. It's practically always the case that bulk load and transactional load need different approaches, unless you're prepared to brute-force one or the other (and, usually, wait).
For instance, your rule uses date_part() to extract the month from the date - so in order to determine the child table, postgres needs to analyse the date string, convert it to a timestamp, and then convert the timestamp back to a calendar just to get the month field back out again. But if you're writing something to do this upfront, you can just do substr($date,5,2) (or equivalent): which do you think will be faster?
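Purely to contrast the two extractions (in practice the substr call would live in the pre-processing script, not in SQL; note that SQL's substr is 1-indexed):

SELECT date_part('month', date '1984-07-02');  -- 7: the value must be parsed as a date first
SELECT substr('1984-07-02', 6, 2);             -- '07': plain text slicing, no date parsing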
It's also an opportunity to clean up the data format so COPY will accept it. Note you can specify the columns with the COPY command: if you weren't doing that with that schema and example file, you'd get errors due to the extra "id" column on the front. ("copy from ... with csv header" may have figured that out, but maybe not... the "header" option may just make it skip the first line).
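A hedged example of both points, with a placeholder path: the explicit column list keeps the file's five fields from being mapped onto the table's leading id column (so the bigserial default generates the ids), and HEADER merely skips the first line of the file rather than matching columns by name.

COPY climate.measurement_01_001 (station_id, taken, amount, category_id, flag)
    FROM '/data/split/measurement_01_001.csv' WITH CSV HEADER;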
I've just loaded about 280e6 rows into a postgresql instance myself in a few hours, so it's certainly not impossible. For this initial load, I've set fsync=off; the plan is to load the backlog and then turn it back on again for regular daily loads. I had to set checkpoint_segments=40 to avoid getting checkpoint warnings in the logs. This is just being loaded onto my dev machine: I'm using a dedicated disk for the database, separate from the disk used for the xlogs (i.e. I created a tablespace on the big disk and created the database inside that tablespace). The instance has shared_buffers set to 1GB and checkpoint_completion_target set to 0.5. I tried loading some of the partitions in parallel and it didn't provide much improvement, so I suspect the slow disk is the bottleneck rather than the DB itself.
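The setup described above corresponds roughly to the following; the tablespace path and database name are placeholders, and fsync should be switched back on once the backlog is loaded:

-- postgresql.conf, for the initial bulk load only:
--   fsync = off
--   checkpoint_segments = 40
--   shared_buffers = 1GB
--   checkpoint_completion_target = 0.5
-- Keep the data files on a different disk from the WAL (pg_xlog):
CREATE TABLESPACE bulk_ts LOCATION '/mnt/bigdisk/pgdata';  -- placeholder path
CREATE DATABASE climate_db TABLESPACE = bulk_ts;           -- placeholder name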
Just another 1.7e9 rows to go... should be finished tomorrow sometime I hope.
Answered by Oliver Twist
- Create the parent table without any indexes, only the columns and their types (create table some_data (c_1 int, c_2 varchar, ...)).
- Create a sequence for enumerating the new data tables.
- Take a new id from the sequence.
- Create a new table for the real data with the LIKE keyword (create table some_data_X (like some_data)).
- Insert the real data into some_data_X with COPY in binary format.
- Create the indexes, time constraints, etc. (put your cores to work by using multiple connections to PostgreSQL).
- Inherit the parent table.
- Now it's ready to SELECT! In this way I have achieved 400,000-500,000 inserts per second, with index creation, on 10 columns (2 Xeons, 24 cores, 24 GB of memory, SSD). A sketch of one such load cycle follows below.
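A minimal sketch of one such load cycle, using placeholder table, column, and file names (the binary file is assumed to have been produced beforehand, e.g. by the loading application or by COPY ... TO ... WITH BINARY):

-- One-time setup: parent table (no indexes) and a sequence numbering the child tables.
CREATE TABLE some_data (c_1 int, c_2 varchar);
CREATE SEQUENCE some_data_table_seq;

-- Per load, suppose nextval('some_data_table_seq') returned 42:
CREATE TABLE some_data_42 (LIKE some_data);               -- structure only, no indexes yet
COPY some_data_42 FROM '/data/chunk_42.bin' WITH BINARY;  -- bulk load in binary format
CREATE INDEX some_data_42_c_1_idx ON some_data_42 (c_1);  -- indexes only after the data is in
ALTER TABLE some_data_42 INHERIT some_data;               -- attach to the parent
-- Queries against some_data now see the new child's rows.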
Bonus: in a separate thread, remove old data (the some_data_X with the lowest X): a huge circular buffer with indexing!
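Dropping the oldest chunk would then be roughly (placeholder names again):

ALTER TABLE some_data_17 NO INHERIT some_data;  -- detach the oldest child from the parent
DROP TABLE some_data_17;                        -- then drop it in one cheap operation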
Answered by Stephen Denne
The PostgreSQL documentation contains a page on populating a database, which might help you once you've followed araqnid's advice to pre-process the input so you can use COPY.