如何在 PostgreSQL 中进行大型非阻塞更新?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/1113277/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
How do I do large non-blocking updates in PostgreSQL?
提问by
I want to do a large update on a table in PostgreSQL, but I don't need the transactional integrity to be maintained across the entire operation, because I know that the column I'm changing is not going to be written to or read during the update. I want to know if there is an easy way in the psql consoleto make these types of operations faster.
我想对 PostgreSQL 中的表进行大量更新,但我不需要在整个操作中维护事务完整性,因为我知道我正在更改的列不会在此期间被写入或读取更新。我想知道在 psql 控制台中是否有一种简单的方法可以使这些类型的操作更快。
For example, let's say I have a table called "orders" with 35 million rows, and I want to do this:
例如,假设我有一个名为“orders”的表,有 3500 万行,我想这样做:
UPDATE orders SET status = null;
To avoid being diverted to an offtopic discussion, let's assume that all the values of status for the 35 million columns are currently set to the same (non-null) value, thus rendering an index useless.
为避免被转移到题外讨论,让我们假设 3500 万列的所有状态值当前都设置为相同(非空)值,从而使索引无用。
The problem with this statement is that it takes a very long time to go into effect (solely because of the locking), and all changed rows are locked until the entire update is complete. This update might take 5 hours, whereas something like
这个语句的问题在于,它需要很长时间才能生效(完全是因为锁定),并且所有更改的行都被锁定,直到整个更新完成。此更新可能需要 5 小时,而类似于
UPDATE orders SET status = null WHERE (order_id > 0 and order_id < 1000000);
might take 1 minute. Over 35 million rows, doing the above and breaking it into chunks of 35 would only take 35 minutes and save me 4 hours and 25 minutes.
可能需要 1 分钟。超过 3500 万行,执行上述操作并将其分成 35 个块只需要 35 分钟,并为我节省 4 小时 25 分钟。
I could break it down even further with a script (using pseudocode here):
我可以用一个脚本(这里使用伪代码)进一步分解它:
for (i = 0 to 3500) {
db_operation ("UPDATE orders SET status = null
WHERE (order_id >" + (i*1000)"
+ " AND order_id <" + ((i+1)*1000) " + ")");
}
This operation might complete in only a few minutes, rather than 35.
此操作可能会在几分钟内完成,而不是 35 分钟。
So that comes down to what I'm really asking. I don't want to write a freaking script to break down operations every single time I want to do a big one-time update like this. Is there a way to accomplish what I want entirely within SQL?
所以这归结为我真正要问的问题。我不想写一个该死的脚本来分解操作,每次我想做这样的一次大的更新时。有没有办法完全在 SQL 中完成我想要的?
回答by Erwin Brandstetter
Column / Row
列/行
... I don't need the transactional integrity to be maintained across the entire operation, because I know that the column I'm changing is not going to be written to or read during the update.
...我不需要在整个操作中维护事务完整性,因为我知道我正在更改的列不会在更新期间被写入或读取。
Any UPDATE
in PostgreSQL's MVCC modelwrites a new version of the whole row. If concurrent transactions change anycolumn of the same row, time-consuming concurrency issues arise. Details in the manual.Knowing the same columnwon't be touched by concurrent transactions avoids somepossible complications, but not others.
任何UPDATE
在PostgreSQL的MVCC模型写入新版本的整行。如果并发事务改变了同一行的任何列,就会出现耗时的并发问题。手册中的详细信息。知道并发事务不会触及同一列可以避免一些可能的复杂情况,但不会避免其他复杂情况。
Index
指数
To avoid being diverted to an offtopic discussion, let's assume that all the values of status for the 35 million columns are currently set to the same (non-null) value, thus rendering an index useless.
为避免被转移到题外讨论,让我们假设 3500 万列的所有状态值当前都设置为相同(非空)值,从而使索引无用。
When updating the whole table(or major parts of it) Postgres never uses an index. A sequential scan is faster when all or most rows have to be read. On the contrary: Index maintenance means additional cost for the UPDATE
.
当更新整个表(或其主要部分)时,Postgres从不使用索引。当必须读取所有或大部分行时,顺序扫描会更快。相反:索引维护意味着UPDATE
.
Performance
表现
For example, let's say I have a table called "orders" with 35 million rows, and I want to do this:
UPDATE orders SET status = null;
例如,假设我有一个名为“orders”的表,有 3500 万行,我想这样做:
UPDATE orders SET status = null;
I understand you are aiming for a more general solution (see below). But to address the actual questionasked: This can be dealt with in a matter milliseconds, regardless of table size:
我知道您的目标是更通用的解决方案(见下文)。但是要解决实际提出的问题:无论表大小如何,这都可以在几毫秒内处理:
ALTER TABLE orders DROP column status
, ADD column status text;
When a column is added with
ADD COLUMN
, all existing rows in the table are initialized with the column's default value (NULL
if noDEFAULT
clause is specified). If there is no DEFAULT clause, this is merely a metadata change...
当列添加有 时
ADD COLUMN
,表中所有现有行都使用列的默认值进行初始化(NULL
如果未DEFAULT
指定子句)。如果没有 DEFAULT 子句,这只是元数据更改...
And:
和:
The
DROP COLUMN
form does not physically remove the column, but simply makes it invisible to SQL operations. Subsequent insert and update operations in the table will store a null value for the column. Thus, dropping a column is quick but it will not immediately reduce the on-disk size of your table, as the space occupied by the dropped column is not reclaimed. The space will be reclaimed over time as existing rows are updated. (These statements do not apply when dropping the system oid column; that is done with an immediate rewrite.)
该
DROP COLUMN
表单不会物理删除该列,而只是使其对 SQL 操作不可见。表中的后续插入和更新操作将为该列存储一个空值。因此,删除列很快,但不会立即减少表的磁盘大小,因为删除列占用的空间不会被回收。随着现有行的更新,空间将随着时间的推移被回收。(这些语句在删除系统 oid 列时不适用;这是通过立即重写完成的。)
Make sure you don't have objects depending on the column (foreign key constraints, indices, views, ...). You would need to drop / recreate those. Barring that, tiny operations on the system catalog table pg_attribute
do the job. Requires an exclusive lockon the table which may be a problem for heavy concurrent load. Since it only takes a few milliseconds, you should still be fine.
确保您没有依赖于列的对象(外键约束、索引、视图等)。您需要删除/重新创建那些。除此之外,系统目录表上的微小操作可以pg_attribute
完成这项工作。需要对表进行排他锁,这对于并发负载较重的情况可能是一个问题。因为它只需要几毫秒,所以你应该没问题。
If you have a column default you want to keep, add it back in a separate command. Doing it in the same command would apply it to all rows immediately, voiding the effect. You could then update the existing columns in batches. Follow the documentation link and read the Notesin the manual.
如果您要保留列默认值,请将其添加回单独的命令中。在同一命令中执行此操作会立即将其应用于所有行,从而使效果无效。然后,您可以批量更新现有列。按照文档链接并阅读手册中的注释。
General solution
通用解决方案
dblink
has been mentioned in another answer. It allows access to "remote" Postgres databases in implicit separate connections. The "remote" database can be the current one, thereby achieving "autonomous transactions": what the function writes in the "remote" db is committed and can't be rolled back.
dblink
已经在另一个答案中提到了。它允许在隐式独立连接中访问“远程”Postgres 数据库。“远程”数据库可以是当前数据库,从而实现“自治事务”:函数写入“远程”数据库的内容已提交且无法回滚。
This allows to run a single function that updates a big table in smaller parts and each part is committed separately. Avoids building up transaction overhead for very big numbers of rows and, more importantly, releases locks after each part. This allows concurrent operations to proceed without much delay and makes deadlocks less likely.
这允许运行单个函数,该函数以较小的部分更新大表,并且每个部分都单独提交。避免为大量行增加事务开销,更重要的是,在每个部分之后释放锁。这允许并发操作在没有太多延迟的情况下进行,并减少死锁的可能性。
If you don't have concurrent access, this is hardly useful - except to avoid ROLLBACK
after an exception. Also consider SAVEPOINT
for that case.
如果您没有并发访问,这几乎没有用 - 除非ROLLBACK
在异常之后避免。还要考虑SAVEPOINT
这种情况。
Disclaimer
免责声明
First of all, lots of small transactions are actually more expensive. This only makes sense for big tables. The sweet spot depends on many factors.
首先,很多小额交易实际上更昂贵。这仅对大表有意义。甜蜜点取决于许多因素。
If you are not sure what you are doing: a single transaction is the safe method. For this to work properly, concurrent operations on the table have to play along. For instance: concurrent writescan move a row to a partition that's supposedly already processed. Or concurrent reads can see inconsistent intermediary states. You have been warned.
如果您不确定自己在做什么:单个事务是安全的方法。为了使其正常工作,表上的并发操作必须同时进行。例如:并发写入可以将一行移动到据称已经处理的分区。或者并发读取可以看到不一致的中间状态。你被警告了。
Step-by-step instructions
分步说明
The additional module dblink needs to be installed first:
需要先安装附加模块dblink:
Setting up the connection with dblink very much depends on the setup of your DB cluster and security policies in place. It can be tricky. Related later answer with more how to connect with dblink:
设置与 dblink 的连接在很大程度上取决于您的数据库集群的设置和适当的安全策略。这可能很棘手。与更多如何与 dblink 连接的相关稍后回答:
Create a FOREIGN SERVER
and a USER MAPPING
as instructed there to simplify and streamline the connection (unless you have one already).
Assuming a serial PRIMARY KEY
with or without some gaps.
按照那里的说明创建 aFOREIGN SERVER
和 aUSER MAPPING
以简化和简化连接(除非您已经有了)。
假设serial PRIMARY KEY
有或没有一些间隙。
CREATE OR REPLACE FUNCTION f_update_in_steps()
RETURNS void AS
$func$
DECLARE
_step int; -- size of step
_cur int; -- current ID (starting with minimum)
_max int; -- maximum ID
BEGIN
SELECT INTO _cur, _max min(order_id), max(order_id) FROM orders;
-- 100 slices (steps) hard coded
_step := ((_max - _cur) / 100) + 1; -- rounded, possibly a bit too small
-- +1 to avoid endless loop for 0
PERFORM dblink_connect('myserver'); -- your foreign server as instructed above
FOR i IN 0..200 LOOP -- 200 >> 100 to make sure we exceed _max
PERFORM dblink_exec(
$$UPDATE public.orders
SET status = 'foo'
WHERE order_id >= $$ || _cur || $$
AND order_id < $$ || _cur + _step || $$
AND status IS DISTINCT FROM 'foo'$$); -- avoid empty update
_cur := _cur + _step;
EXIT WHEN _cur > _max; -- stop when done (never loop till 200)
END LOOP;
PERFORM dblink_disconnect();
END
$func$ LANGUAGE plpgsql;
Call:
称呼:
SELECT f_update_in_steps();
You can parameterize any part according to your needs: the table name, column name, value, ... just be sure to sanitize identifiers to avoid SQL injection:
您可以根据需要对任何部分进行参数化:表名、列名、值……只要确保清理标识符以避免 SQL 注入:
About avoiding empty UPDATE:
关于避免空更新:
回答by Tometzky
You should delegate this column to another table like this:
您应该将此列委托给另一个表,如下所示:
create table order_status (
order_id int not null references orders(order_id) primary key,
status int not null
);
Then your operation of setting status=NULL will be instant:
那么你设置 status=NULL 的操作将是即时的:
truncate order_status;
回答by Tometzky
First of all - are you sure that you need to update all rows?
首先 - 您确定需要更新所有行吗?
Perhaps some of the rows already have status
NULL?
也许某些行已经有status
NULL?
If so, then:
如果是这样,那么:
UPDATE orders SET status = null WHERE status is not null;
As for partitioning the change - that's not possible in pure sql. All updates are in single transaction.
至于对更改进行分区 - 这在纯 sql 中是不可能的。所有更新都在单个事务中。
One possible way to do it in "pure sql" would be to install dblink, connect to the same database using dblink, and then issue a lot of updates over dblink, but it seems like overkill for such a simple task.
在“纯 sql”中执行此操作的一种可能方法是安装 dblink,使用 dblink 连接到同一个数据库,然后通过 dblink 发出大量更新,但对于这样一个简单的任务来说,这似乎有点过头了。
Usually just adding proper where
solves the problem. If it doesn't - just partition it manually. Writing a script is too much - you can usually make it in a simple one-liner:
通常只需添加适当的即可where
解决问题。如果没有 - 只需手动对其进行分区。写一个脚本太多了——你通常可以用一个简单的单行代码来写:
perl -e '
for (my $i = 0; $i <= 3500000; $i += 1000) {
printf "UPDATE orders SET status = null WHERE status is not null
and order_id between %u and %u;\n",
$i, $i+999
}
'
I wrapped lines here for readability, generally it's a single line. Output of above command can be fed to psql directly:
为了可读性,我在这里换行,通常是单行。上述命令的输出可以直接提供给 psql:
perl -e '...' | psql -U ... -d ...
Or first to file and then to psql (in case you'd need the file later on):
或者先归档,然后是 psql(以防您稍后需要该文件):
perl -e '...' > updates.partitioned.sql
psql -U ... -d ... -f updates.partitioned.sql
回答by mys
I would use CTAS:
我会使用 CTAS:
begin;
create table T as select col1, col2, ..., <new value>, colN from orders;
drop table orders;
alter table T rename to orders;
commit;
回答by Martin Torhage
Are you sure this is because of locking? I don't think so and there's many other possible reasons. To find out you can always try to do just the locking. Try this: BEGIN; SELECT NOW(); SELECT * FROM order FOR UPDATE; SELECT NOW(); ROLLBACK;
你确定这是因为锁定?我不这么认为,还有许多其他可能的原因。要找出答案,您可以随时尝试只进行锁定。试试这个:开始;现在选择(); SELECT * FROM order FOR UPDATE; 立即选择(); 回滚;
To understand what's really happening you should run an EXPLAIN first (EXPLAIN UPDATE orders SET status...) and/or EXPLAIN ANALYZE. Maybe you'll find out that you don't have enough memory to do the UPDATE efficiently. If so, SET work_mem TO 'xxxMB'; might be a simple solution.
要了解真正发生了什么,您应该首先运行 EXPLAIN(EXPLAIN UPDATE orders SET status...)和/或 EXPLAIN ANALYZE。也许您会发现您没有足够的内存来有效地执行 UPDATE。如果是这样,将 work_mem 设置为 'xxxMB'; 可能是一个简单的解决方案。
Also, tail the PostgreSQL log to see if some performance related problems occurs.
另外,跟踪 PostgreSQL 日志以查看是否发生了一些与性能相关的问题。
回答by Martin v. L?wis
Postgres uses MVCC (multi-version concurrency control), thus avoiding any locking if you are the only writer; any number of concurrent readers can work on the table, and there won't be any locking.
Postgres 使用 MVCC(多版本并发控制),因此如果您是唯一的编写者,则可以避免任何锁定;任意数量的并发读取器都可以在表上工作,并且不会有任何锁定。
So if it really takes 5h, it must be for a different reason (e.g. that you dohave concurrent writes, contrary to your claim that you don't).
因此,如果真的需要 5 小时,那一定是出于不同的原因(例如,您确实有并发写入,这与您声称没有的说法相反)。
回答by mikl
I am by no means a DBA, but a database design where you'd frequently have to update 35 million rows might have… issues.
我绝不是一名 DBA,但您经常需要更新 3500 万行的数据库设计可能会出现……问题。
A simple WHERE status IS NOT NULL
might speed up things quite a bit (provided you have an index on status) – not knowing the actual use case, I'm assuming if this is run frequently, a great part of the 35 million rows might already have a null status.
一个简单的方法WHERE status IS NOT NULL
可能会大大加快速度(假设您有状态索引)——不知道实际用例,我假设如果经常运行,3500 万行中的很大一部分可能已经具有空状态.
However, you can make loops within the query via the LOOP statement. I'll just cook up a small example:
但是,您可以通过LOOP 语句在查询中进行循环。我会做一个小例子:
CREATE OR REPLACE FUNCTION nullstatus(count INTEGER) RETURNS integer AS $$
DECLARE
i INTEGER := 0;
BEGIN
FOR i IN 0..(count/1000 + 1) LOOP
UPDATE orders SET status = null WHERE (order_id > (i*1000) and order_id <((i+1)*1000));
RAISE NOTICE 'Count: % and i: %', count,i;
END LOOP;
RETURN 1;
END;
$$ LANGUAGE plpgsql;
It can then be run by doing something akin to:
然后可以通过执行以下操作来运行它:
SELECT nullstatus(35000000);
You might want to select the row count, but beware that the exact row count can take a lot of time. The PostgreSQL wiki has an article about slow counting and how to avoid it.
您可能想要选择行数,但要注意准确的行数可能需要很长时间。PostgreSQL wiki 有一篇关于缓慢计数以及如何避免它的文章。
Also, the RAISE NOTICE part is just there to keep track on how far along the script is. If you're not monitoring the notices, or do not care, it would be better to leave it out.
此外,RAISE NOTICE 部分只是为了跟踪脚本的进度。如果您不监视通知,或者不在乎,最好将其排除在外。
回答by rogerdpack
Some options that haven't been mentioned:
一些没有提到的选项:
Use the new tabletrick. Probably what you'd have to do in your case is write some triggers to handle it so that changes to the original table also go propagated to your table copy, something like that... (perconais an example of something that does it the trigger way). Another option might be the "create a new column then replace the old one with it" trick, to avoid locks (unclear if helps with speed).
使用新的表格技巧。在您的情况下,您可能需要做的是编写一些触发器来处理它,以便对原始表的更改也会传播到您的表副本,类似这样......(percona是一个例子触发方式)。另一种选择可能是“创建一个新列然后用它替换旧列”技巧,以避免锁定(不清楚是否有助于提高速度)。
Possibly calculate the max ID, then generate "all the queries you need" and pass them in as a single query like update X set Y = NULL where ID < 10000 and ID >= 0; update X set Y = NULL where ID < 20000 and ID > 10000; ...
then it might not do as much locking, and still be all SQL, though you do have extra logic up front to do it :(
可能计算最大 ID,然后生成“您需要的所有查询”并将它们作为单个查询传递,就像update X set Y = NULL where ID < 10000 and ID >= 0; update X set Y = NULL where ID < 20000 and ID > 10000; ...
那样它可能不会做那么多的锁定,并且仍然是所有的 SQL,尽管您确实有额外的逻辑来做到这一点:(
回答by axiopisty
PostgreSQL version 11 handles this for you automatically with the Fast ALTER TABLE ADD COLUMN with a non-NULL defaultfeature. Please do upgrade to version 11 if possible.
PostgreSQL 版本 11使用具有非 NULL 默认功能的Fast ALTER TABLE ADD COLUMN自动为您处理此问题。如果可能,请升级到版本 11。
An explanation is provided in this blog post.
此博客文章中提供了解释。