PostgreSQL: how to speed up a slow UPDATE query
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/11086761/
How to speed up a slow UPDATE query
Asked by Mike Christensen
I have the following UPDATE query:
UPDATE Indexer.Pages SET LastError=NULL where LastError is not null;
Right now, this query takes about 93 minutes to complete. I'd like to find ways to make this a bit faster.
The Indexer.Pages table has around 506,000 rows, and about 490,000 of them contain a value for LastError, so I doubt I can take advantage of any indexes here.
The table (when uncompressed) has about 46 gigs of data in it; however, the majority of that data is in a text field called html. I believe simply loading and unloading that many pages is causing the slowdown. One idea would be to make a new table with just the Id and the html fields, and keep Indexer.Pages as small as possible. However, testing this theory would be a decent amount of work since I actually don't have the hard disk space to create a copy of the table. I'd have to copy it over to another machine, drop the table, then copy the data back, which would probably take all evening.
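A cheaper first check than copying the table would be to ask Postgres where the bytes actually live; a minimal sketch, assuming the size functions available in 9.0 (pg_relation_size, pg_table_size, pg_total_relation_size):

-- Main heap vs. TOAST (where large html values would be stored out of line)
-- vs. everything including indexes.
SELECT pg_size_pretty(pg_relation_size('indexer.pages'))       AS main_heap,
       pg_size_pretty(pg_table_size('indexer.pages')
                      - pg_relation_size('indexer.pages'))     AS toast_and_maps,
       pg_size_pretty(pg_total_relation_size('indexer.pages')) AS total_with_indexes;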
Ideas? I'm using Postgres 9.0.0.
UPDATE:
Here's the schema:
CREATE TABLE indexer.pages
(
  id uuid NOT NULL,
  url character varying(1024) NOT NULL,
  firstcrawled timestamp with time zone NOT NULL,
  lastcrawled timestamp with time zone NOT NULL,
  recipeid uuid,
  html text NOT NULL,
  lasterror character varying(1024),
  missingings smallint,
  CONSTRAINT pages_pkey PRIMARY KEY (id),
  CONSTRAINT indexer_pages_uniqueurl UNIQUE (url)
);
I also have two indexes:
CREATE INDEX idx_indexer_pages_missingings
  ON indexer.pages
  USING btree (missingings)
  WHERE missingings > 0;
and
CREATE INDEX idx_indexer_pages_null
  ON indexer.pages
  USING btree (recipeid)
  WHERE NULL::boolean;
There are no triggers on this table, and there is one other table that has a FK constraint on Pages.PageId.
Accepted answer by Erwin Brandstetter
What @kgrittn posted as a comment is the best answer so far. I am merely filling in details.
Before you do anything else, you should upgrade PostgreSQL to a current version, at least to the latest security release of your major version. See the versioning guidelines of the project.
I also want to stress what Kevin mentioned about indexes involving the column LastError. Normally, HOT updates can recycle dead rows on a data page and make UPDATEs a lot faster - effectively removing (most of) the need for vacuuming.
If your column is used in any index in any way, HOT UPDATEs are disabled, because it would break the index(es). If that is the case, you should be able to speed up the query a lot by deleting all of these indexes before you UPDATE and recreating them later.
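In the posted schema, no index involves lasterror, so HOT updates are not blocked; but a bulk UPDATE that cannot go HOT everywhere still has to maintain every index. A sketch of dropping and recreating the two posted indexes around the UPDATE (object names are the question's own; whether this pays off depends on concurrent load):

BEGIN;
DROP INDEX indexer.idx_indexer_pages_missingings;
DROP INDEX indexer.idx_indexer_pages_null;

UPDATE indexer.pages SET lasterror = NULL WHERE lasterror IS NOT NULL;

-- recreate exactly what was dropped
CREATE INDEX idx_indexer_pages_missingings ON indexer.pages USING btree (missingings) WHERE missingings > 0;
CREATE INDEX idx_indexer_pages_null ON indexer.pages USING btree (recipeid) WHERE NULL::boolean;
COMMIT;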
In this context it would help to run multiple smaller UPDATEs:
If ...
... the updated column is not involved in any indexes (enabling HOT updates).
... the UPDATE is easily divided into multiple patches in multiple transactions.
... the rows in those patches are spread out over the table (physically, not logically).
... there are no other concurrent transactions keeping dead tuples from being reused.
Then you would not need to VACUUM in between multiple patches, because HOT updates can reuse dead tuples directly - only dead tuples from previous transactions, not from the same or concurrent ones. You may want to schedule a VACUUM at the end of the operation, or just let auto-vacuuming do its job.
The same could be done with any other index that is not needed for the UPDATE - and judging from your numbers the UPDATE is not going to use an index anyway. If you update large parts of your table, building new indexes from scratch is much faster than incrementally updating indexes with every changed row.
Also, your update is not likely to break any foreign key constraints. You could try to delete & recreate those, too. This does open a time slot where referential integrity would not be enforced. If the integrity is violated during the UPDATE, you get an error when trying to recreate the FK. If you do it all within one transaction, concurrent transactions never get to see the dropped FK, but you take a write lock on the table (same as with dropping / recreating indexes or triggers).
Lastly, disable & enable triggers that are not needed for the update.
Be sure to do all of this in one transaction. Maybe do it in a number of smaller patches, so it does not block concurrent operations for too long.
So:
BEGIN;
ALTER TABLE tbl DISABLE TRIGGER user; -- disable all self-made triggers
-- DROP indexes (& fk constraints ?)
-- UPDATE ...
-- RECREATE indexes (& fk constraints ?)
ALTER TABLE tbl ENABLE TRIGGER user;
COMMIT;
You cannot run VACUUM inside a transaction block. Per documentation:
VACUUM cannot be executed inside a transaction block.
You could split your operation into a few big chunks and run in between:
VACUUM ANALYZE tbl;
If you don't have to deal with concurrent transactions you could (even more effectively):
ALTER TABLE tbl DISABLE TRIGGER user; -- disable all self-made triggers
-- DROP indexes (& fk constraints ?)
-- Multiple UPDATEs with logical slices of the table
-- each slice in its own transaction.
-- VACUUM ANALYZE tbl; -- optionally in between, or autovacuum kicks in
-- RECREATE indexes (& fk constraints ?)
ALTER TABLE tbl ENABLE TRIGGER user;
Answered by wildplasser
UPDATE Indexer.Pages
SET LastError=NULL
;
The where clause is not needed since the NULL fields are already NULL, so it won't harm to set them to NULL again (I don't think this would affect performance significantly).
Given your number_of_rows = 500K and your table size = 46G, I conclude that your average row size is 90KB. That is huge. Maybe you could move {unused, sparse} columns of your table to other tables?
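That estimate is easy to verify in place; a back-of-the-envelope sketch (pg_total_relation_size is available in 9.0, and the constant function call is legal next to the aggregate):

-- total on-disk size divided by row count gives the average row footprint
SELECT pg_total_relation_size('indexer.pages') / count(*) AS avg_row_bytes
FROM   indexer.pages;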
Answered by Gordon Linoff
Your theory is probably correct. Reading the full table (and then doing anything) is probably causing the slow-down.
Why don't you just create another table that has PageId and LastError? Initialize this with the data in the table you have now (which should take less than 93 minutes). Then, use the LastError from the new table.
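A hedged sketch of one reading of that initialization; the table name page_errors is made up here, and the final TRUNCATE shows what the original 93-minute UPDATE would turn into:

-- one-time initialization from the existing data
CREATE TABLE indexer.page_errors AS
SELECT id, lasterror
FROM   indexer.pages
WHERE  lasterror IS NOT NULL;

ALTER TABLE indexer.page_errors ADD PRIMARY KEY (id);

-- clearing all errors is now trivially fast
TRUNCATE indexer.page_errors;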
At your leisure, you can remove LastError from your existing table.
By the way, I don't normally recommend keeping two copies of a column in two separate tables. In this case, though, you sound like you are stuck and need a way to proceed.