Update VERY LARGE PostgreSQL database table efficiently

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same CC BY-SA terms and attribute it to the original authors (not me): StackOverflow.

Original question: http://stackoverflow.com/questions/112249/
Asked by Christian
I have a very large database table in PostgreSQL and a column like "copied". Every new row starts uncopied and will later be replicated to another thing by a background program. There is a partial index on that table, "btree(ID) WHERE replicated=0". The background program does a select for at most 2000 entries (LIMIT 2000), works on them, and then commits the changes in one transaction using 2000 prepared SQL commands.
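For concreteness, the setup described above corresponds roughly to the following sketch (table and column names are illustrative, not from the original system):

CREATE INDEX replicated_idx ON big_table USING btree (id) WHERE replicated = 0;

-- what the background program runs for each batch of work:
SELECT id FROM big_table WHERE replicated = 0 ORDER BY id LIMIT 2000;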
Now the problem is that I want to give the user an option to reset this replicated value, making it all zero again.
An
update table set replicated=0;
is not possible:
- It takes too much time
- It doubles the size of the table because of MVCC
- It is done in one transaction: it either fails or goes through
I actually don't need transaction features for this case: if the system goes down, it is fine if only part of the rows have been processed.
Several other problems: Doing an
update table set replicated=0 where id > 10000 and id < 20000
is also bad: it does a sequential scan over the whole table, which is too slow. And even if it didn't, it would still be slow because there would be too many seeks.
What I really need is a way of going through all rows, changing them, without being bound to one giant transaction.
Strangely, an
UPDATE table
SET replicated=0
WHERE ID in (SELECT id FROM table WHERE replicated=0 LIMIT 10000)
is also slow, although it should be a good approach: go through the table in disk order...
(Note that in that case there was also an index that covered this)
(An UPDATE ... LIMIT like MySQL's is unavailable in PostgreSQL)
BTW: The real problem is more complicated, and we're talking about an embedded system here that is already deployed, so remote schema changes are difficult, but possible. Unfortunately, it's PostgreSQL 7.4.
The number of rows I'm talking about is e.g. 90,000,000. The size of the database can be several dozen gigabytes.
The database itself only contains 5 tables, one of which is very large. But that is not bad design, because these embedded boxes only operate with one kind of entity; it's not an ERP system or something like that!
Any ideas?
Answered by Dan
How about adding a new table to store this replicated value (with a primary key linking each record to the main table)? Then you simply add a record for every replicated item, and delete records to remove the replicated flag. (Or maybe the other way around: a record for every non-replicated record, depending on which is the common case.)
That would also simplify the case where you want to set them all back to 0, as you can just truncate the table (which zeroes the table size on disk; you don't even have to vacuum to free up the space).
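A minimal sketch of that scheme, with one side-table row per already-replicated record (all names hypothetical):

-- side table: one row per replicated record
CREATE TABLE replicated_ids (id bigint PRIMARY KEY);

-- the background program marks a record as replicated:
INSERT INTO replicated_ids VALUES (12345);

-- fetch up to 2000 records that still need replication:
SELECT t.id
FROM big_table t
LEFT JOIN replicated_ids r ON r.id = t.id
WHERE r.id IS NULL
LIMIT 2000;

-- reset all flags at once:
TRUNCATE TABLE replicated_ids;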
Answered by SquareCog
If you are trying to reset the whole table, not just a few rows, it is usually faster (on extremely large datasets, not on regular tables) to simply CREATE TABLE bar AS SELECT <everything except the copied flag>, 0 FROM foo, and then swap the tables and drop the old one. Obviously you would need to ensure nothing gets inserted into the original table while you are doing that. You'll need to recreate that index, too.
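As a sketch, assuming foo has columns id and data plus the flag column (called replicated here, matching the partial index from the question; all names illustrative):

-- rebuild the table with the flag reset everywhere
CREATE TABLE bar AS SELECT id, data, 0 AS replicated FROM foo;

-- swap: needs writers stopped; DDL is transactional in PostgreSQL
BEGIN;
DROP TABLE foo;
ALTER TABLE bar RENAME TO foo;
COMMIT;

-- recreate the partial index
CREATE INDEX replicated_idx ON foo USING btree (id) WHERE replicated = 0;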
Edit: A simple improvement in order to avoid locking the table while you copy 14 Gigabytes:
lock;
create a new table, bar;
swap tables so that all writes go to bar;
unlock;
create table baz as select from foo;
drop foo;
create the index on baz;
lock;
insert into baz from bar;
swap tables;
unlock;
drop bar;
(let writes happen while you are doing the copy, and insert them post-factum).
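Translated into concrete SQL, the improved procedure might look roughly like this (only a sketch under the same assumed columns as above; constraints, defaults and triggers are ignored):

-- phase 1: divert incoming writes into an empty side table (brief lock)
BEGIN;
LOCK TABLE foo IN ACCESS EXCLUSIVE MODE;
CREATE TABLE bar AS SELECT * FROM foo LIMIT 0;  -- empty clone of foo's columns
ALTER TABLE foo RENAME TO foo_src;
ALTER TABLE bar RENAME TO foo;                  -- writers now land in the empty table
COMMIT;

-- phase 2: rebuild the bulk of the data without holding any lock
CREATE TABLE baz AS SELECT id, data, 0 AS replicated FROM foo_src;
DROP TABLE foo_src;
CREATE INDEX replicated_idx ON baz USING btree (id) WHERE replicated = 0;

-- phase 3: fold in the rows written meanwhile, then swap again (brief lock)
BEGIN;
LOCK TABLE foo IN ACCESS EXCLUSIVE MODE;        -- "foo" is still the side table here
INSERT INTO baz SELECT id, data, 0 FROM foo;
ALTER TABLE foo RENAME TO bar_old;
ALTER TABLE baz RENAME TO foo;
COMMIT;
DROP TABLE bar_old;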
Answered by Grant Johnson
While you likely cannot fix the problem of space usage (it is temporary, just until a vacuum), you can probably speed up the process considerably in terms of wall-clock time. The fact that PostgreSQL uses MVCC means that you should be able to do this without any issues related to newly inserted rows. The CREATE TABLE AS SELECT approach will get around some of the performance issues, but does not allow continued use of the table, and takes just as much space. Just ditch the index, rebuild it afterwards, and do a vacuum:
drop index replication_flag;
update big_table set replicated=0;
create index replication_flag on big_table using btree (id) where replicated=0;
vacuum full analyze big_table;
Answered by norlan V
I think it's better to upgrade your Postgres to version 8.X; the low Postgres version is probably the cause. Also try the query below. I hope this can help.
UPDATE table1 SET name = table2.value
FROM table2
WHERE table1.id = table2.id;
Answered by Tometzky
This is pseudocode. You'll need a 400 MB (for ints) or 800 MB (for bigints) temporary file (you can compress it with zlib if that is a problem). It will need about 100 scans of the table for the vacuums. But it will not bloat the table by more than 1% (at most 1,000,000 dead rows at any time). You can also trade fewer scans for more table bloat.
// write all ids to a temporary file in disk order
// (no where clause, so the scan returns rows in disk order)
$file = tmpfile();
for $id, $replicated in query("select id, replicated from table") {
    if ( $replicated <> 0 ) {
        write($file, &$id, sizeof($id));
    }
}
// prepare an update query ($1 is PostgreSQL's placeholder syntax)
query("prepare set_replicated_0(bigint) as
       update table set replicated=0 where id=$1");
// reread the file, run the prepared query, and after every 1000000
// updates commit and vacuum the table
rewind($file);
$counter = 0;
query("start transaction");
while read($file, &$id, sizeof($id)) {
    query("execute set_replicated_0($id)");
    $counter++;
    if ( $counter % 1000000 == 0 ) {
        query("commit");
        query("vacuum table");   // vacuum must run outside a transaction
        query("start transaction");
    }
}
query("commit");
query("vacuum table");
close($file);
Answered by Kapil
I guess what you need to do is:
a. Copy the 2000 records' PK values into a temporary table with the same standard limit, etc.
b. Select the same 2000 records and perform the necessary operations in the cursor as is.
c. If successful, run a single update query against the records in the temp table. Clear the temp table and run step a again.
d. If unsuccessful, clear the temp table without running the update query.
Simple, efficient and reliable.
Regards, KT
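A rough sketch of one iteration of that loop, applied to the reset use case (hypothetical names; the "necessary operations" of step b are application-specific):

-- step a: stash the PKs of one batch still to be reset
CREATE TEMP TABLE batch AS
SELECT id FROM big_table WHERE replicated <> 0 LIMIT 2000;

-- step b: perform whatever per-row work is needed in the application ...

-- step c: on success, a single update for exactly that batch
UPDATE big_table SET replicated = 0 WHERE id IN (SELECT id FROM batch);
DROP TABLE batch;   -- step d: on failure, just drop it without updating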