SQL: Best way to delete millions of rows by ID

Disclaimer: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/8290900/

Date: 2020-09-01 13:25:20  Source: igfitidea

Best way to delete millions of rows by ID

Tags: sql, postgresql, bigdata, sql-delete, postgresql-performance

Asked by Anthony Greco

I need to delete about 2 million rows from my PG database. I have a list of IDs that I need to delete. However, any way I try to do this is taking days.

I tried putting them in a table and doing it in batches of 100. Four days later, this is still running with only 297268 rows deleted. (I had to select 100 IDs from an ID table, delete the rows whose ID was in that list, then delete those 100 from the IDs table.)
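
One round of the batching described above can be sketched like this (the table names tbl and ids are assumptions; the RETURNING clause ties both deletes to the same 100 IDs):

```sql
-- One batch: remove 100 IDs from the work list and
-- delete the matching rows from the target table.
WITH batch AS (
    DELETE FROM ids
    WHERE  id IN (SELECT id FROM ids LIMIT 100)
    RETURNING id
)
DELETE FROM tbl
WHERE  id IN (SELECT id FROM batch);
```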

I tried:

DELETE FROM tbl WHERE id IN (select * from ids)

That's taking forever, too. Hard to gauge how long, since I can't see its progress until it's done, but the query was still running after 2 days.

I'm just looking for the most effective way to delete from a table when I know the specific IDs to delete, and there are millions of them.

Answered by Erwin Brandstetter

It all depends ...

  • Delete all indexes (except the one on the ID which you need for the delete).
    Recreate them afterwards (= much faster than incremental updates to indexes)

  • Check if you have triggers that can safely be deleted / disabled temporarily

  • Do foreign keys reference your table? Can they be dropped? Dropped temporarily?

  • Depending on your autovacuum settings it may help to run VACUUM ANALYZE before the operation.

  • Assume no concurrent write access to the involved tables, or you may have to lock tables exclusively, or this route may not be for you at all.

  • Some of the points listed in the related manual chapter Populating a Database may also be of use, depending on your setup.

  • If you delete large portions of the table and the rest fits into RAM, the fastest and easiest way would be this:

SET temp_buffers = '1000MB'; -- or whatever you can spare temporarily

CREATE TEMP TABLE tmp AS
SELECT t.*
FROM   tbl t
LEFT   JOIN del_list d USING (id)
WHERE  d.id IS NULL;      -- copy surviving rows into temporary table

TRUNCATE tbl;             -- empty table - truncate is very fast for big tables

INSERT INTO tbl
SELECT * FROM tmp;        -- insert back surviving rows.

This way you don't have to recreate views, foreign keys or other dependent objects. Read about the temp_buffers setting in the manual. This method is fast as long as the table fits into memory, or at least most of it. Be aware that you can lose data if your server crashes in the middle of this operation. You can wrap all of it into a transaction to make it safer.
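
As a sketch, the whole sequence wrapped in a transaction (same tbl and del_list as above; SET LOCAL confines the setting to this transaction, and note that temp_buffers must be set before any temporary table is used in the session):

```sql
BEGIN;
SET LOCAL temp_buffers = '1000MB';   -- only affects this transaction

CREATE TEMP TABLE tmp AS
SELECT t.*
FROM   tbl t
LEFT   JOIN del_list d USING (id)
WHERE  d.id IS NULL;                 -- copy surviving rows

TRUNCATE tbl;

INSERT INTO tbl
SELECT * FROM tmp;
COMMIT;                              -- everything rolls back if any step fails
```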

Run ANALYZE afterwards. Or VACUUM ANALYZE if you did not go the truncate route, or VACUUM FULL ANALYZE if you want to bring the table to minimum size. For big tables consider the alternatives CLUSTER / pg_repack:

For small tables, a simple DELETE instead of TRUNCATE is often faster:

DELETE FROM tbl t
USING  del_list d
WHERE  t.id = d.id;

Read the Notes section for TRUNCATE in the manual. In particular (as Pedro also pointed out in his comment):

TRUNCATE cannot be used on a table that has foreign-key references from other tables, unless all such tables are also truncated in the same command. [...]

And:

TRUNCATE will not fire any ON DELETE triggers that might exist for the tables.

Answered by francs

We know the update/delete performance of PostgreSQL is not as powerful as Oracle's. When we need to delete millions or tens of millions of rows, it's really difficult and takes a long time.

However, we can still do this in production dbs. The following is my idea:

First, we should create a log table with 2 columns: id and flag (id refers to the ID you want to delete; flag can be Y or null, with Y signifying the record was successfully deleted).

Later, we create a function that does the delete task in batches of 10,000 rows. You can see more details on my blog. Though it's in Chinese, you can still get the information you want from the SQL code there.

Make sure the id column of both tables is indexed, as it will make this run faster.
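
A sketch of that idea (not the actual code from the blog). The table names tbl and del_log are assumptions, and COMMIT inside a procedure requires PostgreSQL 11 or later; on older versions, call a plain function once per batch from the client instead:

```sql
CREATE OR REPLACE PROCEDURE delete_in_batches(batch_size int DEFAULT 10000)
LANGUAGE plpgsql AS
$$
DECLARE
    affected int;
BEGIN
    LOOP
        -- pick one batch of not-yet-processed IDs, delete the rows,
        -- and mark the batch as done in the log table
        WITH batch AS (
            SELECT id
            FROM   del_log
            WHERE  flag IS NULL
            ORDER  BY id
            LIMIT  batch_size
        ), del AS (
            DELETE FROM tbl t
            USING  batch b
            WHERE  t.id = b.id
        )
        UPDATE del_log l
        SET    flag = 'Y'
        FROM   batch b
        WHERE  l.id = b.id;

        GET DIAGNOSTICS affected = ROW_COUNT;
        EXIT WHEN affected = 0;

        COMMIT;   -- release locks and persist progress after every batch
    END LOOP;
END;
$$;

CALL delete_in_batches(10000);
```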

Answered by Saulius Žemaitaitis

You may try copying all the data in the table except the IDs you want to delete into a new table, then renaming and swapping the tables (provided you have enough resources to do it).
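
A sketch of that copy-and-swap approach, assuming the same tbl and del_list names as in the first answer above (note that views and foreign keys still point at the old table and must be moved manually):

```sql
CREATE TABLE tbl_new (LIKE tbl INCLUDING ALL);  -- clone structure, indexes, defaults

INSERT INTO tbl_new                  -- copy only the surviving rows
SELECT t.*
FROM   tbl t
LEFT   JOIN del_list d USING (id)
WHERE  d.id IS NULL;

BEGIN;
ALTER TABLE tbl     RENAME TO tbl_old;
ALTER TABLE tbl_new RENAME TO tbl;
COMMIT;

-- DROP TABLE tbl_old;  -- once everything checks out
```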

This is not expert advice.

Answered by Zaldy Baguinon

Two possible answers:

  1. Your table may have lots of constraints or triggers attached to it. Deleting a record then incurs many processor cycles and checks against other tables.

  2. You may need to put this statement inside a transaction.

Answered by Mark Ransom

First make sure you have an index on the ID fields, both in the table you want to delete from and the table you are using for deletion IDs.

100 at a time seems too small. Try 1000 or 10000.

There's no need to delete anything from the deletion-ID table. Add a new column for a batch number and fill it: batch 1 for the first 1000 IDs, batch 2 for the next 1000, and so on. Then make sure the deletion query includes the batch number.
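
That batch-number scheme might look like this (assuming the deletion IDs live in a table del_list; integer division assigns 1000 IDs per batch):

```sql
ALTER TABLE del_list ADD COLUMN batch int;

-- number the IDs and assign 1000 per batch (batch 1, 2, 3, ...)
UPDATE del_list d
SET    batch = sub.rn / 1000 + 1
FROM  (SELECT id, row_number() OVER (ORDER BY id) - 1 AS rn
       FROM   del_list) sub
WHERE  d.id = sub.id;

-- then delete one batch at a time
DELETE FROM tbl t
USING  del_list d
WHERE  t.id = d.id
AND    d.batch = 1;   -- repeat with 2, 3, ...
```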

Answered by Vincent Agnello

The easiest way to do this would be to drop all your constraints and then do the delete.

Answered by FunctorSalad

If the table you're deleting from is referenced by some_other_table (and you don't want to drop the foreign keys even temporarily), make sure you have an index on the referencing column in some_other_table!

I had a similar problem and used auto_explain with auto_explain.log_nested_statements = true, which revealed that the delete was actually doing seq scans on some_other_table:

    Query Text: SELECT 1 FROM ONLY "public"."some_other_table" x WHERE $1 OPERATOR(pg_catalog.=) "id" FOR KEY SHARE OF x
    LockRows  (cost=[...])
      ->  Seq Scan on some_other_table x  (cost=[...])
            Filter: ($1 = id)

Apparently it's trying to lock the referencing rows in the other table (which shouldn't exist, or the delete will fail). After I created indexes on the referencing tables, the delete was orders of magnitude faster.

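
With such an index in place, the foreign-key check becomes an index lookup instead of a sequential scan per deleted row. A minimal sketch (the column name tbl_id is hypothetical):

```sql
-- tbl_id stands in for whatever column in some_other_table
-- references the table being deleted from
CREATE INDEX ON some_other_table (tbl_id);
```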