
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/15959061/


DELETE records which do not have a match in another table

Tags: sql, postgresql, exists, bigdata, sql-delete

Asked by miloxe

There are two tables linked by an id:


item_tbl (id)
link_tbl (item_id)

There are some records in item_tbl that don't have matching rows in link_tbl. A select that counts them would be:


SELECT COUNT(*)
FROM link_tbl lnk LEFT JOIN item_tbl itm ON lnk.item_id=itm.id
WHERE itm.id IS NULL

I would like to delete those orphan records (those which don't have a match in the other table) from link_tbl, but the only way I could think of was:


DELETE FROM link_tbl lnk
WHERE lnk.item_id NOT IN (SELECT itm.id FROM item_tbl itm)

There are:

  • 262,086,253 records in link_tbl
  • 3,033,811 records in item_tbl
  • 16,844,347 orphan records in link_tbl

The server has 4GB RAM and an 8-core CPU.



EXPLAIN DELETE FROM link_tbl lnk
WHERE lnk.item_id NOT IN (SELECT itm.id FROM item_tbl itm)

Returns:


Delete on link lnk  (cost=0.00..11395249378057.98 rows=131045918 width=6)
->  Seq Scan on link lnk  (cost=0.00..11395249378057.98 rows=131045918 width=6)
     Filter: (NOT (SubPlan 1))
     SubPlan 1
       ->  Materialize  (cost=0.00..79298.10 rows=3063207 width=4)
             ->  Seq Scan on item itm  (cost=0.00..52016.07 rows=3063207 width=4)

The questions are:


  1. Is there a better way to delete the orphan records from link_tbl?
  2. How accurate is the explain above, and how long could it take to delete those records?

    • Edit: fixed according to Erwin Brandstetter's comment.
    • Edit: PostgreSQL version is 9.1
    • Edit: some parts of postgresql.config
      1. shared_buffers = 368MB
      2. temp_buffers = 32MB
      3. work_mem = 32MB
      4. maintenance_work_mem = 64MB
      5. max_stack_depth = 6MB
      6. fsync = off
      7. synchronous_commit = off
      8. full_page_writes = off
      9. wal_buffers = 16MB
      10. wal_writer_delay = 5000ms
      11. commit_delay = 10
      12. commit_siblings = 10
      13. effective_cache_size = 1600MB

Resolution:


Thank you all for your advice, it was very helpful. I finally used the delete advised by Erwin Brandstetter (https://stackoverflow.com/a/15959896/1331340), but I tweaked it a little:


DELETE FROM link_tbl lnk
WHERE lnk.item_id BETWEEN 0 AND 10000
  AND lnk.item_id NOT IN (SELECT itm.id FROM item_tbl itm
                          WHERE itm.id BETWEEN 0 AND 10000)
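One thing that can make range-restricted slices like this cheaper is a plain B-tree index on link_tbl.item_id, so each slice can be located without a full sequential scan of all 262M rows. This is a hypothetical tuning step, not something from the original answers; the index name is illustrative:

```sql
-- Hypothetical: index the join/range column so each BETWEEN slice
-- is an index range scan instead of a seq scan over link_tbl.
-- CONCURRENTLY avoids blocking writers while the index builds.
CREATE INDEX CONCURRENTLY link_tbl_item_id_idx ON link_tbl (item_id);
```

The EXPLAIN ANALYZE output below shows a Seq Scan on link_tbl even with the range filter, which suggests no such index existed at the time.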

I compared results for NOT IN and NOT EXISTS; the output is below, although I used COUNT instead of DELETE, which I think should behave the same (for the sake of relative comparison):


EXPLAIN ANALYZE SELECT COUNT(*) 
FROM link_tbl lnk
WHERE lnk.item_id BETWEEN 0 AND 20000
  AND lnk.item_id NOT IN (SELECT itm.id
                          FROM item_tbl itm
                          WHERE itm.id BETWEEN 0 AND 20000);

QUERY PLAN
Aggregate  (cost=6002667.56..6002667.57 rows=1 width=0) (actual time=226817.086..226817.088 rows=1 loops=1)
->  Seq Scan on link_tbl lnk  (cost=1592.50..5747898.65 rows=101907564 width=0) (actual time=206.029..225289.570 rows=566625 loops=1)
     Filter: ((item_id >= 0) AND (item_id <= 20000) AND (NOT (hashed SubPlan 1)))
     SubPlan 1
       ->  Index Scan using item_tbl_pkey on item_tbl itm  (cost=0.00..1501.95 rows=36221 width=4) (actual time=0.056..99.266 rows=17560 loops=1)
             Index Cond: ((id >= 0) AND (id <= 20000))
Total runtime: 226817.211 ms


EXPLAIN ANALYZE SELECT COUNT(*)
FROM link_tbl lnk WHERE lnk.item_id>0 AND lnk.item_id<20000
  AND NOT EXISTS (SELECT 1 FROM item_tbl itm WHERE itm.id=lnk.item_id);

QUERY PLAN
Aggregate  (cost=8835772.00..8835772.01 rows=1 width=0)
   (actual time=1209235.133..1209235.135 rows=1 loops=1)
->  Hash Anti Join  (cost=102272.16..8835771.99 rows=1 width=0)
   (actual time=19315.170..1207900.612 rows=566534 loops=1)
     Hash Cond: (lnk.item_id = itm.id)
     ->  Seq Scan on link_tbl lnk  (cost=0.00..5091076.55 rows=203815128 width=4) (actual time=0.016..599147.604 rows=200301872 loops=1)
           Filter: ((item_id > 0) AND (item_id < 20000))
     ->  Hash  (cost=52016.07..52016.07 rows=3063207 width=4) (actual time=19313.976..19313.976 rows=3033811 loops=1)
           Buckets: 131072  Batches: 4  Memory Usage: 26672kB
           ->  Seq Scan on item_tbl itm  (cost=0.00..52016.07 rows=3063207 width=4) (actual time=0.013..9274.158 rows=3033811 loops=1)
Total runtime: 1209260.228 ms

NOT EXISTS was 5 times slower.


The actual delete of the data didn't take as long as I feared; I was able to delete it in 5 batches (10000-20000, 20000-100000, 100000-200000, 200000-1000000 and 1000000-1755441). First I found the max item_id, so I only had to go through half of the table.


When I tried NOT IN or NOT EXISTS without the range (with a select count), it didn't even finish; I let it run overnight and it was still running in the morning.


I think I was looking for DELETE with USING from wildplasser's answer (https://stackoverflow.com/a/15988033/1331340), but it came too late.


DELETE FROM one o
USING (
    SELECT o2.id
    FROM one o2
    LEFT JOIN two t ON t.one_id = o2.id
    WHERE t.one_id IS NULL
    ) sq
WHERE sq.id = o.id
    ;

Answered by wildplasser

I benchmarked four typical queries with different settings for {work_mem, effective_cache_size, random_page_cost}; these settings have the largest influence on the selected plan. I first did a warm-up run with my default settings to warm the cache. Note: the test set is small enough to allow all needed pages to be present in cache.


The test-set


SET search_path=tmp;

/************************/
DROP SCHEMA tmp CASCADE;
CREATE SCHEMA tmp ;
SET search_path=tmp;

CREATE TABLE one
        ( id SERIAL NOT NULL PRIMARY KEY
        , payload varchar
        );

CREATE TABLE two
        ( id SERIAL NOT NULL PRIMARY KEY
        , one_id INTEGER REFERENCES one
        , payload varchar
        );

INSERT INTO one (payload) SELECT 'Text_' || gs::text FROM generate_series(1,30000) gs;
INSERT INTO two (payload) SELECT 'Text_' || gs::text FROM generate_series(1,30000) gs;


UPDATE two t
SET one_id = o.id
FROM one o
WHERE o.id = t.id
AND random() < 0.1;

INSERT INTO two (one_id,payload) SELECT one_id,payload FROM two;
INSERT INTO two (one_id,payload) SELECT one_id,payload FROM two;
INSERT INTO two (one_id,payload) SELECT one_id,payload FROM two;

VACUUM ANALYZE one;
VACUUM ANALYZE two;
/***************/

The queries:


\echo NOT EXISTS()
EXPLAIN ANALYZE
DELETE FROM one o
WHERE NOT EXISTS ( SELECT * FROM two t
        WHERE t.one_id = o.id
        );

\echo NOT IN()
EXPLAIN ANALYZE 
DELETE FROM one o
WHERE o.id NOT IN ( SELECT one_id FROM two t)
        ;

\echo USING (subquery self LEFT JOIN two where NULL)
EXPLAIN ANALYZE
DELETE FROM one o
USING (
        SELECT o2.id
        FROM one o2
        LEFT JOIN two t ON t.one_id = o2.id
        WHERE t.one_id IS NULL
        ) sq
WHERE sq.id = o.id
        ;

\echo USING (subquery self WHERE NOT EXISTS(two)))
EXPLAIN ANALYZE
DELETE FROM one o
USING (
        SELECT o2.id
        FROM one o2
        WHERE NOT EXISTS ( SELECT *
                FROM two t WHERE t.one_id = o2.id
                )
        ) sq
WHERE sq.id = o.id
        ;

The result (summarised)


                        NOT EXISTS()    NOT IN()        USING(LEFT JOIN NULL)   USING(NOT EXISTS)
1) rpc=4.0.csz=1M wmm=64        80.358  14389.026       77.620                  72.917
2) rpc=4.0.csz=1M wmm=64000     60.527  69.104          51.851                  51.004
3) rpc=1.5.csz=1M wmm=64        69.804  10758.480       80.402                  77.356
4) rpc=1.5.csz=1M wmm=64000     50.872  69.366          50.763                  53.339
5) rpc=4.0.csz=1G wmm=64        84.117  7625.792        69.790                  69.627
6) rpc=4.0.csz=1G wmm=64000     49.964  67.018          49.968                  49.380
7) rpc=1.5.csz=1G wmm=64        68.567  3650.008        70.283                  69.933
8) rpc=1.5.csz=1G wmm=64000     49.800  67.298          50.116                  50.345

legend: 
rpc := "random_page_cost"
csz := "effective_cache_size"
wmm := "work_mem"
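The rows above map to session-level settings. For example, row 1 corresponds roughly to the following sketch (the units for csz are my assumption; the legend only gives "1M"/"1G"):

```sql
-- Approximate settings for benchmark row 1 (rpc=4.0, csz=1M, wmm=64).
-- work_mem of 64kB is deliberately tiny: it simulates a data set
-- that will not fit in a hash table.
SET random_page_cost = 4.0;
SET effective_cache_size = '1MB';
SET work_mem = '64kB';
```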

As you can see, the NOT IN() variant is very sensitive to a shortage of work_mem. Agreed, the setting of 64(kB) is very low, but this *more or less* corresponds to large data sets, which won't fit in hash tables either.


EXTRA: during the warm-up phase, the NOT EXISTS() query suffered from extreme FK-trigger contention. This appears to be a result of a conflict with the vacuum daemon, which is still active after the table set-up:


PostgreSQL 9.1.2 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1, 64-bit
NOT EXISTS()
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Delete on one o  (cost=6736.00..7623.94 rows=27962 width=12) (actual time=80.596..80.596 rows=0 loops=1)
   ->  Hash Anti Join  (cost=6736.00..7623.94 rows=27962 width=12) (actual time=49.174..61.327 rows=27050 loops=1)
         Hash Cond: (o.id = t.one_id)
         ->  Seq Scan on one o  (cost=0.00..463.00 rows=30000 width=10) (actual time=0.003..5.156 rows=30000 loops=1)
         ->  Hash  (cost=3736.00..3736.00 rows=240000 width=10) (actual time=49.121..49.121 rows=23600 loops=1)
               Buckets: 32768  Batches: 1  Memory Usage: 1015kB
               ->  Seq Scan on two t  (cost=0.00..3736.00 rows=240000 width=10) (actual time=0.006..33.790 rows=240000 loops=1)
 Trigger for constraint two_one_id_fkey: time=467720.117 calls=27050
 Total runtime: 467824.652 ms
(9 rows)
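The trigger line above dominates the runtime: roughly 467 seconds of FK checks for a delete whose join itself took 0.08 seconds. For a one-off bulk cleanup, one workaround (not suggested in the original answers, so treat it as a sketch) is to drop the constraint for the duration of the delete and re-add it afterwards; the constraint name two_one_id_fkey is taken from the plan above:

```sql
-- Hypothetical workaround for FK-trigger overhead during a bulk delete.
-- Only safe if no concurrent writes can violate the constraint meanwhile;
-- re-adding the FK revalidates all existing rows.
ALTER TABLE two DROP CONSTRAINT two_one_id_fkey;
-- ... run the bulk DELETE here ...
ALTER TABLE two ADD CONSTRAINT two_one_id_fkey
    FOREIGN KEY (one_id) REFERENCES one (id);
```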

Answered by Erwin Brandstetter

First off: your text says:


I would like to delete those orphan records from item_tbl.


But your code says:


DELETE FROM link_tbl lnk ...

Update: On rereading the question I find it more likely you want to delete orphaned rows in link_tbl. The row counts point in that direction. @Lucas' query would be correct in this case. But I am afraid NOT EXISTS is actually slower than NOT IN in this case.


To verify, I ran a test case that's remotely like your setup. I couldn't make it much bigger, or SQLfiddle would run into a timeout.


-> SQLfiddle.


NOT EXISTS would be faster for the reversed case. (I tested that, too.) EXISTS is better suited for testing the "many" side. And generally, there is more to gain with EXISTS than with NOT EXISTS - the latter has to check the whole table anyway. It's much harder to prove something does not exist than to prove that something exists. This universal truth also applies to databases.


Divide and conquer


This operation is suited to being split up. Especially if you have concurrent transactions (but even without), I would consider splitting the DELETE into several slices, so that the transaction can COMMIT after a decent amount of time.


Something like:


DELETE FROM link_tbl l
WHERE  l.item_id < 100000
AND    l.item_id NOT IN (SELECT i.id FROM item_tbl i)

Then l.item_id BETWEEN 100001 AND 200000, etc.
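Spelled out, the second slice would follow the same pattern (a sketch of the repeated step):

```sql
-- Second slice of the divide-and-conquer delete; subsequent slices
-- just shift the BETWEEN bounds until max(item_id) is reached.
DELETE FROM link_tbl l
WHERE  l.item_id BETWEEN 100001 AND 200000
AND    l.item_id NOT IN (SELECT i.id FROM item_tbl i);
```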


You cannot automate this with a function: that would wrap everything into one transaction and defeat the purpose. So you'd have to script it from any client.
Or you could use ..


dblink


This additional module lets you run separate transactions in any database, including the one it's running in. That can be done via a persistent connection, which should remove most of the connection overhead. For instructions on how to install it:
How to use (install) dblink in PostgreSQL?


DO would do the job (PostgreSQL 9.0 or later). Running 100 DELETE commands, for 50000 item_ids at a time:


DO
$$
DECLARE
   _sql text;
BEGIN

PERFORM dblink_connect('port=5432 dbname=mydb');  -- your connection parameters

FOR i IN 0 .. 100
LOOP
   _sql := format('
   DELETE FROM link_tbl l
   WHERE  l.item_id BETWEEN %s AND %s
   AND    l.item_id NOT IN (SELECT i.id FROM item_tbl i)'
   , (50000 * i)::text
   , (50000 * (i+1))::text);

   PERFORM  dblink_exec(_sql);
END LOOP;

PERFORM dblink_disconnect();

END
$$
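For PostgreSQL 9.1, the dblink functions used above come from an extension that has to be installed once per database (a setup step, assuming sufficient privileges):

```sql
-- One-time setup on 9.1+; earlier versions load dblink via the
-- contrib SQL script instead of CREATE EXTENSION.
CREATE EXTENSION dblink;
```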

If the script should get interrupted: dblink_connect writes to the DB log what it executed, so you see what's done already.


Answered by Lucas

Perhaps this:


DELETE FROM link_tbl lnk
WHERE NOT EXISTS
  ( SELECT 1 FROM item_tbl item WHERE item.id = lnk.item_id );

When dealing with large numbers of records, it can be much more efficient to create a temp table, perform an INSERT INTO ... SELECT * FROM ..., then drop the original table, rename the temp table, and add your indexes back...

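A sketch of that copy-and-swap approach for this schema (the new table and index names are illustrative, not from the original answer):

```sql
-- Keep only the rows that DO have a match, then swap the tables.
-- Beware: this drops indexes, constraints, triggers and grants on
-- link_tbl; they must be recreated, and the swap needs an
-- exclusive lock, so it only suits a maintenance window.
CREATE TABLE link_tbl_new AS
SELECT l.*
FROM   link_tbl l
JOIN   item_tbl i ON i.id = l.item_id;

DROP TABLE link_tbl;
ALTER TABLE link_tbl_new RENAME TO link_tbl;
-- Re-add indexes, e.g.:
-- CREATE INDEX link_tbl_item_id_idx ON link_tbl (item_id);
```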