SQL 在PostgreSQL中查找两个大表之间的差异

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/15330403/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-01 14:11:28  来源:igfitidea点击:

Find difference between two big tables in PostgreSQL

sqlpostgresqlleft-joinexistsfull-outer-join

提问by odiszapc

I have two similar tables in Postgres with just one 32-byte latin field (simple md5 hash). Both tables have ~30,000,000 rows. Tables have little difference (10-1000 rows are different)

我在 Postgres 中有两个类似的表,只有一个 32 字节的拉丁字段(简单的 md5 哈希)。两个表都有大约 30,000,000 行。表格差别不大(10-1000行不一样)

Is it possible with Postgres to find a difference between these tables, the result should be 10-1000 rows I described above.

Postgres 是否有可能找到这些表之间的差异,结果应该是我上面描述的 10-1000 行。

This is not a real task, I just want to know about how PostgreSQL deals with JOIN-like logic.

这不是一个真正的任务,我只是想知道 PostgreSQL 如何处理类似 JOIN 的逻辑。

回答by Erwin Brandstetter

The best option is probably an EXISTSanti-semi-join:

最好的选择可能是EXISTS反半连接:

tbl1is the table with surplus rows in this example:

tbl1是本例中具有剩余行的表:

SELECT *
FROM   tbl1
WHERE  NOT EXISTS (SELECT 1 FROM tbl2 WHERE tbl2.col = tbl1.col);

If you don't know which table has surplus rows or both have, you can either repeat the above query after switching table names, or:

如果您不知道哪个表有多余的行或两者都有,您可以在切换表名后重复上述查询,或者:

SELECT *
FROM   tbl1
FULL   OUTER JOIN tbl2 USING (col)
WHERE  tbl2 col IS NULL OR
       tbl1.col IS NULL;

Overview over basic techniques in a later post:

在后面的文章中对基本技术的概述:

BTW, it would be muchmore efficient to use uuidcolumns for md5 hashes:

顺便说一句,这将是很多更有效地使用uuid列MD5哈希值:

回答by ThomasH

To augment existing answers I use the row()function for the join condition. This allows you to compare entire rows. E.g. my typical query to see the symmetric difference looks like this

为了增加现有的答案,我使用row()了连接条件的函数。这允许您比较整行。例如,我查看对称差异的典型查询如下所示

select *
from tbl1
full outer join tbl2 
    on row(tbl1) = row(tbl2)
where tbl1.col is null
or    tbl2.col is null

回答by 0xCAFEBABE

In my experience, NOT IN with a subquery takes a very long time. I'd do it with an inclusive join:

根据我的经验,NOT IN 使用子查询需要很长时间。我会通过包容性加入来做到这一点:

DELETE FROM table1 where ID IN (
SELECT id FROM table1
LEFT OUTER JOIN table2 on table1.hashfield = table2.hashfield
WHERE table2.hashfield IS NULL)

And then do the same the other way around for the other table.

然后对另一张桌子以相反的方式做同样的事情。