SQL 在PostgreSQL中查找两个大表之间的差异
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/15330403/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Find difference between two big tables in PostgreSQL
提问by odiszapc
I have two similar tables in Postgres with just one 32-byte latin field (simple md5 hash). Both tables have ~30,000,000 rows. Tables have little difference (10-1000 rows are different)
我在 Postgres 中有两个类似的表,只有一个 32 字节的拉丁字段(简单的 md5 哈希)。两个表都有大约 30,000,000 行。表格差别不大(10-1000行不一样)
Is it possible with Postgres to find a difference between these tables, the result should be 10-1000 rows I described above.
Postgres 是否有可能找到这些表之间的差异,结果应该是我上面描述的 10-1000 行。
This is not a real task, I just want to know about how PostgreSQL deals with JOIN-like logic.
这不是一个真正的任务,我只是想知道 PostgreSQL 如何处理类似 JOIN 的逻辑。
回答by Erwin Brandstetter
The best option is probably an EXISTS
anti-semi-join:
最好的选择可能是EXISTS
反半连接:
tbl1
is the table with surplus rows in this example:
tbl1
是本例中具有剩余行的表:
SELECT *
FROM tbl1
WHERE NOT EXISTS (SELECT 1 FROM tbl2 WHERE tbl2.col = tbl1.col);
If you don't know which table has surplus rows or both have, you can either repeat the above query after switching table names, or:
如果您不知道哪个表有多余的行或两者都有,您可以在切换表名后重复上述查询,或者:
SELECT *
FROM tbl1
FULL OUTER JOIN tbl2 USING (col)
WHERE tbl2 col IS NULL OR
tbl1.col IS NULL;
Overview over basic techniques in a later post:
在后面的文章中对基本技术的概述:
BTW, it would be muchmore efficient to use uuid
columns for md5 hashes:
顺便说一句,这将是很多更有效地使用uuid
列MD5哈希值:
回答by ThomasH
To augment existing answers I use the row()
function for the join condition. This allows you to compare entire rows. E.g. my typical query to see the symmetric difference looks like this
为了增加现有的答案,我使用row()
了连接条件的函数。这允许您比较整行。例如,我查看对称差异的典型查询如下所示
select *
from tbl1
full outer join tbl2
on row(tbl1) = row(tbl2)
where tbl1.col is null
or tbl2.col is null
回答by 0xCAFEBABE
In my experience, NOT IN with a subquery takes a very long time. I'd do it with an inclusive join:
根据我的经验,NOT IN 使用子查询需要很长时间。我会通过包容性加入来做到这一点:
DELETE FROM table1 where ID IN (
SELECT id FROM table1
LEFT OUTER JOIN table2 on table1.hashfield = table2.hashfield
WHERE table2.hashfield IS NULL)
And then do the same the other way around for the other table.
然后对另一张桌子以相反的方式做同样的事情。