最快的“获取重复项”SQL 脚本
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/197111/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Fastest "Get Duplicates" SQL script
提问by Johan Bresler
What is an example of a fast SQL to get duplicates in datasets with hundreds of thousands of records. I typically use something like:
什么是快速 SQL 在具有数十万条记录的数据集中获取重复项的示例。我通常使用类似的东西:
SELECT afield1, afield2 FROM afile a
WHERE 1 < (SELECT count(afield1) FROM afile b WHERE a.afield1 = b.afield1);
But this is quite slow.
但这很慢。
回答by Vinko Vrsalovic
This is the more direct way:
这是更直接的方法:
select afield1,count(afield1) from atable
group by afield1 having count(afield1) > 1
回答by Tony Andrews
You could try:
你可以试试:
select afield1, afield2 from afile a
where afield1 in
( select afield1
from afile
group by afield1
having count(*) > 1
);
回答by Walter Mitty
A similar question was asked last week. There are some good answers there.
上周有人问过类似的问题。那里有一些很好的答案。
SQL to find duplicate entries (within a group)
In that question, the OP was interested in all the columns (fields) in the table (file), but rows belonged in the same group if they had the same key value (afield1).
在那个问题中,OP 对表(文件)中的所有列(字段)感兴趣,但如果行具有相同的键值(afield1),则它们属于同一组。
There are three kinds of answers:
答案分为三种:
subqueries in the where clause, like some of the other answers in here.
where 子句中的子查询,就像这里的其他一些答案一样。
an inner join between the table and the groups viewed as a table (my answer)
表和被视为表的组之间的内部连接(我的回答)
and analytic queries (something that's new to me).
和分析查询(对我来说是新的东西)。
回答by Magnus Smith
By the way, if anyone wants to remove the duplicates, I have used this:
顺便说一句,如果有人想删除重复项,我已经使用了这个:
delete from MyTable where MyTableID in (
select max(MyTableID)
from MyTable
group by Thing1, Thing2, Thing3
having count(*) > 1
)
回答by Simon East
This should be reasonably fast (even faster if the dupeFields are indexed).
这应该相当快(如果 dupeFields 被索引甚至更快)。
SELECT DISTINCT a.id, a.dupeField1, a.dupeField2
FROM TableX a
JOIN TableX b
ON a.dupeField1 = b.dupeField2
AND a.dupeField2 = b.dupeField2
AND a.id != b.id
I guess the only downside to this query is that because you're not doing a COUNT(*)
you can't check for the number of timesit is duplicated, only that it appears more than once.
我想这个查询的唯一缺点是因为你没有做 aCOUNT(*)
你不能检查它被复制的次数,只是它出现了不止一次。