Fastest "get duplicates" SQL script

Question

Asked by Johan Bresler

What is an example of a fast SQL query to get duplicates in datasets with hundreds of thousands of records? I typically use something like:

SELECT afield1, afield2 FROM afile a 
WHERE 1 < (SELECT count(afield1) FROM afile b WHERE a.afield1 = b.afield1);

But this is quite slow.

Answer 1

Answered by Vinko Vrsalovic

This is the more direct way:

select afield1,count(afield1) from atable 
group by afield1 having count(afield1) > 1

Answer 2

Answered by Tony Andrews

You could try:

select afield1, afield2 from afile a
where afield1 in
( select afield1
  from afile
  group by afield1
  having count(*) > 1
);

Answer 3

Answered by Walter Mitty

A similar question was asked last week. There are some good answers there.

SQL to find duplicate entries (within a group)

In that question, the OP was interested in all the columns (fields) in the table (file), but rows belonged in the same group if they had the same key value (afield1).

There are three kinds of answers:

subqueries in the where clause, like some of the other answers in here.

an inner join between the table and the groups viewed as a table (my answer)

and analytic queries (something that's new to me).
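
The second and third kinds are not shown in this thread, so here is a rough sketch of what they might look like. The table and column names follow the question; the sample rows, the alias g, and the dupe_count column are made up for illustration. The analytic form assumes a database with window function support.

```sql
-- Hypothetical sample data; names follow the question.
CREATE TABLE afile (afield1 INTEGER, afield2 VARCHAR(20));
INSERT INTO afile VALUES (1, 'a');
INSERT INTO afile VALUES (1, 'b');
INSERT INTO afile VALUES (2, 'c');
INSERT INTO afile VALUES (3, 'd');
INSERT INTO afile VALUES (3, 'e');
INSERT INTO afile VALUES (3, 'f');

-- Inner join between the table and its groups, with the groups
-- treated as a derived table.
SELECT a.afield1, a.afield2, g.dupe_count
FROM afile a
JOIN (
  SELECT afield1, COUNT(*) AS dupe_count
  FROM afile
  GROUP BY afield1
  HAVING COUNT(*) > 1
) g ON g.afield1 = a.afield1;

-- Analytic (window) form: count the rows in each afield1 partition,
-- then keep only rows whose partition holds more than one row.
SELECT afield1, afield2, dupe_count
FROM (
  SELECT afield1, afield2,
         COUNT(*) OVER (PARTITION BY afield1) AS dupe_count
  FROM afile
) t
WHERE dupe_count > 1;
```

With this sample data both queries should return the five rows whose afield1 is 1 or 3, along with the number of rows sharing that key. Which form is faster depends on the database and its indexes, so it is worth comparing execution plans on real data.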

Answer 4

Answered by Magnus Smith

By the way, if anyone wants to remove the duplicates, I have used this:

delete from MyTable where MyTableID in (
  select max(MyTableID)
  from MyTable
  group by Thing1, Thing2, Thing3
  having count(*) > 1
)
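
Note that the statement above deletes only the highest-ID row of each duplicate group per run, so a group with three or more copies needs repeated runs. A possible single-pass variant is sketched below: it keeps the lowest MyTableID in each group and deletes the rest. The sample rows are hypothetical, and some databases (MySQL, for instance) reject a DELETE whose subquery reads the same table, so treat this as a sketch rather than a portable recipe.

```sql
-- Hypothetical sample data; names follow the answer above.
CREATE TABLE MyTable (
  MyTableID INTEGER PRIMARY KEY,
  Thing1 INTEGER,
  Thing2 INTEGER,
  Thing3 INTEGER
);
INSERT INTO MyTable VALUES (1, 1, 1, 1);
INSERT INTO MyTable VALUES (2, 1, 1, 1);
INSERT INTO MyTable VALUES (3, 1, 1, 1);
INSERT INTO MyTable VALUES (4, 2, 2, 2);

-- Keep the lowest ID of every (Thing1, Thing2, Thing3) group
-- and delete every other row in one statement.
DELETE FROM MyTable
WHERE MyTableID NOT IN (
  SELECT MIN(MyTableID)
  FROM MyTable
  GROUP BY Thing1, Thing2, Thing3
);

-- Rows 1 and 4 should remain.
SELECT MyTableID FROM MyTable ORDER BY MyTableID;
```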

Answer 5

Answered by Simon East

This should be reasonably fast (even faster if the dupeFields are indexed).

SELECT DISTINCT a.id, a.dupeField1, a.dupeField2
FROM TableX a
JOIN TableX b
ON a.dupeField1 = b.dupeField1
AND a.dupeField2 = b.dupeField2
AND a.id != b.id

I guess the only downside to this query is that, because you're not doing a COUNT(*), you can't check how many times a value is duplicated, only that it appears more than once.
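
To make the indexing remark concrete, here is a small sketch. The index name idx_tablex_dupes and the sample rows are invented for illustration. It also shows an EXISTS form of the same self-comparison, which avoids the DISTINCT because each row of a is tested once rather than joined to every matching row of b. Whether it actually runs faster depends on the database.

```sql
-- Hypothetical sample data; names follow the answer above.
CREATE TABLE TableX (
  id INTEGER PRIMARY KEY,
  dupeField1 INTEGER,
  dupeField2 INTEGER
);
INSERT INTO TableX VALUES (1, 10, 20);
INSERT INTO TableX VALUES (2, 10, 20);
INSERT INTO TableX VALUES (3, 10, 21);
INSERT INTO TableX VALUES (4, 11, 20);

-- Composite index covering both fields used in the comparison.
CREATE INDEX idx_tablex_dupes ON TableX (dupeField1, dupeField2);

-- A row is a duplicate if some other row has the same pair of values.
SELECT a.id, a.dupeField1, a.dupeField2
FROM TableX a
WHERE EXISTS (
  SELECT 1
  FROM TableX b
  WHERE b.dupeField1 = a.dupeField1
    AND b.dupeField2 = a.dupeField2
    AND b.id <> a.id
);
```

With this sample data the query should return the rows with id 1 and 2.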
