SQL:如何根据两个字段查找重复项?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/3504012/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
SQL: How to find duplicates based on two fields?
提问by James Adams
I have rows in an Oracle database table which should be unique for a combination of two fields but the unique constrain is not set up on the table so I need to find all rows which violate the constraint myself using SQL. Unfortunately my meager SQL skills aren't up to the task.
我在 Oracle 数据库表中有行,它对于两个字段的组合应该是唯一的,但是表上没有设置唯一约束,所以我需要自己使用 SQL 找到所有违反约束的行。不幸的是,我微薄的 SQL 技能无法胜任这项任务。
My table has three columns which are relevant: entity_id, station_id, and obs_year. For each row the combination of station_id and obs_year should be unique, and I want to find out if there are rows which violate this by flushing them out with an SQL query.
我的表有三列相关:entity_id、station_id 和 obs_year。对于每一行,station_id 和 obs_year 的组合应该是唯一的,我想通过使用 SQL 查询将它们刷新出来来找出是否有违反这一点的行。
I have tried the following SQL (suggested by this previous question) but it doesn't work for me (I get ORA-00918 column ambiguously defined):
我已经尝试了以下 SQL(由上一个问题建议),但它对我不起作用(我得到了含糊不清的 ORA-00918 列):
SELECT
entity_id, station_id, obs_year
FROM
mytable t1
INNER JOIN (
SELECT entity_id, station_id, obs_year FROM mytable
GROUP BY entity_id, station_id, obs_year HAVING COUNT(*) > 1) dupes
ON
t1.station_id = dupes.station_id AND
t1.obs_year = dupes.obs_year
Can someone suggest what I'm doing wrong, and/or how to solve this?
有人可以建议我做错了什么,和/或如何解决这个问题?
回答by Quassnoi
SELECT *
FROM (
SELECT t.*, ROW_NUMBER() OVER (PARTITION BY station_id, obs_year ORDER BY entity_id) AS rn
FROM mytable t
)
WHERE rn > 1
回答by Michael Pakhantsov
SELECT entity_id, station_id, obs_year
FROM mytable t1
WHERE EXISTS (SELECT 1 from mytable t2 Where
t1.station_id = t2.station_id
AND t1.obs_year = t2.obs_year
AND t1.RowId <> t2.RowId)
回答by FrustratedWithFormsDesigner
Re-write of your query
重写您的查询
SELECT
t1.entity_id, t1.station_id, t1.obs_year
FROM
mytable t1
INNER JOIN (
SELECT entity_id, station_id, obs_year FROM mytable
GROUP BY entity_id, station_id, obs_year HAVING COUNT(*) > 1) dupes
ON
t1.station_id = dupes.station_id AND
t1.obs_year = dupes.obs_year
I think the ambiguous column error (ORA-00918) was because you were select
ing columns whose names appeared in both the table and the subquery, but you did not specifiy if you wanted it from dupes
or from mytable
(aliased as t1
).
我认为不明确的列错误 (ORA-00918) 是因为您正在使用select
名称同时出现在表和子查询中的列,但是您没有指定是从dupes
还是从mytable
(别名为t1
)。
回答by Basic
Change the 3 fields in the initial select to be
将初始选择中的 3 个字段更改为
SELECT
t1.entity_id, t1.station_id, t1.obs_year
回答by fredley
Could you not create a new table that includes the unique constraint, and then copy across the data row by row, ignoring failures?
你不能创建一个包含唯一约束的新表,然后逐行复制数据,忽略失败吗?
回答by fredley
You need to specify the table for the columns in the main select. Also, assuming entity_id is the unique key for mytable and is irrelevant to finding duplicates, you should not be grouping on it in the dupes subquery.
您需要为主选择中的列指定表。此外,假设 entity_id 是 mytable 的唯一键并且与查找重复项无关,则不应在 dupes 子查询中对其进行分组。
Try:
尝试:
SELECT t1.entity_id, t1.station_id, t1.obs_year
FROM mytable t1
INNER JOIN (
SELECT station_id, obs_year FROM mytable
GROUP BY station_id, obs_year HAVING COUNT(*) > 1) dupes
ON
t1.station_id = dupes.station_id AND
t1.obs_year = dupes.obs_year
回答by Suresh Nambiar
SELECT *
FROM (
SELECT t.*, ROW_NUMBER() OVER (PARTITION BY station_id, obs_year ORDER BY entity_id) AS rn
FROM mytable t
)
WHERE rn > 1
by Quassnoi is the most efficient for large tables. I had this analysis of cost :
由 Quassnoi 提供,对于大表来说是最有效的。我有这样的成本分析:
SELECT a.dist_code, a.book_date, a.book_no
FROM trn_refil_book a
WHERE EXISTS (SELECT 1 from trn_refil_book b Where
a.dist_code = b.dist_code and a.book_date = b.book_date and a.book_no = b.book_no
AND a.RowId <> b.RowId)
;
gave a cost of 1322341
给出了 1322341 的成本
SELECT a.dist_code, a.book_date, a.book_no
FROM trn_refil_book a
INNER JOIN (
SELECT b.dist_code, b.book_date, b.book_no FROM trn_refil_book b
GROUP BY b.dist_code, b.book_date, b.book_no HAVING COUNT(*) > 1) c
ON
a.dist_code = c.dist_code and a.book_date = c.book_date and a.book_no = c.book_no
;
gave a cost of 1271699
给出了 1271699 的成本
while
尽管
SELECT dist_code, book_date, book_no
FROM (
SELECT t.dist_code, t.book_date, t.book_no, ROW_NUMBER() OVER (PARTITION BY t.book_date, t.book_no
ORDER BY t.dist_code) AS rn
FROM trn_refil_book t
) p
WHERE p.rn > 1
;
gave a cost of 1021984
给出了1021984的成本
The table was not indexed....
该表未编入索引....
回答by grokster
SELECT entity_id, station_id, obs_year
FROM mytable
GROUP BY entity_id, station_id, obs_year
HAVING COUNT(*) > 1
Specify the fields to find duplicates on both the SELECT and the GROUP BY.
指定字段以在 SELECT 和 GROUP BY 上查找重复项。
It works by using GROUP BY
to find any rows that match any other rows based on the specified Columns.
The HAVING COUNT(*) > 1
says that we are only interested in seeing any rows that occur more than 1 time (and are therefore duplicates)
它的工作原理是GROUP BY
根据指定的列查找与任何其他行匹配的任何行。该HAVING COUNT(*) > 1
说,我们只看到发生的1倍以上任何行有兴趣的(因此重复)
回答by Ben Petersen
I thought a lot of the solutions here were cumbersome and tough to understand since I had a 3 column primary key constraint and needed to find the duplicates. So here's an option
我认为这里的很多解决方案都很麻烦且难以理解,因为我有一个 3 列主键约束并且需要找到重复项。所以这里有一个选项
SELECT id, name, value, COUNT(*) FROM db_name.table_name
GROUP BY id, name, value
HAVING COUNT(*) > 1