SQL:如何根据两个字段查找重复项?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/3504012/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-01 07:14:33  来源:igfitidea点击:

SQL: How to find duplicates based on two fields?

sqloracleuniqueunique-constraintora-00918

提问by James Adams

I have rows in an Oracle database table which should be unique for a combination of two fields but the unique constrain is not set up on the table so I need to find all rows which violate the constraint myself using SQL. Unfortunately my meager SQL skills aren't up to the task.

我在 Oracle 数据库表中有行,它对于两个字段的组合应该是唯一的,但是表上没有设置唯一约束,所以我需要自己使用 SQL 找到所有违反约束的行。不幸的是,我微薄的 SQL 技能无法胜任这项任务。

My table has three columns which are relevant: entity_id, station_id, and obs_year. For each row the combination of station_id and obs_year should be unique, and I want to find out if there are rows which violate this by flushing them out with an SQL query.

我的表有三列相关:entity_id、station_id 和 obs_year。对于每一行,station_id 和 obs_year 的组合应该是唯一的,我想通过使用 SQL 查询将它们刷新出来来找出是否有违反这一点的行。

I have tried the following SQL (suggested by this previous question) but it doesn't work for me (I get ORA-00918 column ambiguously defined):

我已经尝试了以下 SQL(由上一个问题建议),但它对我不起作用(我得到了含糊不清的 ORA-00918 列):

SELECT
entity_id, station_id, obs_year
FROM
mytable t1
INNER JOIN (
SELECT entity_id, station_id, obs_year FROM mytable 
GROUP BY entity_id, station_id, obs_year HAVING COUNT(*) > 1) dupes 
ON 
t1.station_id = dupes.station_id AND
t1.obs_year = dupes.obs_year

Can someone suggest what I'm doing wrong, and/or how to solve this?

有人可以建议我做错了什么,和/或如何解决这个问题?

回答by Quassnoi

SELECT  *
FROM    (
        SELECT  t.*, ROW_NUMBER() OVER (PARTITION BY station_id, obs_year ORDER BY entity_id) AS rn
        FROM    mytable t
        )
WHERE   rn > 1

回答by Michael Pakhantsov

SELECT entity_id, station_id, obs_year
FROM mytable t1
WHERE EXISTS (SELECT 1 from mytable t2 Where
       t1.station_id = t2.station_id
       AND t1.obs_year = t2.obs_year
       AND t1.RowId <> t2.RowId)

回答by FrustratedWithFormsDesigner

Re-write of your query

重写您的查询

SELECT
t1.entity_id, t1.station_id, t1.obs_year
FROM
mytable t1
INNER JOIN (
SELECT entity_id, station_id, obs_year FROM mytable 
GROUP BY entity_id, station_id, obs_year HAVING COUNT(*) > 1) dupes 
ON 
t1.station_id = dupes.station_id AND
t1.obs_year = dupes.obs_year

I think the ambiguous column error (ORA-00918) was because you were selecting columns whose names appeared in both the table and the subquery, but you did not specifiy if you wanted it from dupesor from mytable(aliased as t1).

我认为不明确的列错误 (ORA-00918) 是因为您正在使用select名称同时出现在表和子查询中的列,但是您没有指定是从dupes还是从mytable(别名为t1)。

回答by Basic

Change the 3 fields in the initial select to be

将初始选择中的 3 个字段更改为

SELECT
t1.entity_id, t1.station_id, t1.obs_year

回答by fredley

Could you not create a new table that includes the unique constraint, and then copy across the data row by row, ignoring failures?

你不能创建一个包含唯一约束的新表,然后逐行复制数据,忽略失败吗?

回答by fredley

You need to specify the table for the columns in the main select. Also, assuming entity_id is the unique key for mytable and is irrelevant to finding duplicates, you should not be grouping on it in the dupes subquery.

您需要为主选择中的列指定表。此外,假设 entity_id 是 mytable 的唯一键并且与查找重复项无关,则不应在 dupes 子查询中对其进行分组。

Try:

尝试:

SELECT t1.entity_id, t1.station_id, t1.obs_year
FROM mytable t1
INNER JOIN (
SELECT station_id, obs_year FROM mytable 
GROUP BY station_id, obs_year HAVING COUNT(*) > 1) dupes 
ON 
t1.station_id = dupes.station_id AND
t1.obs_year = dupes.obs_year

回答by Suresh Nambiar

SELECT  *
FROM    (
        SELECT  t.*, ROW_NUMBER() OVER (PARTITION BY station_id, obs_year ORDER BY entity_id) AS rn
        FROM    mytable t
        )
WHERE   rn > 1

by Quassnoi is the most efficient for large tables. I had this analysis of cost :

由 Quassnoi 提供,对于大表来说是最有效的。我有这样的成本分析:

SELECT a.dist_code, a.book_date, a.book_no
FROM trn_refil_book a
WHERE EXISTS (SELECT 1 from trn_refil_book b Where
       a.dist_code = b.dist_code and a.book_date = b.book_date and a.book_no = b.book_no
       AND a.RowId <> b.RowId)
       ;

gave a cost of 1322341

给出了 1322341 的成本

SELECT a.dist_code, a.book_date, a.book_no
FROM trn_refil_book a
INNER JOIN (
SELECT b.dist_code, b.book_date, b.book_no FROM trn_refil_book b 
GROUP BY b.dist_code, b.book_date, b.book_no HAVING COUNT(*) > 1) c 
ON 
 a.dist_code = c.dist_code and a.book_date = c.book_date and a.book_no = c.book_no
;

gave a cost of 1271699

给出了 1271699 的成本

while

尽管

SELECT  dist_code, book_date, book_no
FROM    (
        SELECT  t.dist_code, t.book_date, t.book_no, ROW_NUMBER() OVER (PARTITION BY t.book_date, t.book_no
          ORDER BY t.dist_code) AS rn
        FROM    trn_refil_book t
        ) p
WHERE   p.rn > 1
;

gave a cost of 1021984

给出了1021984的成本

The table was not indexed....

该表未编入索引....

回答by grokster

  SELECT entity_id, station_id, obs_year
    FROM mytable
GROUP BY entity_id, station_id, obs_year
HAVING COUNT(*) > 1

Specify the fields to find duplicates on both the SELECT and the GROUP BY.

指定字段以在 SELECT 和 GROUP BY 上查找重复项。

It works by using GROUP BYto find any rows that match any other rows based on the specified Columns. The HAVING COUNT(*) > 1says that we are only interested in seeing any rows that occur more than 1 time (and are therefore duplicates)

它的工作原理是GROUP BY根据指定的列查找与任何其他行匹配的任何行。该HAVING COUNT(*) > 1说,我们只看到发生的1倍以上任何行有兴趣的(因此重复)

回答by Ben Petersen

I thought a lot of the solutions here were cumbersome and tough to understand since I had a 3 column primary key constraint and needed to find the duplicates. So here's an option

我认为这里的很多解决方案都很麻烦且难以理解,因为我有一个 3 列主键约束并且需要找到重复项。所以这里有一个选项

SELECT id, name, value, COUNT(*) FROM db_name.table_name
GROUP BY id, name, value
HAVING COUNT(*) > 1