SQL: How to delete duplicate rows without a unique identifier

Disclaimer: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): Stack Overflow, original URL: http://stackoverflow.com/questions/26769454/

Date: 2020-09-01 02:55:05  Source: igfitidea

How to delete duplicate rows without unique identifier

Tags: sql, database, postgresql, duplicates, netezza

Asked by moe

I have duplicate rows in my table and I want to delete duplicates in the most efficient way since the table is big. After some research, I have come up with this query:


WITH TempEmp AS
(
SELECT name, ROW_NUMBER() OVER(PARTITION by name, address, zipcode ORDER BY name) AS duplicateRecCount
FROM mytable
)
-- Now Delete Duplicate Records
DELETE FROM TempEmp
WHERE duplicateRecCount > 1;

But it only works in SQL, not in Netezza. It would seem that it does not like the DELETE after the WITH clause?


Accepted answer by isapir

I like @erwin-brandstetter's solution, but wanted to show a solution with the USING keyword:


DELETE   FROM table_with_dups T1
  USING       table_with_dups T2
WHERE  T1.ctid    < T2.ctid       -- delete the "older" ones
  AND  T1.name    = T2.name       -- list columns that define duplicates
  AND  T1.address = T2.address
  AND  T1.zipcode = T2.zipcode;

If you want to review the records before deleting them, then simply replace DELETE with SELECT * and USING with a comma (,), i.e.


SELECT * FROM table_with_dups T1
  ,           table_with_dups T2
WHERE  T1.ctid    < T2.ctid       -- select the "older" ones
  AND  T1.name    = T2.name       -- list columns that define duplicates
  AND  T1.address = T2.address
  AND  T1.zipcode = T2.zipcode;

Update: I tested some of the different solutions here for speed. If you don't expect many duplicates, then this solution performs much better than the ones that have a NOT IN (...) clause, as those generate a lot of rows in the subquery.


If you rewrite the query to use IN (...) then it performs similarly to the solution presented here, but the SQL code becomes much less concise.


Update 2: If you have NULL values in one of the key columns (which you really shouldn't, IMO), then you can use COALESCE() in the condition for that column, e.g.


  AND COALESCE(T1.col_with_nulls, '[NULL]') = COALESCE(T2.col_with_nulls, '[NULL]')
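A quick way to see the NULL pitfall without a Postgres instance is Python's built-in sqlite3 module, where rowid plays the role of ctid; the table and column names below are made up for illustration. This sketch shows the first delete skipping the NULL pair because NULL = NULL is not true, and the COALESCE variant catching it:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (name TEXT, col_with_nulls TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)",
                [("a", None), ("a", None), ("b", "x"), ("b", "x")])

# Plain equality: NULL = NULL is unknown, so the NULL pair is never matched.
con.execute("""DELETE FROM t
               WHERE EXISTS (SELECT 1 FROM t t2
                             WHERE t2.name = t.name
                               AND t2.col_with_nulls = t.col_with_nulls
                               AND t2.rowid < t.rowid)""")
count_after_plain = con.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(count_after_plain)  # 3 -- the ('a', NULL) duplicate survived

# COALESCE maps NULL to a sentinel, so the remaining duplicate compares equal.
con.execute("""DELETE FROM t
               WHERE EXISTS (SELECT 1 FROM t t2
                             WHERE t2.name = t.name
                               AND COALESCE(t2.col_with_nulls, '[NULL]')
                                 = COALESCE(t.col_with_nulls, '[NULL]')
                               AND t2.rowid < t.rowid)""")
print(con.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # 2
```

Note that the sentinel value '[NULL]' must not collide with a real value in the column, or distinct rows would be treated as duplicates.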

Answered by Gordon Linoff

If you have no other unique identifier, you can use ctid:


delete from mytable
    where exists (select 1
                  from mytable t2
                  where t2.name = mytable.name and
                        t2.address = mytable.address and
                        t2.zip = mytable.zip and
                        t2.ctid > mytable.ctid
                 );
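The same correlated-EXISTS shape can be sanity-checked with Python's built-in sqlite3, whose rowid column stands in for ctid here (the sample data is made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mytable (name TEXT, address TEXT, zip TEXT)")
con.executemany("INSERT INTO mytable VALUES (?, ?, ?)",
                [("James", "main street", "123"),
                 ("James", "main street", "123"),
                 ("Alice", "union square", "456")])

# Delete any row for which a duplicate with a larger rowid exists,
# keeping exactly one copy per (name, address, zip) group.
con.execute("""DELETE FROM mytable
               WHERE EXISTS (SELECT 1 FROM mytable t2
                             WHERE t2.name = mytable.name
                               AND t2.address = mytable.address
                               AND t2.zip = mytable.zip
                               AND t2.rowid > mytable.rowid)""")
rows = con.execute("SELECT name FROM mytable ORDER BY name").fetchall()
print(rows)  # [('Alice',), ('James',)]
```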

It is a good idea to have a unique, auto-incrementing id in every table. Doing a DELETE like this is one important reason why.


Answered by Erwin Brandstetter

In a perfect world, every table has a unique identifier of some sort.
In the absence of any unique column (or combination thereof), use the ctid column:


DELETE FROM tbl
WHERE  ctid NOT IN (
   SELECT min(ctid)                    -- ctid is NOT NULL by definition
   FROM   tbl
   GROUP  BY name, address, zipcode);  -- list columns defining duplicates
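This NOT IN pattern translates directly to a runnable sketch in Python's built-in sqlite3, with rowid standing in for ctid (sample data made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tbl (name TEXT, address TEXT, zipcode TEXT)")
con.executemany("INSERT INTO tbl VALUES (?, ?, ?)",
                [("James", "main street", "123"),
                 ("James", "main street", "123"),
                 ("James", "void street", "456"),
                 ("Alice", "union square", "123")])

# Keep the lowest rowid per duplicate group, delete everything else.
con.execute("""DELETE FROM tbl
               WHERE rowid NOT IN (SELECT MIN(rowid)
                                   FROM tbl
                                   GROUP BY name, address, zipcode)""")
print(con.execute("SELECT COUNT(*) FROM tbl").fetchone()[0])  # 3
```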

The above query is short, conveniently listing the column names only once. NOT IN (SELECT ...) is a tricky query style when NULL values can be involved, but the system column ctid is never NULL.


Using EXISTS as demonstrated by @Gordon is typically faster. So is a self-join with the USING clause like the one @isapir added later. Both should result in the same query plan.


But note an important difference: these other queries treat NULL values as not equal, while GROUP BY (or DISTINCT or DISTINCT ON ()) treats NULL values as equal. This does not matter if the key columns are defined NOT NULL. Otherwise, depending on your definition of "duplicate", you'll need one approach or the other. Or use IS NOT DISTINCT FROM to compare values (which may not be able to use some indexes).


Disclaimer:


ctid is an internal implementation detail of Postgres. It's not in the SQL standard and can change between major versions without warning (even if that's very unlikely). Its values can change between commands due to background processes or concurrent write operations (but not within the same command).



Aside:


The target of a DELETE statement cannot be the CTE, only the underlying table. That's a spillover from SQL Server, as is your whole approach.


Answered by Bruno Calza

Here is what I came up with, using a GROUP BY:


DELETE FROM mytable
WHERE id NOT in (
  SELECT MIN(id) 
  FROM mytable
  GROUP BY name, address, zipcode
);

It deletes the duplicates, preserving the oldest record that has duplicates.


Answered by Vivek S.

We can use a window function for very effective removal of duplicate rows:


DELETE FROM tab 
  WHERE id IN (SELECT id 
                  FROM (SELECT row_number() OVER (PARTITION BY column_with_duplicate_values), id 
                           FROM tab) x 
                 WHERE x.row_number > 1);
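This window-function delete can be tried out with Python's built-in sqlite3 (it needs SQLite 3.25+ for window functions; table and column names are made up). One portability note: the inner column gets an explicit alias rn, because SQLite does not name the output column "row_number" automatically the way Postgres does:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tab (id INTEGER PRIMARY KEY, dup_col TEXT)")
con.executemany("INSERT INTO tab (dup_col) VALUES (?)",
                [("a",), ("a",), ("a",), ("b",)])

# Number the rows within each duplicate group, delete all but the first.
con.execute("""DELETE FROM tab
               WHERE id IN (SELECT id
                            FROM (SELECT id,
                                         row_number() OVER (PARTITION BY dup_col) AS rn
                                  FROM tab) x
                            WHERE x.rn > 1)""")
print(con.execute("SELECT COUNT(*) FROM tab").fetchone()[0])  # 2
```

Without an ORDER BY in the OVER () clause, which row of each group survives is unspecified; add one (e.g. ORDER BY id) if it matters.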

A PostgreSQL-optimized version (with ctid):


DELETE FROM tab 
  WHERE ctid = ANY(ARRAY(SELECT ctid 
                  FROM (SELECT row_number() OVER (PARTITION BY column_with_duplicate_values), ctid 
                           FROM tab) x 
                 WHERE x.row_number > 1));

Answered by Joe Murray

The valid syntax is specified at http://www.postgresql.org/docs/current/static/sql-delete.html


I would ALTER your table to add a unique auto-incrementing primary key id so that you can run a query like the following, which will keep the first of each set of duplicates (i.e. the one with the lowest id). Note that adding the key is a bit more complicated in Postgres than in some other DBs.


DELETE FROM mytable d USING (
  SELECT min(id) AS id, name, address, zip
  FROM mytable
  GROUP BY name, address, zip
  HAVING COUNT(*) > 1
) AS k
WHERE d.id <> k.id
  AND d.name = k.name
  AND d.address = k.address
  AND d.zip = k.zip;

Answered by Aditya Nathireddy

If you want to keep one row out of each set of duplicate rows in the table:


CREATE TABLE some_name_for_new_table AS
SELECT * FROM (
    SELECT *, row_number() OVER (PARTITION BY pk_id) AS row_n
    FROM your_table_name_where_duplicates_are_present
) a
WHERE row_n = 1;

This will create a table which you can copy.


Before copying the table, please delete the column 'row_n'.
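Alternatively, listing the wanted columns explicitly in the outer SELECT keeps the helper column out of the new table entirely, so there is nothing to delete afterwards. A sketch with Python's built-in sqlite3 (needs SQLite 3.25+ for window functions; all names are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE src (pk_id TEXT, val TEXT)")
con.executemany("INSERT INTO src VALUES (?, ?)",
                [("k1", "x"), ("k1", "y"), ("k2", "z")])

# Naming the columns in the outer query leaves row_n behind.
con.execute("""CREATE TABLE deduped AS
               SELECT pk_id, val
               FROM (SELECT *, row_number() OVER (PARTITION BY pk_id) AS row_n
                     FROM src) a
               WHERE row_n = 1""")
cols = [d[0] for d in con.execute("SELECT * FROM deduped").description]
print(cols)  # ['pk_id', 'val'] -- no row_n column
print(con.execute("SELECT COUNT(*) FROM deduped").fetchone()[0])  # 2
```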


Answered by Chad Crowe

From the documentation on deleting duplicate rows:


A frequent question in IRC is how to delete rows that are duplicates over a set of columns, keeping only the one with the lowest ID. This query does that for all rows of tablename having the same column1, column2, and column3.


DELETE FROM tablename
WHERE id IN (SELECT id
             FROM (SELECT id,
                          ROW_NUMBER() OVER (PARTITION BY column1, column2, column3 ORDER BY id) AS rnum
                   FROM tablename) t
             WHERE t.rnum > 1);

Sometimes a timestamp field is used instead of an ID field.


Answered by wildplasser

If you want a unique identifier for every row, you could just add one (a serial, or a guid), and treat it like a surrogate key.




CREATE TABLE thenames
        ( name text not null
        , address text not null
        , zipcode text not null
        );
INSERT INTO thenames(name,address,zipcode) VALUES
('James', 'main street', '123' )
,('James', 'main street', '123' )
,('James', 'void street', '456')
,('Alice', 'union square' , '123')
        ;

SELECT * FROM thenames;


        -- add a surrogate key
ALTER TABLE thenames
        ADD COLUMN seq serial NOT NULL PRIMARY KEY
        ;
SELECT * FROM thenames;

DELETE FROM thenames del
WHERE EXISTS(
        SELECT * FROM thenames x
        WHERE x.name=del.name
        AND x.address=del.address
        AND x.zipcode=del.zipcode
        AND x.seq < del.seq
        );

        -- add the unique constraint, so that new duplicates cannot be created in the future
ALTER TABLE thenames
        ADD UNIQUE (name,address,zipcode)
        ;

SELECT * FROM thenames;