SQL: How to delete duplicate entries?

Disclaimer: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/1746213/

How to delete duplicate entries?

Tags: sql, postgresql, duplicate-removal, unique-constraint, sql-delete

Asked by gjrwebber

I have to add a unique constraint to an existing table. This is fine except that the table has millions of rows already, and many of the rows violate the unique constraint I need to add.

What is the fastest approach to removing the offending rows? I have an SQL statement which finds the duplicates and deletes them, but it is taking forever to run. Is there another way to solve this problem? Maybe backing up the table, then restoring after the constraint is added?

Answered by Tim

Some of these approaches seem a little complicated, and I generally do this as:

Given a table table, we want to make it unique on (field1, field2), keeping the row with the max field3:

DELETE FROM table USING table alias
  WHERE table.field1 = alias.field1 AND table.field2 = alias.field2
    AND table.field3 < alias.field3;

For example, I have a table, user_accounts, and I want to add a unique constraint on email, but I have some duplicates. Say also that I want to keep the most recently created one (max id among duplicates).

DELETE FROM user_accounts USING user_accounts ua2
  WHERE user_accounts.email = ua2.email AND user_accounts.id < ua2.id;
  • Note - USING is not standard SQL, it is a PostgreSQL extension (but a very useful one), but the original question specifically mentions PostgreSQL.
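
Once the duplicates are gone, the unique constraint from the original question can be added. A minimal sketch, reusing the user_accounts/email example above (the constraint name is an arbitrary choice, not part of the original answer):

-- This will fail if any duplicate emails remain, so run it after (or in the
-- same transaction as) the deduplication DELETE.
ALTER TABLE user_accounts
  ADD CONSTRAINT user_accounts_email_key UNIQUE (email);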

Answered by just somebody

For example you could:

CREATE TABLE tmp ...
INSERT INTO tmp SELECT DISTINCT * FROM t;
DROP TABLE t;
ALTER TABLE tmp RENAME TO t;
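
A caveat worth noting with this route (not part of the original answer): the rebuilt table starts without the old table's indexes, constraints, defaults and triggers, so they have to be recreated before the table goes back into use. A minimal sketch with hypothetical names, assuming the goal is the unique constraint from the question:

-- Hypothetical index and constraint names; adjust to the real schema.
CREATE INDEX t_field1_idx ON t (field1);
ALTER TABLE t ADD CONSTRAINT t_field1_field2_key UNIQUE (field1, field2);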

Answered by Erwin Brandstetter

Instead of creating a new table, you can also re-insert unique rows into the same table after truncating it. Do it all in one transaction. Optionally, you can drop the temporary table at the end of the transaction automatically with ON COMMIT DROP. See below.

This approach is only useful where there are lots of rows to delete from all over the table. For just a few duplicates, use a plain DELETE.

You mentioned millions of rows. To make the operation fast, you want to allocate enough temporary buffers for the session. The setting has to be adjusted before any temp buffer is used in your current session. Find out the size of your table:

SELECT pg_size_pretty(pg_relation_size('tbl'));

Set temp_buffers accordingly. Round up generously because in-memory representation needs a bit more RAM.

SET temp_buffers = 200MB;    -- example value

BEGIN;

-- CREATE TEMPORARY TABLE t_tmp ON COMMIT DROP AS -- drop temp table at commit
CREATE TEMPORARY TABLE t_tmp AS  -- retain temp table after commit
SELECT DISTINCT * FROM tbl;  -- DISTINCT folds duplicates

TRUNCATE tbl;

INSERT INTO tbl
SELECT * FROM t_tmp;
-- ORDER BY id; -- optionally "cluster" data while being at it.

COMMIT;

This method can be superior to creating a new table if depending objects exist: views, indexes, foreign keys or other objects referencing the table. TRUNCATE makes you begin with a clean slate anyway (new file in the background) and is much faster than DELETE FROM tbl with big tables (DELETE can actually be faster with small tables).

For big tables, it is regularly faster to drop indexes and foreign keys, refill the table and recreate these objects. As far as FK constraints are concerned, you have to be certain the new data is valid, of course, or you'll run into an exception when trying to create the FK.

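A hedged sketch of that drop-and-recreate sequence, with hypothetical index, constraint and column names:

-- Drop the objects that would slow down the bulk refill...
ALTER TABLE tbl DROP CONSTRAINT tbl_other_id_fkey;
DROP INDEX tbl_some_col_idx;

-- ...refill the table (TRUNCATE + INSERT as above)...

-- ...then recreate them; creating the FK re-validates every row.
CREATE INDEX tbl_some_col_idx ON tbl (some_col);
ALTER TABLE tbl
  ADD CONSTRAINT tbl_other_id_fkey
  FOREIGN KEY (other_id) REFERENCES other_tbl (id);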

Note that TRUNCATE requires more aggressive locking than DELETE. This may be an issue for tables with heavy, concurrent load.

If TRUNCATE is not an option, or generally for small to medium tables, there is a similar technique with a data-modifying CTE (Postgres 9.1+):

WITH del AS (DELETE FROM tbl RETURNING *)
INSERT INTO tbl
SELECT DISTINCT * FROM del;
-- ORDER BY id; -- optionally "cluster" data while being at it.

Slower for big tables, because TRUNCATE is faster there. But may be faster (and simpler!) for small tables.

If you have no depending objects at all, you might create a new table and delete the old one, but you hardly gain anything over this universal approach.

For very big tables that would not fit into available RAM, creating a new table will be considerably faster. You'll have to weigh this against possible troubles / overhead with depending objects.

Answered by Jan Marek

You can use oid or ctid, which are normally "non-visible" columns in the table:

DELETE FROM table
 WHERE ctid NOT IN
  (SELECT MAX(s.ctid)
    FROM table s
    GROUP BY s.column_has_be_distinct);

Answered by shekwi

The PostgreSQL window function is handy for this problem.

DELETE FROM tablename
WHERE id IN (SELECT id
              FROM (SELECT id,
                             row_number() over (partition BY column1, column2, column3 ORDER BY id) AS rnum
                     FROM tablename) t
              WHERE t.rnum > 1);

See Deleting duplicates.

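If the table has no single id column to order by, a variant of the same idea can key on the system column ctid instead. This is an assumption layered on the answer above, not part of it; without an ORDER BY in the window definition, an arbitrary row of each duplicate group is kept:

-- row_number() numbers the rows of each (column1, column2, column3) group;
-- every row numbered 2 or higher is a duplicate and gets deleted.
DELETE FROM tablename
WHERE ctid IN (SELECT ctid
               FROM (SELECT ctid,
                            row_number() OVER (PARTITION BY column1, column2, column3) AS rnum
                     FROM tablename) t
               WHERE t.rnum > 1);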

Answered by naXa

Generalized query to delete duplicates:

DELETE FROM table_name
WHERE ctid NOT IN (
  SELECT max(ctid) FROM table_name
  GROUP BY column1, [column 2, ...]
);

The column ctid is a special column available for every table but not visible unless specifically mentioned. The ctid column value is considered unique for every row in a table.

Answered by Bhavik Ambani

From an old postgresql.org mailing list:

create table test ( a text, b text );

Unique values

insert into test values ( 'x', 'y');
insert into test values ( 'x', 'x');
insert into test values ( 'y', 'y' );
insert into test values ( 'y', 'x' );

Duplicate values

insert into test values ( 'x', 'y');
insert into test values ( 'x', 'x');
insert into test values ( 'y', 'y' );
insert into test values ( 'y', 'x' );

One more double duplicate

insert into test values ( 'x', 'y');

select oid, a, b from test;

Select duplicate rows

select o.oid, o.a, o.b from test o
    where exists ( select 'x'
                   from test i
                   where     i.a = o.a
                         and i.b = o.b
                         and i.oid < o.oid
                 );

Delete duplicate rows

Note: PostgreSQL doesn't support aliases on the table mentioned in the FROM clause of a delete. Also note that this example relies on the oid system column, which user tables only have if created WITH OIDS (support for which was removed in PostgreSQL 12); on current versions, ctid can play the same role.

delete from test
    where exists ( select 'x'
                   from test i
                   where     i.a = test.a
                         and i.b = test.b
                         and i.oid < test.oid
             );

Answered by codebykat

I just used Erwin Brandstetter's answer successfully to remove duplicates in a join table (a table lacking its own primary IDs), but found that there's one important caveat.

Including ON COMMIT DROP means the temporary table will get dropped at the end of the transaction. For me, that meant the temporary table was no longer available by the time I went to insert it!

I just did CREATE TEMPORARY TABLE t_tmp AS SELECT DISTINCT * FROM tbl; and everything worked fine.

The temporary table does get dropped at the end of the session.

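A minimal sketch of that variant (an interpretation of the answer above, not verbatim from it): without ON COMMIT DROP the temp table survives the commit, so it can be dropped explicitly once the refill has been committed, rather than waiting for the session to end.

BEGIN;

CREATE TEMPORARY TABLE t_tmp AS   -- no ON COMMIT DROP
SELECT DISTINCT * FROM tbl;

TRUNCATE tbl;

INSERT INTO tbl
SELECT * FROM t_tmp;

COMMIT;

DROP TABLE t_tmp;  -- optional: clean up now instead of at session end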

Answered by Skippy le Grand Gourou

If you have only one or a few duplicated entries, and they are indeed duplicated (that is, they appear twice), you can use the "hidden" ctid column, as proposed above, together with LIMIT:

DELETE FROM mytable WHERE ctid=(SELECT ctid FROM mytable WHERE […] LIMIT 1);

This will delete only the first of the selected rows.

Answered by Secko

DELETE FROM table
  WHERE something NOT IN
    (SELECT MAX(s.something)
       FROM table AS s
      GROUP BY s.this_thing, s.that_thing);