SQL: How to delete duplicate entries?
Original question: http://stackoverflow.com/questions/1746213/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How to delete duplicate entries?
Asked by gjrwebber
I have to add a unique constraint to an existing table. This is fine except that the table has millions of rows already, and many of the rows violate the unique constraint I need to add.
What is the fastest approach to removing the offending rows? I have an SQL statement which finds the duplicates and deletes them, but it is taking forever to run. Is there another way to solve this problem? Maybe backing up the table, then restoring after the constraint is added?
Answered by Tim
Some of these approaches seem a little complicated. I generally do this as follows.
Given a table named table, to make it unique on (field1, field2) while keeping the row with the max field3:
DELETE FROM table USING table alias
WHERE table.field1 = alias.field1 AND table.field2 = alias.field2 AND
table.field3 < alias.field3;
For example, I have a table, user_accounts, and I want to add a unique constraint on email, but I have some duplicates. Say also that I want to keep the most recently created one (max id among duplicates).
DELETE FROM user_accounts USING user_accounts ua2
WHERE user_accounts.email = ua2.email AND user_accounts.id < ua2.id;
Note: USING is not standard SQL; it is a PostgreSQL extension (and a very useful one), and the original question specifically mentions PostgreSQL.
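Once the duplicates are gone, the unique constraint from the question can be added. The following is a minimal sketch that assumes the user_accounts example above. The constraint name user_accounts_email_key is a hypothetical choice, and running both steps in a single transaction is an assumption, not part of the original answer:

```sql
BEGIN;

-- Delete every row for which a newer row (higher id) with the same email exists.
DELETE FROM user_accounts
USING user_accounts ua2
WHERE user_accounts.email = ua2.email
  AND user_accounts.id < ua2.id;

-- With the duplicates removed, the constraint can be created.
-- It would fail if any duplicate email were still present.
-- The constraint name is hypothetical.
ALTER TABLE user_accounts
  ADD CONSTRAINT user_accounts_email_key UNIQUE (email);

COMMIT;
```

Rows whose email is NULL never satisfy the equality in the join condition, so they are left in place. That is consistent with a plain UNIQUE constraint, which by default accepts multiple NULL values in PostgreSQL.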
Answered by just somebody
For example, you could:
CREATE TABLE tmp ...
INSERT INTO tmp SELECT DISTINCT * FROM t;
DROP TABLE t;
ALTER TABLE tmp RENAME TO t;
Answered by Erwin Brandstetter
Instead of creating a new table, you can also re-insert unique rows into the same table after truncating it. Do it all in one transaction. Optionally, you can have the temporary table dropped automatically at the end of the transaction with ON COMMIT DROP. See below.
This approach is only useful where there are lots of rows to delete from all over the table. For just a few duplicates, use a plain DELETE.
You mentioned millions of rows. To make the operation fast, you want to allocate enough temporary buffers for the session. The setting has to be adjusted before any temp buffer is used in your current session. Find out the size of your table:
SELECT pg_size_pretty(pg_relation_size('tbl'));
Set temp_buffers accordingly. Round up generously because the in-memory representation needs a bit more RAM.
SET temp_buffers = '200MB'; -- example value
BEGIN;
-- CREATE TEMPORARY TABLE t_tmp ON COMMIT DROP AS -- drop temp table at commit
CREATE TEMPORARY TABLE t_tmp AS -- retain temp table after commit
SELECT DISTINCT * FROM tbl; -- DISTINCT folds duplicates
TRUNCATE tbl;
INSERT INTO tbl
SELECT * FROM t_tmp;
-- ORDER BY id; -- optionally "cluster" data while being at it.
COMMIT;
This method can be superior to creating a new table if depending objects exist: views, indexes, foreign keys or other objects referencing the table. TRUNCATE makes you begin with a clean slate anyway (new file in the background) and is much faster than DELETE FROM tbl with big tables (DELETE can actually be faster with small tables).
For big tables, it is regularly faster to drop indexes and foreign keys, refill the table and recreate these objects. As far as fk constraints are concerned, you have to be certain the new data is valid, of course, or you'll run into an exception when trying to create the fk.
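The drop, refill and recreate sequence described above might look roughly like this. It is only a sketch under stated assumptions: the index tbl_col_idx, the column names col and other_id, the referenced table other_tbl and the constraint name tbl_other_id_fkey are all hypothetical and need to be replaced with the objects that actually exist on your table.

```sql
BEGIN;

-- Keep one copy of each distinct row; the temp table disappears at COMMIT.
CREATE TEMPORARY TABLE t_tmp ON COMMIT DROP AS
SELECT DISTINCT * FROM tbl;

-- Drop objects that would slow down the bulk insert (hypothetical names).
DROP INDEX tbl_col_idx;
ALTER TABLE tbl DROP CONSTRAINT tbl_other_id_fkey;

-- Empty the table and refill it with the de-duplicated rows.
TRUNCATE tbl;
INSERT INTO tbl
SELECT * FROM t_tmp;

-- Recreate the dropped objects. The foreign key is checked against
-- the refilled data and raises an error if any row is invalid.
CREATE INDEX tbl_col_idx ON tbl (col);
ALTER TABLE tbl
  ADD CONSTRAINT tbl_other_id_fkey
  FOREIGN KEY (other_id) REFERENCES other_tbl (id);

COMMIT;
```

Because everything runs in one transaction, a failure while recreating the foreign key rolls back the whole operation and leaves the original data untouched.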
Note that TRUNCATE requires more aggressive locking than DELETE. This may be an issue for tables with heavy, concurrent load.
If TRUNCATE is not an option, or generally for small to medium tables, there is a similar technique with a data-modifying CTE (Postgres 9.1+):
WITH del AS (DELETE FROM tbl RETURNING *)
INSERT INTO tbl
SELECT DISTINCT * FROM del;
-- ORDER BY id; -- optionally "cluster" data while being at it.
This is slower for big tables, because TRUNCATE is faster there. But it may be faster (and simpler!) for small tables.
If you have no depending objects at all, you might create a new table and delete the old one, but you hardly gain anything over this universal approach.
For very big tables that would not fit into available RAM, creating a new table will be considerably faster. You'll have to weigh this against possible troubles / overhead with depending objects.
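For comparison, the new-table route could be sketched as follows. The name tbl_new is hypothetical, and using LIKE ... INCLUDING ALL to copy the table definition is an assumption made for this illustration, not something the answer prescribes:

```sql
BEGIN;

-- Copy column definitions, defaults, check constraints and indexes
-- from the original table. Foreign keys are not copied by LIKE.
CREATE TABLE tbl_new (LIKE tbl INCLUDING ALL);

-- Fill the new table with one copy of each distinct row.
INSERT INTO tbl_new
SELECT DISTINCT * FROM tbl;

-- Swap the tables. DROP TABLE raises an error if views, foreign keys
-- or other objects still depend on tbl.
DROP TABLE tbl;
ALTER TABLE tbl_new RENAME TO tbl;

COMMIT;
```

This is where the depending objects mentioned above come in: each of them has to be dropped and recreated around the swap. A column default that draws from a sequence owned by the old table may also block the DROP TABLE until the sequence ownership is reassigned.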
Answered by Jan Marek
You can use oid or ctid, which are normally "non-visible" columns in the table:
DELETE FROM table
WHERE ctid NOT IN
(SELECT MAX(s.ctid)
FROM table s
GROUP BY s.column_has_be_distinct);
Answered by shekwi
PostgreSQL window functions are handy for this problem.
DELETE FROM tablename
WHERE id IN (SELECT id
FROM (SELECT id,
row_number() over (partition BY column1, column2, column3 ORDER BY id) AS rnum
FROM tablename) t
WHERE t.rnum > 1);
See Deleting duplicates.
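Before running the DELETE, it can be reassuring to look at the rows it would remove. The query below is a small sketch that reuses the same placeholder names (tablename, id, column1, column2, column3) and the same row_number() ranking; the alias ranked is introduced here for illustration only:

```sql
-- List the rows that the DELETE above would remove:
-- every row except the one with the lowest id in each group.
SELECT *
FROM (SELECT t.*,
             row_number() OVER (PARTITION BY column1, column2, column3
                                ORDER BY id) AS rnum
      FROM tablename t) ranked
WHERE ranked.rnum > 1;
```

Changing ORDER BY id to ORDER BY id DESC in both statements would keep the row with the highest id in each group instead.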
Answered by naXa
Generalized query to delete duplicates:
DELETE FROM table_name
WHERE ctid NOT IN (
SELECT max(ctid) FROM table_name
GROUP BY column1, [column 2, ...]
);
The column ctid is a special column available for every table but not visible unless specifically mentioned. Its value identifies the physical location of a row version, so it is unique for every row in a table at any given moment, but it can change when a row is updated or the table is rewritten (for example by VACUUM FULL).
Answered by Bhavik Ambani
From an old postgresql.org mailing list:
create table test ( a text, b text );
Unique values
insert into test values ( 'x', 'y');
insert into test values ( 'x', 'x');
insert into test values ( 'y', 'y' );
insert into test values ( 'y', 'x' );
Duplicate values
insert into test values ( 'x', 'y');
insert into test values ( 'x', 'x');
insert into test values ( 'y', 'y' );
insert into test values ( 'y', 'x' );
One more double duplicate
insert into test values ( 'x', 'y');
select oid, a, b from test;
Select duplicate rows
select o.oid, o.a, o.b from test o
where exists ( select 'x'
from test i
where i.a = o.a
and i.b = o.b
and i.oid < o.oid
);
Delete duplicate rows
Note: when this was posted, PostgreSQL didn't support aliases on the table mentioned in the from clause of a delete (current versions do).
delete from test
where exists ( select 'x'
from test i
where i.a = test.a
and i.b = test.b
and i.oid < test.oid
);
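The oid system column used above only exists on tables created WITH OIDS, an option that was removed in PostgreSQL 12. On current versions the same idea can be expressed with ctid and a table alias. This is a sketch against the test table from this answer, not the mailing list's original statement:

```sql
-- Delete every row for which an identical row with a lower ctid exists,
-- so the physically first copy in each (a, b) group survives.
DELETE FROM test o
WHERE EXISTS (SELECT 1
              FROM test i
              WHERE i.a = o.a
                AND i.b = o.b
                AND i.ctid < o.ctid);

-- Check the result: each (a, b) combination should now appear once.
SELECT a, b, count(*) FROM test GROUP BY a, b;
```

With the sample data inserted above, four rows remain, one for each distinct (a, b) pair.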
Answered by codebykat
I just used Erwin Brandstetter's answer successfully to remove duplicates in a join table (a table lacking its own primary IDs), but found that there's one important caveat.
Including ON COMMIT DROP means the temporary table will get dropped at the end of the transaction. For me, that meant the temporary table was no longer available by the time I went to insert from it!
I just did CREATE TEMPORARY TABLE t_tmp AS SELECT DISTINCT * FROM tbl; and everything worked fine.
The temporary table does get dropped at the end of the session.
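One plausible explanation for this caveat, assuming the statements were run one by one without an explicit BEGIN: in autocommit mode each statement is its own transaction, so a temporary table created with ON COMMIT DROP vanishes as soon as the CREATE statement itself commits. The sketch below contrasts the two situations, using the tbl and t_tmp names from the answer above:

```sql
-- Autocommit: the CREATE is a whole transaction by itself,
-- so t_tmp is dropped immediately after it is created.
CREATE TEMPORARY TABLE t_tmp ON COMMIT DROP AS
SELECT DISTINCT * FROM tbl;
-- A following "SELECT * FROM t_tmp;" would fail here,
-- because the relation no longer exists.

-- Explicit transaction: t_tmp stays available until COMMIT.
BEGIN;
CREATE TEMPORARY TABLE t_tmp ON COMMIT DROP AS
SELECT DISTINCT * FROM tbl;
TRUNCATE tbl;
INSERT INTO tbl
SELECT * FROM t_tmp;
COMMIT;  -- t_tmp is dropped here
```

Leaving out ON COMMIT DROP, as described above, works in both modes, at the cost of keeping the temporary table around until the session ends.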
Answered by Skippy le Grand Gourou
If you have only one or a few duplicated entries, and they are indeed duplicated (that is, they appear twice), you can use the "hidden" ctid column, as proposed above, together with LIMIT:
DELETE FROM mytable WHERE ctid=(SELECT ctid FROM mytable WHERE […] LIMIT 1);
This will delete only the first of the selected rows.
Answered by Secko
DELETE FROM table
WHERE something NOT IN
(SELECT MAX(s.something)
FROM table As s
GROUP BY s.this_thing, s.that_thing);