How to delete duplicates on a MySQL table?
Original URL: http://stackoverflow.com/questions/2630440/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by Ali Demirci
I need to DELETE duplicated rows for a specified sid on a MySQL table.
How can I do this with an SQL query?
DELETE (DUPLICATED TITLES) FROM table WHERE SID = "1"
Something like this, but I don't know how to do it.
Answer by user187291
This removes duplicates in place, without making a new table:
ALTER IGNORE TABLE `table_name` ADD UNIQUE (title, SID)
Note: this only works well if the index fits in memory.
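Before adding the unique index, it can help to see which (title, SID) pairs are actually duplicated. The following is a minimal sketch that assumes the same `table_name`, `title`, and `SID` names used above. Note also that the IGNORE clause of ALTER TABLE was removed in MySQL 5.7.4, so on newer servers the statement above raises an error and one of the other approaches on this page is needed instead.

```sql
-- List each duplicated (title, SID) pair and how many copies of it exist.
SELECT title, SID, COUNT(*) AS copies
FROM `table_name`
GROUP BY title, SID
HAVING COUNT(*) > 1;
```

An empty result means there is nothing to clean up for that pair of columns.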
Answer by Abhijoy_D
Suppose you have a table employee with the following columns:
employee (id, first_name, last_name, start_date)
To delete the rows with a duplicate first_name column, keeping the row with the lowest id for each first_name:
delete
from employee using employee,
employee e1
where employee.id > e1.id
and employee.first_name = e1.first_name
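To preview what the DELETE above would remove, the same condition can be written as a SELECT. This is only a sketch, and it assumes the employee table has an id column, as the query implies.

```sql
-- Rows for which another row has the same first_name and a lower id;
-- these are the rows the DELETE above would remove.
SELECT e.*
FROM employee e
WHERE EXISTS (
    SELECT 1
    FROM employee e1
    WHERE e1.first_name = e.first_name
      AND e1.id < e.id
);
```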
Answer by Kamil Szot
The following removes duplicates for all SIDs, not only a single one.
With temp table:
CREATE TABLE table_temp AS
SELECT * FROM table GROUP BY title, SID;
DROP TABLE table;
RENAME TABLE table_temp TO table;
Since table_temp is freshly created, it has no indexes. You'll need to recreate them after removing duplicates. You can check what indexes you have in the table with SHOW INDEXES IN table
Without temp table:
DELETE FROM `table` WHERE id IN (
SELECT all_duplicates.id FROM (
SELECT id FROM `table` WHERE (`title`, `SID`) IN (
SELECT `title`, `SID` FROM `table` GROUP BY `title`, `SID` having count(*) > 1
)
) AS all_duplicates
LEFT JOIN (
SELECT id FROM `table` GROUP BY `title`, `SID` having count(*) > 1
) AS grouped_duplicates
ON all_duplicates.id = grouped_duplicates.id
WHERE grouped_duplicates.id IS NULL
)
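The original question asks about a single SID. A minimal sketch of that case follows, assuming the table has a unique id column (as in the query above) and that the row with the lowest id in each (title, SID) group is the one to keep. It selects MIN(id) explicitly, so it does not depend on the ONLY_FULL_GROUP_BY SQL mode being disabled. The alias names t, k, and keep_id are illustrative.

```sql
-- Keep the lowest id per (title, SID) group and delete the rest, for SID = 1 only.
DELETE t
FROM `table` AS t
JOIN (
    SELECT title, SID, MIN(id) AS keep_id
    FROM `table`
    WHERE SID = 1
    GROUP BY title, SID
) AS k
  ON t.title = k.title
 AND t.SID = k.SID
WHERE t.id > k.keep_id;
```

Rows whose title is NULL never satisfy the equality in the join, so they are left untouched by this sketch.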
Answer by Eric Leschinski
Deleting duplicate rows in MySQL in place (assuming you have a timestamp column to sort by), walkthrough:
Create the table and insert some rows:
create table penguins(foo int, bar varchar(15), baz datetime);
insert into penguins values(1, 'skipper', now());
insert into penguins values(1, 'skipper', now());
insert into penguins values(3, 'kowalski', now());
insert into penguins values(3, 'kowalski', now());
insert into penguins values(3, 'kowalski', now());
insert into penguins values(4, 'rico', now());
select * from penguins;
+------+----------+---------------------+
| foo  | bar      | baz                 |
+------+----------+---------------------+
|    1 | skipper  | 2014-08-25 14:21:54 |
|    1 | skipper  | 2014-08-25 14:21:59 |
|    3 | kowalski | 2014-08-25 14:22:09 |
|    3 | kowalski | 2014-08-25 14:22:13 |
|    3 | kowalski | 2014-08-25 14:22:15 |
|    4 | rico     | 2014-08-25 14:22:22 |
+------+----------+---------------------+
6 rows in set (0.00 sec)
Remove the duplicates in place:
delete a
from penguins a
left join(
select max(baz) maxtimestamp, foo, bar
from penguins
group by foo, bar) b
on a.baz = maxtimestamp and
a.foo = b.foo and
a.bar = b.bar
where b.maxtimestamp IS NULL;
Query OK, 3 rows affected (0.01 sec)
select * from penguins;
+------+----------+---------------------+
| foo  | bar      | baz                 |
+------+----------+---------------------+
|    1 | skipper  | 2014-08-25 14:21:59 |
|    3 | kowalski | 2014-08-25 14:22:15 |
|    4 | rico     | 2014-08-25 14:22:22 |
+------+----------+---------------------+
3 rows in set (0.00 sec)
You're done: duplicate rows are removed and the last one by timestamp is kept.
For those of you without a timestamp or unique column:
You don't have a timestamp or a unique index column to sort by? You're living in a state of degeneracy. You'll have to do additional steps to delete duplicate rows.
Create the penguins table and add some rows:
create table penguins(foo int, bar varchar(15));
insert into penguins values(1, 'skipper');
insert into penguins values(1, 'skipper');
insert into penguins values(3, 'kowalski');
insert into penguins values(3, 'kowalski');
insert into penguins values(3, 'kowalski');
insert into penguins values(4, 'rico');
select * from penguins;
# +------+----------+
# | foo  | bar      |
# +------+----------+
# |    1 | skipper  |
# |    1 | skipper  |
# |    3 | kowalski |
# |    3 | kowalski |
# |    3 | kowalski |
# |    4 | rico     |
# +------+----------+
Make a clone of the first table and copy the rows into it:
drop table if exists penguins_copy;
create table penguins_copy as ( SELECT foo, bar FROM penguins );
#add an autoincrementing primary key:
ALTER TABLE penguins_copy ADD moo int AUTO_INCREMENT PRIMARY KEY first;
select * from penguins_copy;
# +-----+------+----------+
# | moo | foo  | bar      |
# +-----+------+----------+
# |   1 |    1 | skipper  |
# |   2 |    1 | skipper  |
# |   3 |    3 | kowalski |
# |   4 |    3 | kowalski |
# |   5 |    3 | kowalski |
# |   6 |    4 | rico     |
# +-----+------+----------+
The max aggregate operates upon the new moo index:
delete a from penguins_copy a left join(
select max(moo) myindex, foo, bar
from penguins_copy
group by foo, bar) b
on a.moo = b.myindex and
a.foo = b.foo and
a.bar = b.bar
where b.myindex IS NULL;
#drop the extra column on the copied table
alter table penguins_copy drop moo;
select * from penguins_copy;
#drop the first table and put the copy table back:
drop table penguins;
create table penguins select * from penguins_copy;
Observe and clean up:
drop table penguins_copy;
select * from penguins;
+------+----------+
| foo  | bar      |
+------+----------+
|    1 | skipper  |
|    3 | kowalski |
|    4 | rico     |
+------+----------+
Elapsed: 1458.359 milliseconds
What's that big SQL delete statement doing?
Table penguins with alias 'a' is left joined to a subset of table penguins with alias 'b'. The right-hand table 'b' finds the max timestamp [or max moo] grouped by columns foo and bar. This is matched to the left-hand table 'a'. (foo,bar,baz) on the left has every row in the table. The right-hand subset 'b' has (maxtimestamp,foo,bar), which matches the left only on the row that IS the max.
Every row that is not that max has a maxtimestamp value of NULL. Filter down to those NULL rows and you have the set of all rows, grouped by foo and bar, that don't have the latest timestamp baz. Delete those.
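The explanation above can be checked by running the same join as a SELECT first. This is a sketch against the timestamp version of the penguins table (foo, bar, baz) from the walkthrough. Every row that comes back with a NULL maxtimestamp is a row the DELETE would remove.

```sql
-- Same LEFT JOIN as the DELETE, shown as a SELECT so nothing is modified.
SELECT a.foo, a.bar, a.baz, b.maxtimestamp
FROM penguins a
LEFT JOIN (
    SELECT MAX(baz) AS maxtimestamp, foo, bar
    FROM penguins
    GROUP BY foo, bar
) b
  ON a.baz = b.maxtimestamp
 AND a.foo = b.foo
 AND a.bar = b.bar;
```

With the six sample rows above, three rows have a NULL maxtimestamp, which matches the "3 rows affected" reported by the DELETE. If two rows in a group share the same maximum timestamp, both match and both are kept.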
Make a backup of the table before you run this.
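One simple way to take that backup inside MySQL itself is to copy the table. This is a minimal sketch, and the name penguins_backup is an assumption chosen for illustration.

```sql
-- Copy the table definition (including indexes), then copy the rows.
CREATE TABLE penguins_backup LIKE penguins;
INSERT INTO penguins_backup SELECT * FROM penguins;
```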
Prevent this problem from ever happening again on this table:
If you got this to work and it put out your "duplicate row" fire, great. Now define a new composite unique key on your table (on those two columns) to prevent more duplicates from being added in the first place.
Like a good immune system, the bad rows shouldn't even be allowed into the table at the time of insert. Later on, all those programs adding duplicates will broadcast their protest, and when you fix them, this issue never comes up again.
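For the penguins example, the composite unique key could look like the sketch below. The key name uq_penguins_foo_bar is an assumption chosen for illustration, and the statement only succeeds once the existing duplicates have been removed.

```sql
-- Reject any future row whose (foo, bar) pair already exists in the table.
ALTER TABLE penguins ADD UNIQUE KEY uq_penguins_foo_bar (foo, bar);
```

After this, a plain INSERT that repeats an existing (foo, bar) pair fails with a duplicate-key error instead of silently adding another copy.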
Answer by seaders
After running into this issue myself on a huge database, I wasn't completely impressed with the performance of any of the other answers. I want to keep only the latest duplicate row and delete the rest.
In a one-query statement, without a temp table, this worked best for me:
DELETE e.*
FROM employee e
WHERE id IN
(SELECT id
FROM (SELECT MIN(id) as id
FROM employee e2
GROUP BY first_name, last_name
HAVING COUNT(*) > 1) x);
The only caveat is that I have to run the query multiple times, because each run deletes only one row (the one with the lowest id) from each duplicate group. Even with that, I found it worked better for me than the other options.
Answer by user3649739
This always seems to work for me:
CREATE TABLE NoDupeTable LIKE DupeTable;
INSERT NoDupeTable SELECT * FROM DupeTable group by CommonField1,CommonFieldN;
This keeps the lowest ID from each set of dupes, along with the rest of the non-dupe records.
I've also taken to doing the following so that the dupe issue no longer occurs after the removal:
CREATE TABLE NoDupeTable LIKE DupeTable;
Alter table NoDupeTable Add Unique `Unique` (CommonField1,CommonField2);
INSERT IGNORE NoDupeTable SELECT * FROM DupeTable;
In other words, I create a duplicate of the first table, add a unique index on the fields I don't want duplicates of, and then do an INSERT IGNORE. This has the advantage of not failing, as a normal INSERT would the first time it tried to add a duplicate record based on the two fields; instead it ignores any such records.
Moving forward, it becomes impossible to create any duplicate records based on those two fields.
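The steps above leave the cleaned rows in NoDupeTable. If the cleaned table should take over the original name, one possible follow-up is a table swap, sketched below. The name DupeTable_old is an assumption chosen for illustration.

```sql
-- Swap the tables in a single statement; the original rows stay in DupeTable_old.
RENAME TABLE DupeTable TO DupeTable_old,
             NoDupeTable TO DupeTable;

-- Once the result has been checked, the old copy can be dropped:
-- DROP TABLE DupeTable_old;
```

Keeping the old table around until the result has been verified makes the swap easy to undo.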
Answer by M.B.Miri
The following works for all tables:
CREATE TABLE `noDup` LIKE `Dup` ;
INSERT `noDup` SELECT DISTINCT * FROM `Dup` ;
DROP TABLE `Dup` ;
ALTER TABLE `noDup` RENAME `Dup` ;
Answer by Ted Celestin
Here is a simple answer:
delete a from target_table a left JOIN (select max(id_field) as id_field, field_being_repeated
from target_table GROUP BY field_being_repeated) b
on a.field_being_repeated = b.field_being_repeated
and a.id_field = b.id_field
where b.id_field is null;
Answer by richardhell
This works for me to remove old records:
delete from table where id in
(select min(e.id)
from (select * from table) e
group by column1, column2
having count(*) > 1
);
You can replace min(e.id) with max(e.id) to remove the newest records instead.
Answer by temonehm
delete p from
product p
inner join (
select max(id) as id, url from product
group by url
having count(*) > 1
) unik on unik.url = p.url and unik.id != p.id;