Remove duplicate rows in MySQL

Disclaimer: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA terms and attribute it to the original authors (not me), linking the original: http://stackoverflow.com/questions/3311903/


Remove duplicate rows in MySQL

mysql, sql, duplicates

Question by Chetan

I have a table with the following fields:


id (Unique)
url (Unique)
title
company
site_id

Now, I need to remove rows having the same title, company and site_id. One way to do it would be to use the following SQL along with a script (PHP):


SELECT title, site_id, location, id, count( * ) 
FROM jobs
GROUP BY site_id, company, title, location
HAVING count( * ) >1

After running this query, I can remove duplicates using a server-side script.


But I want to know if this can be done using only a SQL query.


Answer by Chris Henry

A really easy way to do this is to add a UNIQUE index on the 3 columns. When you write the ALTER statement, include the IGNORE keyword. Like so:


ALTER IGNORE TABLE jobs
ADD UNIQUE INDEX idx_name (site_id, title, company);

This will drop all the duplicate rows. As an added benefit, future INSERTs that are duplicates will error out. As always, you may want to take a backup before running something like this... Note that the IGNORE clause of ALTER TABLE was removed in MySQL 5.7, so this approach only applies to older versions.

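Before running the ALTER, you may want to preview what the unique index would collapse. A minimal sketch against the question's jobs table (it just reuses the GROUP BY columns from the index above):

-- Each group returned here would be reduced to a single row by the ALTER IGNORE above
SELECT site_id, title, company, COUNT(*) AS copies
FROM jobs
GROUP BY site_id, title, company
HAVING COUNT(*) > 1;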

Answer by rehriff

If you don't want to alter the column properties, then you can use the query below.


Since you have a column with unique IDs (e.g., an auto_increment column), you can use it to remove the duplicates:


DELETE `a`
FROM
    `jobs` AS `a`,
    `jobs` AS `b`
WHERE
    -- IMPORTANT: Ensures one version remains
    -- Change "ID" to your unique column's name
    `a`.`ID` < `b`.`ID`

    -- Any duplicates you want to check for
    AND (`a`.`title` = `b`.`title` OR `a`.`title` IS NULL AND `b`.`title` IS NULL)
    AND (`a`.`company` = `b`.`company` OR `a`.`company` IS NULL AND `b`.`company` IS NULL)
    AND (`a`.`site_id` = `b`.`site_id` OR `a`.`site_id` IS NULL AND `b`.`site_id` IS NULL);

In MySQL, you can simplify it even more with the NULL-safe equal operator (a.k.a. the "spaceship operator"):


DELETE `a`
FROM
    `jobs` AS `a`,
    `jobs` AS `b`
WHERE
    -- IMPORTANT: Ensures one version remains
    -- Change "ID" to your unique column's name
    `a`.`ID` < `b`.`ID`

    -- Any duplicates you want to check for
    AND `a`.`title` <=> `b`.`title`
    AND `a`.`company` <=> `b`.`company`
    AND `a`.`site_id` <=> `b`.`site_id`;

Answer by Andomar

MySQL has restrictions about referring to the table you are deleting from. You can work around that with a temporary table, like:


create temporary table tmpTable (id int);

insert  into tmpTable
        (id)
select  id
from    YourTable yt
where   exists
        (
        select  *
        from    YourTable yt2
        where   yt2.title = yt.title
                and yt2.company = yt.company
                and yt2.site_id = yt.site_id
                and yt2.id > yt.id
        );

delete  
from    YourTable
where   ID in (select id from tmpTable);

From Kostanos' suggestion in the comments:
The only slow query above is the DELETE, for cases where you have a very large database. This query could be faster:


DELETE FROM YourTable USING YourTable, tmpTable WHERE YourTable.id=tmpTable.id
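
For large tables, a small optional addition (a sketch, not part of the original suggestion): index the temporary table's id before running the DELETE, and drop the temporary table afterwards.

-- helps the join in the DELETE ... USING statement above
ALTER TABLE tmpTable ADD INDEX (id);

-- once the DELETE has finished
DROP TEMPORARY TABLE tmpTable;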

Answer by Kamil

If the IGNORE statement won't work, as it didn't in my case, you can use the statements below:


CREATE TABLE your_table_deduped LIKE your_table;


-- Note: SELECT * with GROUP BY requires the ONLY_FULL_GROUP_BY sql_mode to be disabled
INSERT INTO your_table_deduped
SELECT *
FROM your_table
GROUP BY index1_id,
         index2_id;

RENAME TABLE your_table TO your_table_with_dupes;

RENAME TABLE your_table_deduped TO your_table;

#OPTIONAL
ALTER TABLE `your_table` ADD UNIQUE `unique_index` (`index1_id`, `index2_id`);

#OPTIONAL
DROP TABLE your_table_with_dupes;

Answer by César Revert-Gomar

Deleting duplicates on MySQL tables is a common issue, and it's generally the result of a missing constraint that would have avoided those duplicates beforehand. But this common issue usually comes with specific needs that require specific approaches. The right approach depends on, for example, the size of the data, which duplicated entry should be kept (generally the first or the last one), whether there are indexes to be kept, or whether we want to perform any additional action on the duplicated data.


There are also some specificities of MySQL itself, such as not being able to reference the table being modified in a FROM clause of the same UPDATE or DELETE (it raises MySQL error #1093). This limitation can be overcome by using an inner query with a temporary table (as suggested in some approaches above). But this inner query won't perform especially well when dealing with big data sources.

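For reference, a minimal sketch of that inner-query workaround applied to the question's jobs table (the derived table forces MySQL to materialize the ids first, which avoids error #1093; this is the pattern the approach below tries to improve on):

DELETE FROM jobs
WHERE id NOT IN (
    SELECT keep_id FROM (
        -- keep the lowest id of every (site_id, title, company) group
        SELECT MIN(id) AS keep_id
        FROM jobs
        GROUP BY site_id, title, company
    ) AS keepers
);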

However, a better approach does exist to remove duplicates, one that is both efficient and reliable, and that can easily be adapted to different needs.


The general idea is to create a new temporary table, usually adding a unique constraint to avoid further duplicates, and to INSERT the data from your former table into the new one while taking care of the duplicates. This approach relies on simple MySQL INSERT queries, creates a new constraint to avoid further duplicates, and skips the need for an inner query to search for duplicates and for a temporary table that has to be kept in memory (thus fitting big data sources too).


This is how it can be achieved. Given we have a table employee, with the following columns:


employee (id, first_name, last_name, start_date, ssn)

In order to delete the rows with a duplicate ssn column, keeping only the first entry found, the following process can be followed:


-- create a new tmp_employee table
CREATE TABLE tmp_employee LIKE employee;

-- add a unique constraint
ALTER TABLE tmp_employee ADD UNIQUE(ssn);

-- scan over the employee table to insert employee entries
INSERT IGNORE INTO tmp_employee SELECT * FROM employee ORDER BY id;

-- rename tables
RENAME TABLE employee TO backup_employee, tmp_employee TO employee;

Technical explanation


  • Line #1 creates a new tmp_employee table with exactly the same structure as the employee table
  • Line #2 adds a UNIQUE constraint to the new tmp_employee table to avoid any further duplicates
  • Line #3 scans over the original employee table by id, inserting new employee entries into the new tmp_employee table, while ignoring duplicated entries
  • Line #4 renames the tables, so that the new employee table holds all the entries without duplicates, and a backup copy of the former data is kept in the backup_employee table

Using this approach, 1.6M records were reduced to 6k in less than 200 seconds.


Chetan, following this process, you could quickly and easily remove all your duplicates and create a UNIQUE constraint by running:


CREATE TABLE tmp_jobs LIKE jobs;

ALTER TABLE tmp_jobs ADD UNIQUE(site_id, title, company);

INSERT IGNORE INTO tmp_jobs SELECT * FROM jobs ORDER BY id;

RENAME TABLE jobs TO backup_jobs, tmp_jobs TO jobs;

Of course, this process can be further modified to adapt it for different needs when deleting duplicates. Some examples follow.


Variation for keeping the last entry instead of the first one


Sometimes we need to keep the last duplicated entry instead of the first one.


CREATE TABLE tmp_employee LIKE employee;

ALTER TABLE tmp_employee ADD UNIQUE(ssn);

INSERT IGNORE INTO tmp_employee SELECT * FROM employee ORDER BY id DESC;

RENAME TABLE employee TO backup_employee, tmp_employee TO employee;
  • On line #3, the ORDER BY id DESC clause gives the last IDs priority over the rest

Variation for performing some tasks on the duplicates, for example keeping a count of the duplicates found


Sometimes we need to perform some further processing on the duplicated entries that are found (such as keeping a count of the duplicates).


CREATE TABLE tmp_employee LIKE employee;

ALTER TABLE tmp_employee ADD UNIQUE(ssn);

ALTER TABLE tmp_employee ADD COLUMN n_duplicates INT DEFAULT 0;

INSERT INTO tmp_employee (id, first_name, last_name, start_date, ssn) SELECT id, first_name, last_name, start_date, ssn FROM employee ORDER BY id ON DUPLICATE KEY UPDATE n_duplicates = n_duplicates + 1;

RENAME TABLE employee TO backup_employee, tmp_employee TO employee;
  • On line #3, a new column n_duplicates is created
  • On line #4, the INSERT INTO ... ON DUPLICATE KEY UPDATE query is used to perform an additional update when a duplicate is found (in this case, increasing a counter). The INSERT INTO ... ON DUPLICATE KEY UPDATE query can be used to perform different kinds of updates for the duplicates found; see the sketch after this list.
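
For instance, a hypothetical variation (a sketch assuming the same tmp_employee table, the UNIQUE key on ssn, and a start_date column that is never NULL) could also keep the most recent start_date seen among the duplicates while counting them:

INSERT INTO tmp_employee (id, first_name, last_name, start_date, ssn)
SELECT id, first_name, last_name, start_date, ssn
FROM employee
ORDER BY id
ON DUPLICATE KEY UPDATE
    n_duplicates = n_duplicates + 1,
    -- VALUES(start_date) refers to the value the rejected duplicate tried to insert
    start_date = GREATEST(start_date, VALUES(start_date));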

Variation for regenerating the auto-incremental field id


Sometimes we use an auto-incremental field and, in order to keep the index as compact as possible, we can take advantage of the deletion of the duplicates to regenerate the auto-incremental field in the new temporary table.


CREATE TABLE tmp_employee LIKE employee;

ALTER TABLE tmp_employee ADD UNIQUE(ssn);

INSERT IGNORE INTO tmp_employee (first_name, last_name, start_date, ssn) SELECT first_name, last_name, start_date, ssn FROM employee ORDER BY id;

RENAME TABLE employee TO backup_employee, tmp_employee TO employee;
  • On line #3, instead of selecting all the fields on the table, the id field is skipped so that the DB engine generates a new one automatically

Further variations


Many further modifications are also doable, depending on the desired behavior. As an example, the following queries use a second temporary table to 1) keep the last entry instead of the first one, 2) increase a counter for the duplicates found, and 3) regenerate the auto-incremental field id while keeping the entry order of the former data.


CREATE TABLE tmp_employee LIKE employee;

ALTER TABLE tmp_employee ADD UNIQUE(ssn);

ALTER TABLE tmp_employee ADD COLUMN n_duplicates INT DEFAULT 0;

INSERT INTO tmp_employee (id, first_name, last_name, start_date, ssn) SELECT id, first_name, last_name, start_date, ssn FROM employee ORDER BY id DESC ON DUPLICATE KEY UPDATE n_duplicates = n_duplicates + 1;

CREATE TABLE tmp_employee2 LIKE tmp_employee;

INSERT INTO tmp_employee2 (first_name, last_name, start_date, ssn, n_duplicates) SELECT first_name, last_name, start_date, ssn, n_duplicates FROM tmp_employee ORDER BY id;

DROP TABLE tmp_employee;

RENAME TABLE employee TO backup_employee, tmp_employee2 TO employee;

Answer by Mostafa -T

There is another solution:


DELETE t1 FROM my_table t1, my_table t2 WHERE t1.id < t2.id AND t1.my_field = t2.my_field AND t1.my_field_2 = t2.my_field_2 AND ...
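
Applied to the jobs table from the question, that pattern might look like the sketch below (rows with the lower id in each pair are deleted, so the newest copy of each group survives):

DELETE t1 FROM jobs t1, jobs t2
WHERE t1.id < t2.id
  AND t1.title = t2.title
  AND t1.company = t2.company
  AND t1.site_id = t2.site_id;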

Answer by faisalbhagat

If you have a large table with a huge number of records, the above solutions may not work or may take too much time. In that case we have a different solution:


-- Create temporary table

CREATE TABLE temp_table LIKE table1;

-- Add constraint
ALTER TABLE temp_table ADD UNIQUE(title, company, site_id);

-- Copy data
INSERT IGNORE INTO temp_table SELECT * FROM table1;

-- Rename and drop
RENAME TABLE table1 TO old_table1, temp_table TO table1;
DROP TABLE old_table1;

Answer by Nav

A faster way is to insert distinct rows into a temporary table. Using DELETE, it took me a few hours to remove duplicates from a table of 8 million rows. Using INSERT and DISTINCT, it took just 13 minutes.


CREATE TABLE tempTableName LIKE tableName;  
CREATE INDEX ix_all_id ON tableName(cellId,attributeId,entityRowId,value);  
INSERT INTO tempTableName(cellId,attributeId,entityRowId,value) SELECT DISTINCT cellId,attributeId,entityRowId,value FROM tableName;  
TRUNCATE TABLE tableName;
INSERT INTO tableName SELECT * FROM tempTableName; 
DROP TABLE tempTableName;  

Answer by Eduardo Rascon

I have this query snippet for SQL Server, but I think it can be used in other DBMSs with little change:


DELETE
FROM Table
WHERE Table.idTable IN  (  
    SELECT MAX(idTable)
    FROM Table
    GROUP BY field1, field2, field3
    HAVING COUNT(*) > 1)

I forgot to tell you that this query doesn't remove the row with the lowest id among the duplicated rows. If that works for you, try this query:


DELETE
FROM jobs
WHERE jobs.id IN  (  
    SELECT MAX(id)
    FROM jobs
    GROUP BY site_id, company, title, location
    HAVING COUNT(*) > 1)
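
Note that in MySQL this second query raises error #1093, since it selects from the same table it deletes from. A sketch of the usual workaround wraps the subquery in a derived table; also note that, like the original, it only removes one extra copy per group per run, so it may need to be repeated until no rows are affected:

DELETE
FROM jobs
WHERE jobs.id IN (
    SELECT id FROM (
        SELECT MAX(id) AS id
        FROM jobs
        GROUP BY site_id, company, title, location
        HAVING COUNT(*) > 1
    ) AS dup
);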

Answer by Rico Nguyen

I found a simple way (it keeps the latest row):


DELETE t1 FROM tablename t1 INNER JOIN tablename t2 
WHERE t1.id < t2.id AND t1.column1 = t2.column1 AND t1.column2 = t2.column2;
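
To keep the earliest row instead, the comparison can simply be flipped (a sketch under the same assumptions as above):

DELETE t1 FROM tablename t1 INNER JOIN tablename t2
WHERE t1.id > t2.id AND t1.column1 = t2.column1 AND t1.column2 = t2.column2;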