MySQL: quickly remove duplicates from a big database

Disclaimer: the content below comes from a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute the original authors (not me). Original question: http://stackoverflow.com/questions/1651999/


MySQL remove duplicates from big database quick

Tags: sql, mysql, duplicates

Asked by bizzz

I've got a big (over a million rows) MySQL database messed up by duplicates. I think anywhere from 1/4 to 1/2 of the whole db is filled with them. I need to get rid of them quickly (I mean query execution time). Here's how it looks:
id (index) | text1 | text2 | text3
The text1 & text2 combination should be unique; if there are any duplicates, only one combination with text3 NOT NULL should remain. Example:


1 | abc | def | NULL  
2 | abc | def | ghi  
3 | abc | def | jkl  
4 | aaa | bbb | NULL  
5 | aaa | bbb | NULL  

...becomes:


1 | abc | def | ghi   #(doesn't really matter whether id:2 or id:3 survives)
2 | aaa | bbb | NULL  #(if there's no NOT NULL text3, NULL will do)

New ids could be anything; they do not depend on the old table ids.
I've tried things like:


CREATE TABLE tmp SELECT text1, text2, text3
FROM my_tbl
GROUP BY text1, text2;
DROP TABLE my_tbl;
ALTER TABLE tmp RENAME TO my_tbl;

Or SELECT DISTINCT and other variations.
While they work on small databases, query execution time on mine is just huge (it never actually finished; > 20 min).


Is there any faster way to do that? Please help me solve this problem.


Answered by ???u

I believe this will do it, using on duplicate key + ifnull():


create table tmp like yourtable;

alter table tmp add unique (text1, text2);

insert into tmp select * from yourtable 
    on duplicate key update text3=ifnull(text3, values(text3));

rename table yourtable to deleteme, tmp to yourtable;

drop table deleteme;

Should be much faster than anything that requires group by or distinct or a subquery, or even order by. This doesn't even require a filesort, which is going to kill performance on a large temporary table. Will still require a full scan over the original table, but there's no avoiding that.

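As a quick sanity check after the rename (a minimal sketch using the answer's table and column names), you can confirm that no (text1, text2) pair appears more than once:

-- should return zero rows once the deduplicated copy is in place
SELECT text1, text2, COUNT(*) AS c
FROM yourtable
GROUP BY text1, text2
HAVING c > 1;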

Answered by liorq

Found this simple 1-line code to do exactly what I needed:


ALTER IGNORE TABLE dupTest ADD UNIQUE INDEX(a,b);

Taken from: http://mediakey.dk/~cc/mysql-remove-duplicate-entries/

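Note that the IGNORE clause of ALTER TABLE was removed in MySQL 5.7, so this statement no longer parses on current servers. A rough equivalent on newer versions (a sketch only; dupTest_fixed is a hypothetical table name) is the copy-through-a-unique-index pattern used in other answers here:

-- build a deduplicated copy, then swap it in with RENAME TABLE once it looks right
create table dupTest_fixed like dupTest;
alter table dupTest_fixed add unique index (a, b);
insert ignore into dupTest_fixed select * from dupTest;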

Answered by Kevin Peno

DELETE FROM dups
WHERE id NOT IN(
    SELECT id FROM (
        SELECT DISTINCT id, text1, text2
            FROM dups
        GROUP BY text1, text2
        ORDER BY text3 DESC
    ) as tmp
)

This queries all records, groups by the distinguishing fields, and orders by text3 descending (meaning we pick a row with a non-NULL text3 first, when one exists). Then we select the ids from that result (these are the good ids; they won't be deleted) and delete all ids that AREN'T those.


Any query like this that affects the entire table will be slow. You just need to run it once and let it finish, so you can prevent the problem in the future.


After you have done this "fix" I would apply a UNIQUE INDEX (text1, text2) to that table to prevent the possibility of duplicates in the future.

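A one-line sketch of that guard, using the question's table and column names (the index name is arbitrary):

-- rejects any future insert that would duplicate a (text1, text2) pair
ALTER TABLE my_tbl ADD UNIQUE INDEX uniq_text1_text2 (text1, text2);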

If you want to go the "create a new table and replace the old one" route, you can use the inner select statement to create your insert statement.


MySQL specific (assumes new table is named my_tbl2 and has exactly the same structure):


INSERT INTO my_tbl2
    SELECT DISTINCT id, text1, text2, text3
            FROM dups
        GROUP BY text1, text2
        ORDER BY text3 DESC

See MySQL INSERT ... SELECT for more information.


Answered by Gadelkareem

remove duplicates without removing foreign keys


create table tmp like mytable;
ALTER TABLE tmp ADD UNIQUE INDEX(text1, text2, text3, text4, text5, text6);
insert IGNORE into tmp select * from mytable;
delete from mytable where id not in ( select id from tmp);

Answered by Scott Saunders

If you can create a new table, do so with a unique key on the text1 + text2 fields. Then insert into the table ignoring errors (using the INSERT IGNORE syntax):


select * from my_tbl order by text3 desc
  • I think the order by text3 desc will put the NULLs last, but double check that.
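A minimal sketch of the full statement that approach describes (new_tbl is a hypothetical name for the new table carrying the unique key):

CREATE TABLE new_tbl LIKE my_tbl;
ALTER TABLE new_tbl ADD UNIQUE (text1, text2);
-- rows arrive ordered by text3 DESC, so for each (text1, text2) pair the first
-- row (one with a non-NULL text3, if any) is kept and later duplicates are ignored
INSERT IGNORE INTO new_tbl
SELECT * FROM my_tbl ORDER BY text3 DESC;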

Indexes on all those columns could help a lot, but creating them now could be pretty slow.


Answered by user1931858

For large tables with few duplicates, you may want to avoid copying the whole table to another place. One way is to create a temporary table holding the rows you want to keep (for each key with duplicates), and then delete duplicates from the original table.


An example is given here.

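A minimal sketch of that idea using the question's schema (the `keep` table and `keep_id` column are hypothetical names; preferring the non-NULL text3 survivor would need an extra step):

-- one row per duplicated (text1, text2) key, holding the id we want to keep
CREATE TEMPORARY TABLE keep AS
SELECT MIN(id) AS keep_id, text1, text2
FROM my_tbl
GROUP BY text1, text2
HAVING COUNT(*) > 1;

-- remove every duplicate except the kept row; non-duplicated keys are untouched
DELETE my_tbl
FROM my_tbl
JOIN keep
  ON my_tbl.text1 = keep.text1
 AND my_tbl.text2 = keep.text2
 AND my_tbl.id <> keep.keep_id;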

Answered by JDuarteDJ

I know this is an old thread, but I have a somewhat messy method that is much faster and customizable; in terms of speed I'd say 10 sec instead of 100 sec (10:1).


My method does require all that messy stuff you were trying to avoid:


  • Group by (and Having)
  • group concat with ORDER BY
  • 2 temporary tables
  • using files on disk!
  • somehow (php?) deleting the file after

But when you are talking about MILLIONS (or in my case Tens of Millions) it's worth it.


Anyway, there's not much explanation, but here is my sample (comments translated from Portuguese):


EDIT: if I get comments I'll explain further how it works :)


START TRANSACTION;

DROP temporary table if exists to_delete;

CREATE temporary table to_delete as (
    SELECT
        -- pick all the duplicated IDs except the ones that stay in the DB
        -- the order in which IDs are picked is given by "ORDER BY campo_ordenacao DESC"; the first one is the one that stays
        right(
            group_concat(id ORDER BY campos_ordenacao DESC SEPARATOR ','),
            length(group_concat(id ORDER BY campos_ordenacao DESC SEPARATOR ',')) 
                - locate(",",group_concat(id ORDER BY campos_ordenacao DESC SEPARATOR ','))
        ) as ids,

        count(*) as c

    -- table to deduplicate
    FROM teste_dup

    -- columns used to identify duplicates
    group by test_campo1, test_campo2, teste_campoN
    having count(*) > 1 -- it is a duplicate
);

-- raise the limit of this system variable to the max
SET SESSION group_concat_max_len=4294967295;

-- write all the ids to be deleted to a file
select group_concat(ids SEPARATOR ',') from to_delete INTO OUTFILE 'sql.dat';

DROP temporary table if exists del3;
create temporary table del3 as (select CAST(1 as signed) as ix LIMIT 0);

-- load the ids to be deleted from the file into a temporary table
load data infile 'sql.dat' INTO TABLE del3
LINES TERMINATED BY ',';

alter table del3 add index(ix);

-- delete the selected ids
DELETE teste_dup -- table
from teste_dup -- table

join del3 on id=ix;

COMMIT;

Answered by kamran Sheikh

You can remove all the duplicate entries by using this simple query. It matches each record against any later duplicate and removes it, keeping only one copy (adapt the table and column names to your own data):


DELETE i1
FROM your_table i1
LEFT JOIN your_table i2
  ON i1.customer_invoice_id = i2.customer_invoice_id
 AND i1.id < i2.id
WHERE i2.customer_invoice_id IS NOT NULL

Answered by redcayuga

I don't have much experience with MySQL. If it has analytic functions try:


delete from my_tbl
 where id in (
     select id 
       from (select id, row_number()
                            over (partition by text1, text2 order by text3 desc) as rn
               from my_tbl
               /* optional: where text1 like 'a%'  */
             ) as t2
       where rn > 1
     )

The optional where clause means you'll have to run it multiple times, once for each letter, etc. Create an index on text1?


Before running this, confirm that "text3 desc" will sort nulls last in MySQL.

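MySQL 8.0 and later do support window functions like row_number(), and the "double check" above holds there: ORDER BY ... DESC sorts NULLs last, so a non-NULL text3 gets rn = 1 whenever one exists. A one-line check you can run anywhere:

SELECT x FROM (SELECT NULL AS x UNION ALL SELECT 'ghi') AS t ORDER BY x DESC;
-- returns 'ghi' first and NULL last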