Deleting millions of rows in MySQL

Warning: these answers are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): StackOverflow, original question: http://stackoverflow.com/questions/1318972/

Deleting millions of rows in MySQL

mysql, query-performance, maintenance, sql-delete

Asked by Steven Surowiec

I recently found and fixed a bug in a site I was working on that resulted in millions of duplicate rows of data in a table that will be quite large even without them (still in the millions). I can easily find these duplicate rows and can run a single delete query to kill them all. The problem is that trying to delete this many rows in one shot locks up the table for a long time, which I would like to avoid if possible. The only ways I can see to get rid of these rows, without taking down the site (by locking up the table) are:

  1. Write a script that will execute thousands of smaller delete queries in a loop. This will theoretically get around the locked table issue because other queries will be able to make it into the queue and run in between the deletes. But it will still spike the load on the database quite a bit and will take a long time to run.
  2. Rename the table and recreate the existing table (it'll now be empty). Then do my cleanup on the renamed table. Rename the new table, name the old one back and merge the new rows into the renamed table. This way takes considerably more steps, but should get the job done with minimal interruption. The only tricky part here is that the table in question is a reporting table, so once it's renamed out of the way and the empty one put in its place all historic reports go away until I put it back in place. Plus the merging process could be a bit of a pain because of the type of data being stored. Overall this is my likely choice right now. (A rough sketch of the initial swap is shown after this list.)
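
For option 2, the initial swap might look something like the sketch below. This is only a rough outline with hypothetical names (my_database, report_data); MySQL's RENAME TABLE swaps both names in a single atomic statement, so writes move to the empty table the instant it runs:

mysql my_database <<'SQL'
-- empty twin with the identical schema (indexes included)
CREATE TABLE report_data_new LIKE report_data;
-- atomic swap: new rows now land in the empty table
RENAME TABLE report_data     TO report_data_old,
             report_data_new TO report_data;
SQL

After that, the duplicates can be cleaned out of report_data_old offline, the rows that accumulated in the meantime merged back in, and the tables swapped again.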

I was just wondering if anyone else has had this problem before and, if so, how you dealt with it without taking down the site and, hopefully, with minimal if any interruption to the users? If I go with number 2, or a different, similar, approach, I can schedule the stuff to run late at night and do the merge early the next morning and just let the users know ahead of time, so that's not a huge deal. I'm just looking to see if anyone has any ideas for a better, or easier, way to do the cleanup.

Answered by chaos

DELETE FROM `table`
WHERE (whatever criteria)
ORDER BY `id`
LIMIT 1000

Wash, rinse, repeat until zero rows affected. Maybe in a script that sleeps for a second or three between iterations.

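A minimal shell version of that loop might look like this (my_database, my_table and the is_duplicate criterion are placeholders for your own table and WHERE clause):

# Keep issuing the bounded DELETE until it stops affecting rows; ROW_COUNT()
# reports how many rows the DELETE in the same session just removed.
while true; do
    deleted=$(mysql -N my_database -e \
        "DELETE FROM my_table WHERE is_duplicate = 1 ORDER BY id LIMIT 1000; SELECT ROW_COUNT();")
    [ "$deleted" -eq 0 ] && break
    sleep 2   # give other queries a turn between batches
done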

Answered by user1459144

I had a use case of deleting 1M+ rows from a 25M+ row table in MySQL. I tried different approaches, such as the batch deletes described above.
I found that the fastest way is to copy the required records to a new table:

  1. Create a temporary table that holds just the ids.

CREATE TABLE id_temp_table ( temp_id int);

  2. Insert ids that should be removed:

insert into id_temp_table (temp_id) select.....

  3. Create a new table, table_new.

  4. Insert all records from the original table into table_new, skipping the unwanted rows that are listed in id_temp_table:

insert into table_new .... where table_id NOT IN (select distinct(temp_id) from id_temp_table);

  5. Rename the tables (a combined sketch of steps 3-5 follows).
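
A combined sketch of steps 3-5 (the database and table names are placeholders; the INSERT mirrors the one above, and RENAME TABLE performs the swap atomically):

mysql my_database <<'SQL'
-- step 3: empty table with the same schema
CREATE TABLE table_new LIKE original_table;
-- step 4: copy only the rows we want to keep
INSERT INTO table_new
    SELECT * FROM original_table
    WHERE table_id NOT IN (SELECT DISTINCT temp_id FROM id_temp_table);
-- step 5: atomic swap; the old data stays around as table_old until you drop it
RENAME TABLE original_table TO table_old,
             table_new      TO original_table;
SQL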

The whole process took ~1 hour. In my use case, a simple batched delete of 100 records at a time took 10 minutes.

Answered by duffymo

I'd also recommend adding some constraints to your table to make sure that this doesn't happen to you again. A million rows, at 1000 per shot, will take 1000 repetitions of a script to complete. If the script runs once every 3.6 seconds you'll be done in an hour. No worries. Your clients are unlikely to notice.

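As for the constraints, a hedged sketch of the sort of thing meant here (the table and column names are hypothetical, and the key can only be added once the existing duplicates are gone, otherwise the ALTER will fail):

# Unique key over whatever column combination defines "duplicate" in your schema,
# so the database itself rejects repeats from now on.
mysql my_database -e \
    "ALTER TABLE my_table ADD UNIQUE KEY uniq_natural_row (report_id, reported_at);"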

Answered by rich

The following deletes 1,000,000 records, one at a time.

 for i in `seq 1 1000`; do 
     mysql -e "select id from table_name where (condition) order by id desc limit 1000 " | sed 's;/|;;g' | awk '{if(NR>1)print "delete from table_name where id = ",$1,";" }' | mysql; 
 done

You could also group them together and do DELETE FROM table_name WHERE id IN (id1, id2, ..., idN) without much difficulty, I'm sure.

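A sketch of that grouped variant, reusing the same pipeline (it assumes a numeric id column, and (condition) is whatever filter you already have):

# One IN-list DELETE per 1000-row batch instead of one statement per row.
# -N drops the column header, paste joins the ids with commas, awk skips empty batches.
for i in `seq 1 1000`; do
    mysql -N -e "select id from table_name where (condition) order by id desc limit 1000" \
        | paste -sd, - \
        | awk 'NF {print "delete from table_name where id in (" $0 ");"}' \
        | mysql
done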

Answered by casey

I'd use mk-archiver from the excellent Maatkit utilities package (a bunch of Perl scripts for MySQL management). Maatkit is from Baron Schwartz, the author of the O'Reilly "High Performance MySQL" book.

The goal is a low-impact, forward-only job to nibble old data out of the table without impacting OLTP queries much. You can insert the data into another table, which need not be on the same server. You can also write it to a file in a format suitable for LOAD DATA INFILE. Or you can do neither, in which case it's just an incremental DELETE.

It's already built for archiving your unwanted rows in small batches and as a bonus, it can save the deleted rows to a file in case you screw up the query that selects the rows to remove.

No installation required: just grab http://www.maatkit.org/get/mk-archiver and run perldoc on it (or read the web site) for documentation.

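As a hedged sketch of what an invocation might look like (the host, database, table and WHERE clause are placeholders; check perldoc mk-archiver for the authoritative option list before running anything like this):

# Purge matching rows in 1000-row transactions, pausing between batches so the
# OLTP traffic barely notices; the perldoc describes options for archiving the
# rows to another table or a file instead of purging them.
mk-archiver \
    --source h=localhost,D=my_database,t=my_table \
    --where "created < NOW() - INTERVAL 90 DAY" \
    --purge \
    --limit 1000 --commit-each \
    --sleep 1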

Answered by Mukul Bansal

I faced a similar problem. We had a really big table, about 500 GB in size, with no partitioning and only one index, on the primary_key column. Our master was a hulk of a machine, 128 cores and 512 GB of RAM, and we had multiple slaves too. We tried a few techniques to tackle the large-scale deletion of rows. I will list them all here, from worst to best, as we found them:

  1. Fetching and Deleting one row at a time. This is the absolute worst that you could do. So, we did not even try this.
  2. Fetching first 'X' rows from the database using a limit query on the primary_key column, then checking the row ids to delete in the application and firing a single delete query with a list of primary_key ids. So, 2 queries per 'X' rows. Now, this approach was fine but doing this using a batch job deleted about 5 million rows in 10 minutes or so, due to which the slaves of our MySQL DB were lagged by 105 seconds. 105-second lag in 10-minute activity. So, we had to stop.
  3. In this technique, we introduced a 50 ms lag between our subsequent batch fetch and deletions of size 'X' each. This solved the lag problem but we were now deleting 1.2-1.3 million rows per 10 minutes as compared to 5 million in technique #2.
  4. Partitioning the database table and then dropping entire partitions when they are no longer needed. This is the best solution we have, but it requires a pre-partitioned table. We went with approach #3 because we had a very old, non-partitioned table with indexing only on the primary_key column; creating partitions would have taken too much time and we were in crisis mode. Here are some links related to partitioning that I found helpful: Official MySQL Reference, Oracle DB daily partitioning. (A minimal partitioning sketch follows this list.)
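
A minimal sketch of what option #4 looks like in practice, with a hypothetical daily RANGE-partitioned table (note that the partitioning column has to be part of every unique key, which is why created_at is folded into the primary key here):

mysql my_database <<'SQL'
CREATE TABLE events (
    id         BIGINT NOT NULL AUTO_INCREMENT,
    created_at DATETIME NOT NULL,
    payload    TEXT,
    PRIMARY KEY (id, created_at)
)
PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p20200829 VALUES LESS THAN (TO_DAYS('2020-08-30')),
    PARTITION p20200830 VALUES LESS THAN (TO_DAYS('2020-08-31')),
    PARTITION pmax      VALUES LESS THAN MAXVALUE
);

-- Removing a whole day of data is then a near-instant metadata operation:
ALTER TABLE events DROP PARTITION p20200829;
SQL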

So, IMO, if you can afford the luxury of creating partitions in your table, go for option #4; otherwise, you are stuck with option #3.

Answered by Tho

Here's the recommended practice:

rows_affected = 0
do {
 rows_affected = do_query(
   "DELETE FROM messages WHERE created < DATE_SUB(NOW(),INTERVAL 3 MONTH)
   LIMIT 10000"
 )
} while rows_affected > 0

Deleting 10,000 rows at a time is typically a large enough task to make each query efficient, and a short enough task to minimize the impact on the server (transactional storage engines might benefit from smaller transactions). It might also be a good idea to add some sleep time between the DELETE statements to spread the load over time and reduce the amount of time locks are held.

Reference: High Performance MySQL

Answered by cherouvim

Do it in batches of, let's say, 2000 rows at a time. Commit in between. A million rows isn't that much, and this will be fast, unless you have many indexes on the table.

Answered by Birchlabs

For us, the DELETE WHERE %s ORDER BY %s LIMIT %d answer was not an option, because the WHERE criterion was slow (a non-indexed column) and would hit the master.

SELECT from a read-replica a list of primary keys that you wish to delete. Export with this kind of format:

00669163-4514-4B50-B6E9-50BA232CA5EB
00679DE5-7659-4CD4-A919-6426A2831F35
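
Something like the following can produce that file on the read replica (the host, table and WHERE clause are placeholders; -N and -B strip the header and formatting so you get one bare key per line):

# Dump the primary keys of the rows to delete, one per line, into ids.txt
mysql -h my-read-replica -N -B \
    -e "SELECT id FROM my_cool_table WHERE created < '2020-01-01'" \
    > ids.txt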

Use the following bash script to grab this input and chunk it into DELETE statements [requires bash ≥ 4 because of the mapfile built-in]:

sql-chunker.sh (remember to chmod +x it, and change the shebang to point to your bash 4 executable):

#!/usr/local/Cellar/bash/4.4.12/bin/bash

# Expected input format:
: <<!
00669163-4514-4B50-B6E9-50BA232CA5EB
00669DE5-7659-4CD4-A919-6426A2831F35
!

if [ -z "" ]
  then
    echo "No chunk size supplied. Invoke: ./sql-chunker.sh 1000 ids.txt"
fi

if [ -z "" ]
  then
    echo "No file supplied. Invoke: ./sql-chunker.sh 1000 ids.txt"
fi

function join_by {
    local d=$1
    shift
    echo -n ""
    shift
    printf "%s" "${@/#/$d}"
}

while mapfile -t -n "" ary && ((${#ary[@]})); do
    printf "DELETE FROM my_cool_table WHERE id IN ('%s');\n" `join_by "','" "${ary[@]}"`
done < ""

Invoke like so:

./sql-chunker.sh 1000 ids.txt > batch_1000.sql

This will give you a file with output formatted like so (I've used a batch size of 2):

DELETE FROM my_cool_table WHERE id IN ('006CC671-655A-432E-9164-D3C64191EDCE','006CD163-794A-4C3E-8206-D05D1A5EE01E');
DELETE FROM my_cool_table WHERE id IN ('006CD837-F1AD-4CCA-82A4-74356580CEBC','006CDA35-F132-4F2C-8054-0F1D6709388A');

Then execute the statements like so:

mysql --login-path=master billing < batch_1000.sql

For those unfamiliar with login-path, it's just a shortcut for logging in without typing a password on the command line.

Answered by npocmaka

I had a really loaded database from which I needed to delete older entries all the time. Some of the delete queries started to hang, so I needed to kill them, and if there were too many deletes the whole database became unresponsive, so I needed to restrict the number of parallel runs. So I created a cron job that runs every minute and starts this script:

#!/bin/bash

#######################
#
i_size=1000
max_delete_queries=10
sleep_interval=15
min_operations=8
max_query_time=1000

USER="user"
PASS="super_secret_password"

log_max_size=1000000
log_file="/var/tmp/clean_up.log"
#
#######################

touch $log_file
log_file_size=`stat -c%s "$log_file"`
if (( $log_file_size > $log_max_size ))
then
    rm -f "$log_file"
fi 

delete_queries=`mysql -u user -p$PASS -e  "SELECT * FROM information_schema.processlist WHERE Command = 'Query' AND INFO LIKE 'DELETE FROM big.table WHERE result_timestamp %';"| grep Query|wc -l`

## -- here the hanging DELETE queries will be stopped
mysql -u $USER -p$PASS -e "SELECT ID FROM information_schema.processlist WHERE Command = 'Query' AND INFO LIKE 'DELETE FROM big.table WHERE result_timestamp %' AND TIME>$max_query_time;" |grep -v ID| while read -r id ; do
    echo "delete query stopped on `date`" >>  $log_file
    mysql -u $USER -p$PASS -e "KILL $id;"
done

if (( $delete_queries > $max_delete_queries ))
then
  sleep $sleep_interval

  delete_queries=`mysql -u $USER -p$PASS -e  "SELECT * FROM information_schema.processlist WHERE Command = 'Query' AND INFO LIKE 'DELETE FROM big.table WHERE result_timestamp %';"| grep Query|wc -l`

  if (( $delete_queries > $max_delete_queries ))
  then

      sleep $sleep_interval

      delete_queries=`mysql -u $USER -p$PASS -e  "SELECT * FROM information_schema.processlist WHERE Command = 'Query' AND INFO LIKE 'DELETE FROM big.table WHERE result_timestamp %';"| grep Query|wc -l`

      # -- if there are too many delete queries after the second wait
      #  the table will be cleaned up by the next cron job
      if (( $delete_queries > $max_delete_queries ))
        then
            echo "clean-up skipped on `date`" >> $log_file
            exit 1
        fi
  fi

fi

running_operations=`mysql -u $USER -p$PASS -e "SELECT * FROM INFORMATION_SCHEMA.PROCESSLIST WHERE COMMAND != 'Sleep';"| wc -l`

if (( $running_operations < $min_operations ))
then
    # -- if the database is not too busy this bigger batch can be processed
    batch_size=$(($i_size * 5))
else 
    batch_size=$i_size
fi

echo "starting clean-up on `date`" >>  $log_file

mysql -u $USER -p$PASS -e 'DELETE FROM big.table WHERE result_timestamp < UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 31 DAY))*1000 limit '"$batch_size"';'

if [ $? -eq 0 ]; then
    # -- if the sql command exited normally the exit code will be 0
    echo "delete finished successfully on `date`" >>  $log_file
else
    echo "delete failed on `date`" >>  $log_file
fi

With this I've achieved around 2 million deletes per day, which was OK for my use case.
