DELETE SQL with correlated subquery for table with 42 million rows?

Warning: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must keep the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/3428345/

sql, sql-server, sql-delete, correlated-subquery

Asked by JohnB

I have a table cats with 42,795,120 rows.

Apparently this is a lot of rows. So when I do:

/* owner_cats is a many-to-many join table */
DELETE FROM cats
WHERE cats.id_cat IN (
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)

the query times out :(

(edit: I need to increase my CommandTimeout value, default is only 30 seconds)

I can't use TRUNCATE TABLE cats because I don't want to blow away cats from other owners.

I'm using SQL Server 2005 with "Recovery model" set to "Simple."

So, I thought about doing something like this (executing this SQL from an application btw):

DELETE TOP (25) PERCENT FROM cats
WHERE cats.id_cat IN (
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)

DELETE TOP(50) PERCENT FROM cats
WHERE cats.id_cat IN (
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)

DELETE FROM cats
WHERE cats.id_cat IN (
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)

My question is: what is the threshold of the number of rows I can DELETE in SQL Server 2005?

Or, if my approach is not optimal, please suggest a better approach. Thanks.

This post didn't help me enough:

EDIT (8/6/2010):

Okay, I just realized after reading the above link again that I did not have indexes on these tables. Also, some of you have already pointed out that issue in the comments below. Keep in mind this is a fictitious schema, so even id_cat is not a PK, because in my real-life schema it's not a unique field.

I will put indexes on:

  1. cats.id_cat
  2. owner_cats.id_cat
  3. owner_cats.id_owner
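
For reference, a minimal sketch of that index DDL (the index names are illustrative, not from the original post):

CREATE INDEX IX_cats_id_cat ON cats (id_cat);
CREATE INDEX IX_owner_cats_id_cat ON owner_cats (id_cat);
CREATE INDEX IX_owner_cats_id_owner ON owner_cats (id_owner);
-- A single composite index on owner_cats (id_owner, id_cat) could serve in place of the last two,
-- depending on the real schema and query patterns.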

I guess I'm still getting the hang of this data warehousing, and obviously I need indexes on all the JOIN fields, right?

However, it takes hours for me to do this batch load process. I'm already doing it as a SqlBulkCopy (in chunks, not 42 mil all at once). I have some indexes and PKs. I read the following posts, which confirm my theory that the indexes are slowing down even a bulk copy:

So I'm going to DROP my indexes before the copy and then re-CREATE them when it's done.

Because of the long load times, it's going to take me awhile to test these suggestions. I'll report back with the results.

UPDATE (8/7/2010):

Tom suggested:

DELETE c
  FROM cats c
 WHERE EXISTS (SELECT 1
                 FROM owner_cats o
                WHERE o.id_cat = c.id_cat
                  AND o.id_owner = 1)

Still with no indexes, for 42 million rows it took 13:21 min:sec versus 22:08 with the way described above. However, for 13 million rows it took 2:13 versus 2:10 my old way. It's a neat idea, but I still need to use indexes!

Update (8/8/2010):

Something is terribly wrong! Now with the indexes on, my first delete query above took 1:09 hrs:min (yes, an hour!) versus 22:08 min:sec, and 13:21 min:sec versus 2:10 min:sec, for 42 mil rows and 13 mil rows respectively. I'm going to try Tom's query with the indexes now, but this is heading in the wrong direction. Please help.

Update (8/9/2010):

Tom's delete took 1:06 hrs:min for 42 mil rows and 10:50 min:sec for 13 mil rows with indexes, versus 13:21 min:sec and 2:13 min:sec respectively. Deletes are taking longer on my database when I use indexes, by an order of magnitude! I think I know why: my database .mdf and .ldf grew from 3.5 GB to 40.6 GB during the first (42 mil) delete! What am I doing wrong?

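One way to check how much of that growth is transaction log (standard commands, not something from the original post):

DBCC SQLPERF(LOGSPACE);    -- log size and percent used for every database
SELECT name, type_desc, size * 8 / 1024 AS size_mb
FROM sys.database_files;   -- data vs. log file sizes for the current database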

Update (8/10/2010):

For lack of any other options, I have come up with what I feel is a lackluster solution (hopefully temporary):

  1. Increase the timeout for the database connection to 1 hour (CommandTimeout=60000; default was 30 sec)
  2. Use Tom's query: DELETE FROM WHERE EXISTS (SELECT 1 ...) because it performed a little faster
  3. DROP all indexes and PKs before running the delete statement (???)
  4. Run the DELETE statement
  5. CREATE all indexes and PKs
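
A rough T-SQL sketch of steps 3–5 against the fictitious schema above (the index name is illustrative):

-- 3. Drop the indexes/PKs (one shown; repeat for each)
DROP INDEX IX_cats_id_cat ON cats;

-- 4. Run the delete
DELETE c
  FROM cats c
 WHERE EXISTS (SELECT 1
                 FROM owner_cats o
                WHERE o.id_cat = c.id_cat
                  AND o.id_owner = 1);

-- 5. Recreate the indexes/PKs
CREATE INDEX IX_cats_id_cat ON cats (id_cat);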

Seems crazy, but at least it's faster than using TRUNCATE and starting over my load from the beginning with the first owner_id, because one of my owner_ids takes 2:30 hrs:min to load versus 17:22 min:sec for the delete process I just described with 42 mil rows. (Note: if my load process throws an exception, I start over for that owner_id, but I don't want to blow away the previous owner_ids, so I don't want to TRUNCATE the owner_cats table, which is why I'm trying to use DELETE.)

Any more help would still be appreciated :)

Answer by Dave Markle

There is no practical threshold. It depends on what your command timeout is set to on your connection.

Keep in mind that the time it takes to delete all of these rows is contingent upon:

  • The time it takes to find the rows of interest
  • The time it takes to log the transaction in the transaction log
  • The time it takes to delete the index entries of interest
  • The time it takes to delete the actual rows of interest
  • The time it takes to wait for other processes to stop using the table so you can acquire what in this case will most likely be an exclusive table lock

The last point may often be the most significant. Do an sp_who2 command in another query window to make sure that there isn't lock contention going on, preventing your command from executing.

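For example, in a second query window:

EXEC sp_who2;   -- the BlkBy column shows which SPID, if any, is blocking your DELETE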

Improperly configured SQL Servers will do poorly at this type of query. Transaction logs which are too small and/or share the same disks as the data files will often incur severe performance penalties when working with large numbers of rows.

As for a solution, well, like all things, it depends. Is this something you intend to be doing often? Depending on how many rows you have left, the fastest way might be to rebuild the table as another name and then rename it and recreate its constraints, all inside a transaction. If this is just an ad-hoc thing, make sure your ADO CommandTimeout is set high enough and you can just bear the cost of this big delete.

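A rough sketch of that rebuild-and-rename idea against the question's schema (the intermediate table name is illustrative, and any indexes, constraints, and permissions on the original table have to be re-created by hand):

BEGIN TRANSACTION;

SELECT c.*
INTO cats_keep                     -- new table holding only the rows to keep
FROM cats c
WHERE NOT EXISTS (SELECT 1 FROM owner_cats o
                  WHERE o.id_cat = c.id_cat
                    AND o.id_owner = 1);

DROP TABLE cats;

EXEC sp_rename 'cats_keep', 'cats';

-- recreate indexes, constraints, and permissions here

COMMIT TRANSACTION;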

Answer by Patrick Marchand

If the delete will remove "a significant number" of rows from the table, this can be an alternative to a DELETE: put the records to keep somewhere else, truncate the original table, put back the 'keepers'. Something like:

SELECT *
INTO #cats_to_keep
FROM cats
WHERE cats.id_cat NOT IN (    -- note the NOT
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)

TRUNCATE TABLE cats

INSERT INTO cats
SELECT * FROM #cats_to_keep

Answer by Conrad Frix

Have you tried dropping the subquery and using a join instead?

DELETE c
FROM
 cats c
 INNER JOIN owner_cats oc
 on c.id_cat = oc.id_cat
WHERE
   oc.id_owner = 1

And if you have, have you also tried different join hints, e.g.

DELETE c
FROM
 cats c
 INNER HASH JOIN owner_cats oc
 on c.id_cat = oc.id_cat
WHERE
   oc.id_owner = 1

Answer by Tom

If you use an EXISTS rather than an IN, you should get much better performance. Try this:

DELETE c
  FROM cats c
 WHERE EXISTS (SELECT 1
                 FROM owner_cats o
                WHERE o.id_cat = c.id_cat
                  AND o.id_owner = 1)

Answer by Will A

There's no threshold as such - you can DELETE all the rows from any table given enough transaction log space - which is where your query is most likely falling over. If you're getting some results from your DELETE TOP (n) PERCENT FROM cats WHERE ... then you can wrap it in a loop as below:

SELECT 1
WHILE @@ROWCOUNT <> 0
BEGIN
 DELETE TOP (somevalue) PERCENT FROM cats
 WHERE cats.id_cat IN (
 SELECT owner_cats.id_cat FROM owner_cats
 WHERE owner_cats.id_owner = 1)
END

Answer by Thomas

As others have mentioned, when you delete 42 million rows, the db has to log 42 million deletions against the database. Thus, the transaction log has to grow substantially. What you might try is to break up the delete into chunks. In the following query, I use the NTile ranking function to break up the rows into 100 buckets. If that is too slow, you can expand the number of buckets so that each delete is smaller. It will help tremendously if there is an index on owner_cats.id_owner, owner_cats.id_cat and cats.id_cat (which I assumed is the primary key and numeric).

Declare @Cats Cursor
Declare @CatId int  --assuming an integer PK here
Declare @Start int
Declare @End int
Declare @GroupCount int

Set @GroupCount = 100

Set @Cats = Cursor Fast_Forward For
    With CatHerd As
        (
        Select cats.id_cat
            , NTile(@GroupCount) Over ( Order By cats.id_cat ) As Grp
        From cats
            Join owner_cats
                On owner_cats.id_cat = cats.id_cat
        Where owner_cats.id_owner = 1
        )
        Select Grp, Min(id_cat) As MinCat, Max(id_cat) As MaxCat
        From CatHerd
        Group By Grp
Open @Cats
Fetch Next From @Cats Into @CatId, @Start, @End

While @@Fetch_Status = 0
Begin
    Delete cats
    Where id_cat Between @Start And @End

    Fetch Next From @Cats Into @CatId, @Start, @End
End 

Close @Cats
Deallocate @Cats

The notable catch with the above approach is that it is not transactional. Thus, if it fails on the 40th chunk, you will have deleted 40% of the rows and the other 60% will still exist.

Answer by onedaywhen

Might be worth trying MERGE, e.g.

MERGE INTO cats 
   USING owner_cats
      ON cats.id_cat = owner_cats.id_cat
         AND owner_cats.id_owner = 1
WHEN MATCHED THEN DELETE;

Answer by Andrew Steitz

<Edit> (9/28/2011)
My answer performs basically the same way as Thomas' solution (Aug 6 '10). I missed it when I posted my answer because he uses an actual CURSOR, so I thought to myself "bad" because of the # of records involved. However, when I reread his answer just now I realized that the WAY he uses the cursor is actually "good". Very clever. I just voted up his answer and will probably use his approach in the future. If you don't understand why, take a look at it again. If you still can't see it, post a comment on this answer and I will come back and try to explain in detail. I decided to leave my answer because someone may have a DBA who refuses to let them use an actual CURSOR regardless of how "good" it is. :-)
</Edit>

I realize that this question is a year old but I recently had a similar situation. I was trying to do "bulk" updates to a large table with a join to a different table, also fairly large. The problem was that the join was resulting in so many "joined records" that it took too long to process and could have led to contention problems. Since this was a one-time update I came up with the following "hack." I created a WHILE LOOP that went through the table to be updated and picked 50,000 records to update at a time. It looked something like this:

DECLARE @RecId bigint
DECLARE @NumRecs bigint
SET @NumRecs = (SELECT MAX(Id) FROM [TableToUpdate])
SET @RecId = 1
WHILE @RecId < @NumRecs
BEGIN
    UPDATE [TableToUpdate]
    SET UpdatedOn = GETDATE(),
        SomeColumn = t2.[ColumnInTable2]
    FROM    [TableToUpdate] t
    INNER JOIN [Table2] t2 ON t2.Name = t.DBAName 
        AND ISNULL(t.PhoneNumber,'') = t2.PhoneNumber 
        AND ISNULL(t.FaxNumber, '') = t2.FaxNumber
    LEFT JOIN [Address] d ON d.AddressId = t.DbaAddressId 
        AND ISNULL(d.Address1,'') = t2.DBAAddress1
        AND ISNULL(d.[State],'') = t2.DBAState
        AND ISNULL(d.PostalCode,'') = t2.DBAPostalCode
    WHERE t.Id BETWEEN @RecId AND (@RecId + 49999)
    SET @RecId = @RecId + 50000
END

Nothing fancy but it got the job done. Because it was only processing 50,000 records at a time, any locks that got created were short lived. Also, the optimizer realized that it did not have to do the entire table so it did a better job of picking an execution plan.

<Edit> (9/28/2011)
There is a HUGE caveat to the suggestion that has been mentioned here more than once and is posted all over the place around the web regarding copying the "good" records to a different table, doing a TRUNCATE (or DROP and reCREATE, or DROP and rename) and then repopulating the table.

You cannot do this if the table is the PK table in a PK-FK relationship (or other CONSTRAINT). Granted, you could DROP the relationship, do the clean up, and re-establish the relationship, but you would have to clean up the FK table, too. You can do that BEFORE re-establishing the relationship, which means more "down-time", or you can choose to not ENFORCE the CONSTRAINT on creation and clean up afterwards. I guess you could also clean up the FK table BEFORE you clean up the PK table. Bottom line is that you have to explicitly clean up the FK table, one way or the other.

My answer is a hybrid SET-based/quasi-CURSOR process. Another benefit of this method is that if the PK-FK relationship is setup to CASCADE DELETES you don't have to do the clean up I mention above because the server will take care of it for you. If your company/DBA discourage cascading deletes, you can ask that it be enabled only while this process is running and then disabled when it is finished. Depending on the permission levels of the account that runs the clean up, the ALTER statements to enable/disable cascading deletes can be tacked onto the beginning and the end of the SQL statement. </Edit>

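For illustration only, swapping the FK to a cascading one for the duration of the clean-up might look roughly like this against the question's schema (the constraint name is made up, and it assumes cats.id_cat is a unique/primary key, which the question notes is not true of the real schema):

-- before the clean-up: re-create the FK with cascading deletes
ALTER TABLE owner_cats DROP CONSTRAINT FK_owner_cats_cats;
ALTER TABLE owner_cats ADD CONSTRAINT FK_owner_cats_cats
    FOREIGN KEY (id_cat) REFERENCES cats (id_cat) ON DELETE CASCADE;

-- ... run the delete against the PK table (cats) here ...

-- after the clean-up: put the original, non-cascading FK back
ALTER TABLE owner_cats DROP CONSTRAINT FK_owner_cats_cats;
ALTER TABLE owner_cats ADD CONSTRAINT FK_owner_cats_cats
    FOREIGN KEY (id_cat) REFERENCES cats (id_cat);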

Answer by JohnB

Bill Karwin's answer to another question applies to my situation also:

"If your DELETEis intended to eliminate a great majority of the rows in that table, one thing that people often do is copy just the rows you want to keep to a duplicate table, and then use DROP TABLEor TRUNCATEto wipe out the original table much more quickly."

Matt in this answer says it this way:

"If offline and deleting a large %, may make sense to just build a new table with data to keep, drop the old table, and rename."

ammoQ in this answer (from the same question) recommends (paraphrased):

  • issue a table lock when deleting a large amount of rows
  • put indexes on any foreign key columns
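
For example, the table-lock suggestion can be combined with the EXISTS form used earlier (the hint is standard T-SQL; whether taking one big lock up front actually helps depends on concurrent activity):

DELETE c
  FROM cats c WITH (TABLOCKX)      -- ask for an exclusive table-level lock up front instead of millions of row locks
 WHERE EXISTS (SELECT 1
                 FROM owner_cats o
                WHERE o.id_cat = c.id_cat
                  AND o.id_owner = 1);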