Fastest way to do a mass update in SQL

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/2954629/

Date: 2020-09-01 06:25:49  Source: igfitidea

Fastest way to do mass update

Tags: sql, sql-server, tsql

Asked by user356004

Let's say you have a table with about 5 million records and an nvarchar(max) column populated with large text data. You want to set this column to NULL if SomeOtherColumn = 1 in the fastest possible way.

The brute-force UPDATE does not work very well here because it creates one large implicit transaction and takes forever.

Doing updates in small batches of 50K records at a time works, but it still takes 47 hours to complete on a beefy 32-core/64 GB server.

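For reference, the batching approach looks roughly like this: a minimal sketch, where dbo.BigTable, LargeText, and the batch size are placeholder assumptions (only SomeOtherColumn is from the question).

DECLARE @rows int;
SET @rows = 1;

WHILE @rows > 0
BEGIN
    /* Clear the large column in fixed-size chunks so each implicit
       transaction (and its log usage) stays small */
    UPDATE TOP (50000) dbo.BigTable
    SET LargeText = NULL
    WHERE SomeOtherColumn = 1
      AND LargeText IS NOT NULL;   /* skip rows already cleared so the loop terminates */

    SET @rows = @@ROWCOUNT;

    CHECKPOINT;   /* under the SIMPLE recovery model, lets log space be reused between batches */
END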

Is there any way to do this update faster? Are there any magic query hints / table options that sacrifice something else (like concurrency) in exchange for speed?

NOTE: Creating a temp table or temp column is not an option, because this nvarchar(max) column involves lots of data and so consumes lots of space!

PS: Yes, SomeOtherColumn is already indexed.

Answered by Cobusve

From everything I can see, it does not look like your problems are related to indexes.

The key seems to be in the fact that your nvarchar(max) field contains "lots" of data. Think about what SQL has to do in order to perform this update.

Since the column you are updating likely contains more than 8,000 characters, it is stored off-page, which implies additional effort to read this column when it is not NULL.

When you run a batch of 50,000 updates, SQL has to place this in an implicit transaction in order to make it possible to roll back in case of any problems. In order to roll back, it has to store the original value of the column in the transaction log.

Assuming (for simplicity's sake) that each column value contains on average 10,000 bytes of data, 50,000 rows will contain around 500 MB of data, which has to be stored temporarily (in simple recovery mode) or permanently (in full recovery mode).

There is no way to disable logging, as that would compromise database integrity.

I ran a quick test on my dog-slow desktop, and batches of even 10,000 became prohibitively slow, but bringing the size down to 1,000 rows, which implies a temporary log size of around 10 MB, worked nicely.

I loaded a table with 350,000 rows and marked 50,000 of them for update. This completed in around 4 minutes, and since it scales linearly, the entire 5 million rows should take around 6 hours on my single-processor, 2 GB desktop, so I would expect something much better on your beefy server backed by a SAN or the like.

You may want to run your update statement as a SELECT first, selecting only the primary key and the large nvarchar column, to ensure it runs as fast as you expect.

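For example, a dry run along these lines (table and column names are placeholders) measures the read cost without changing anything:

SELECT Id, LargeText
FROM dbo.BigTable
WHERE SomeOtherColumn = 1;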

Of course, the bottleneck may be other users locking things, or contention for storage or memory on the server, but since you did not mention other users I will assume you have the DB in single-user mode for this.

As an optimization, you should ensure that the transaction logs are on a different physical disk/disk group than the data, to minimize seek times.

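A quick way to check where the data and log files currently live (the database name is a placeholder):

SELECT name, physical_name, type_desc
FROM sys.master_files
WHERE database_id = DB_ID(N'YourDb');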

Answered by Zachary Scott

Hopefully you have already dropped any indexes on the column you are setting to NULL, including full-text indexes. As mentioned before, temporarily turning off transactions and the log file would do the trick. Backing up your data will usually truncate your log files too.

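Strictly speaking, under the FULL recovery model it is a log backup (rather than a data backup) that frees log space for reuse; a sketch, with a placeholder database name and path:

BACKUP LOG [YourDb] TO DISK = N'D:\Backups\YourDb_log.trn';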

Answered by Mitch Wheat

You could set the database recovery model to SIMPLE to reduce logging, BUT do not do this without considering the full implications for a production environment.

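If you do go this route, a sketch (YourDb is a placeholder; switching to SIMPLE breaks the log backup chain, so take a full backup afterwards):

ALTER DATABASE [YourDb] SET RECOVERY SIMPLE;

/* ... run the mass update in batches ... */

ALTER DATABASE [YourDb] SET RECOVERY FULL;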

What indexes are in place on the table? Given that batch updates of approximately 50,000 rows take so long, I would say you require an index.

Answered by John Hartsock

Have you tried placing an index or statistics on someOtherColumn?

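For instance, either of the following (the object names are placeholders):

/* A nonclustered index on the flag column ... */
CREATE NONCLUSTERED INDEX IX_BigTable_SomeOtherColumn
ON dbo.BigTable (SomeOtherColumn);

/* ... or just column statistics, which are cheaper to create */
CREATE STATISTICS ST_BigTable_SomeOtherColumn
ON dbo.BigTable (SomeOtherColumn);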

Answered by user347594

This really helped me. I went from 2 hours to 20 minutes with this.

/* I'm using the Simple database recovery model */
/* Update table statistics */

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

/* Your 50K update goes here, just to have a measure of the time it will take */

SET TRANSACTION ISOLATION LEVEL READ COMMITTED;

In my experience, working in MSSQL 2005, moving 4 million 46-byte records every day (automatically) from a table in one database to a table in a different database takes around 20 minutes on a quad-core, 8 GB, 2 GHz server, and it doesn't hurt application performance (no nvarchar(max) involved, though). By moving I mean INSERT INTO SELECT and then DELETE. CPU usage never goes over 30%, even when the table being deleted from has 28 million records and constantly receives around 4,000 inserts per minute (but no updates). Well, that's my case; it may vary depending on your server load.

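The move pattern described above looks roughly like this; the database, table, and column names and the cutoff are placeholder assumptions:

DECLARE @cutoff datetime;
SET @cutoff = DATEADD(day, -1, GETDATE());

/* Copy the batch to the other database ... */
INSERT INTO ArchiveDb.dbo.Records (Id, Payload, CreatedOn)
SELECT Id, Payload, CreatedOn
FROM LiveDb.dbo.Records
WHERE CreatedOn < @cutoff;

/* ... then remove it from the source */
DELETE FROM LiveDb.dbo.Records
WHERE CreatedOn < @cutoff;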

READ UNCOMMITTED

"Specifies that statements (your updates) can read rows that have been modified by other transactions but not yet committed." In my case, the records are readonly.

I don't know what rg-tsql means, but here you'll find info about transaction isolation levels in MSSQL.

Answered by Mike Cellini

Try indexing 'SomeOtherColumn'... 50K records should update in a snap. If there is already an index in place, see whether the index needs to be reorganized and whether statistics have been collected for it.

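For example (the index and table names are placeholders):

ALTER INDEX IX_BigTable_SomeOtherColumn ON dbo.BigTable REORGANIZE;
UPDATE STATISTICS dbo.BigTable IX_BigTable_SomeOtherColumn;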

Answered by Dr. belisarius

If you are running a production environment without enough space to duplicate all your tables, I believe you are headed for trouble sooner or later.

If you provide some info about the number of rows with SomeOtherColumn = 1, perhaps we can think of another way, but I suggest:

0) Back up your table
1) Index the flag column
2) Set the table option to "no log transactions" ... if possible
3) Write a stored procedure to run the updates (see the sketch below)

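A sketch of step 3, wrapping the earlier batch-update loop in a stored procedure; the procedure, table, and column names and the default batch size are placeholder assumptions:

CREATE PROCEDURE dbo.usp_ClearLargeText
    @BatchSize int = 1000
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @rows int;
    SET @rows = 1;

    WHILE @rows > 0
    BEGIN
        /* NULL out the large column in small chunks to keep log usage bounded */
        UPDATE TOP (@BatchSize) dbo.BigTable
        SET LargeText = NULL
        WHERE SomeOtherColumn = 1
          AND LargeText IS NOT NULL;

        SET @rows = @@ROWCOUNT;
    END
END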