SQL 为什么基于关系集的查询比游标更好?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/24168/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-31 23:09:45  来源:igfitidea点击:

Why are relational set-based queries better than cursors?

sqllanguage-agnosticcursor

提问by Eric Z Beard

When writing database queries in something like TSQL or PLSQL, we often have a choice of iterating over rows with a cursor to accomplish the task, or crafting a single SQL statement that does the same job all at once.

在用 TSQL 或 PLSQL 之类的东西编写数据库查询时,我们通常可以选择使用游标遍历行来完成任务,或者编写一条 SQL 语句来一次性完成相同的工作。

Also, we have the choice of simply pulling a large set of data back into our application and then processing it row by row, with C# or Java or PHP or whatever.

此外,我们可以选择简单地将大量数据拉回我们的应用程序,然后使用 C#、Java 或 PHP 或其他方式逐行处理它。

Why is it better to use set-based queries? What is the theory behind this choice? What is a good example of a cursor-based solution and its relational equivalent?

为什么使用基于集合的查询更好?这个选择背后的理论是什么?什么是基于游标的解决方案及其等价关系的好例子?

回答by Matt Hamilton

The main reason that I'm aware of is that set-based operations can be optimised by the engine by running them across multiple threads. For example, think of a quicksort - you can separate the list you're sorting into multiple "chunks" and sort each separately in their own thread. SQL engines can do similar things with huge amounts of data in one set-based query.

我知道的主要原因是引擎可以通过跨多个线程运行它们来优化基于集合的操作。例如,考虑快速排序 - 您可以将要排序的列表分成多个“块”,并在它们自己的线程中分别对每个块进行排序。SQL 引擎可以在一个基于集合的查询中对大量数据执行类似的操作。

When you perform cursor-based operations, the engine can only run sequentially and the operation has to be single threaded.

执行基于游标的操作时,引擎只能顺序运行,并且操作必须是单线程的。

回答by Matt Rogish

In addition to the above "let the DBMS do the work" (which is a great solution), there are a couple other good reasons to leave the query in the DBMS:

除了上述“让 DBMS 完成工作”(这是一个很好的解决方案)之外,还有其他几个很好的理由将查询留在 DBMS 中:

  • It's (subjectively) easier to read.When looking at the code later, would you rather try and parse a complex stored procedure (or client-side code) with loops and things, or would you rather look at a concise SQL statement?
  • It avoids network round trips.Why shove all that data to the client and then shove more back? Why thrash the network if you don't need to?
  • It's wasteful.Your DBMS and app server(s) will need to buffer some/all of that data to work on it. If you don't have infinite memory you'll likely page out other data; why kick out possibly important things from memory to buffer a result set that is mostly useless?
  • Why wouldn't you?You bought (or are otherwise using) a highly reliable, very fast DBMS. Why wouldn't you use it?
  • 它(主观上)更容易阅读。稍后查看代码时,您是更愿意尝试使用循环和事物解析复杂的存储过程(或客户端代码),还是更愿意查看简洁的 SQL 语句?
  • 它避免了网络往返。为什么将所有数据推送给客户端,然后推送更多数据?如果不需要,为什么要颠簸网络?
  • 很浪费。您的 DBMS 和应用程序服务器将需要缓冲部分/所有数据才能对其进行处理。如果您没有无限内存,您可能会调出其他数据;为什么从内存中剔除可能重要的东西来缓冲一个几乎没用的结果集?
  • 你为什么不呢?您购买(或正在使用)高度可靠、速度非常快的 DBMS。你为什么不使用它?

回答by Mark Brackett

Set based queries are (usually) faster because:

基于集合的查询(通常)更快,因为:

  1. They have more information for the query optimizer to optimize
  2. They can batch reads from disk
  3. There's less logging involved for rollbacks, transaction logs, etc.
  4. Less locks are taken, which decreases overhead
  5. Set based logic is the focus of RDBMSs, so they've been heavily optimized for it (often, at the expense of procedural performance)
  1. 他们有更多信息供查询优化器优化
  2. 他们可以从磁盘批量读取
  3. 回滚、事务日志等涉及的日志记录较少。
  4. 使用较少的锁,从而减少开销
  5. 基于集合的逻辑是 RDBMS 的重点,因此它们已经对其进行了大量优化(通常以牺牲过程性能为代价)

Pulling data out to the middle tier to process it can be useful, though, because it removes the processing overhead off the DB server (which is the hardest thing to scale, and is normally doing other things as well). Also, you normally don't have the same overheads (or benefits) in the middle tier. Things like transactional logging, built-in locking and blocking, etc. - sometimes these are necessary and useful, other times they're just a waste of resources.

不过,将数据拉出到中间层进行处理可能很有用,因为它消除了数据库服务器的处理开销(这是最难扩展的事情,通常还会做其他事情)。此外,您通常在中间层没有相同的开销(或收益)。诸如事务日志、内置锁定和阻塞等之类的东西 - 有时这些是必要且有用的,有时它们只是浪费资源。

A simple cursor with procedural logic vs. set based example (T-SQL) that will assign an area code based on the telephone exchange:

具有程序逻辑的简单游标与基于集合的示例 (T-SQL) 将根据电话交换机分配区号:

--Cursor
DECLARE @phoneNumber char(7)
DECLARE c CURSOR LOCAL FAST_FORWARD FOR
   SELECT PhoneNumber FROM Customer WHERE AreaCode IS NULL
OPEN c
FETCH NEXT FROM c INTO @phoneNumber
WHILE @@FETCH_STATUS = 0 BEGIN
   DECLARE @exchange char(3), @areaCode char(3)
   SELECT @exchange = LEFT(@phoneNumber, 3)

   SELECT @areaCode = AreaCode 
   FROM AreaCode_Exchange 
   WHERE Exchange = @exchange

   IF @areaCode IS NOT NULL BEGIN
       UPDATE Customer SET AreaCode = @areaCode
       WHERE CURRENT OF c
   END
   FETCH NEXT FROM c INTO @phoneNumber
END
CLOSE c
DEALLOCATE c
END

--Set
UPDATE Customer SET
    AreaCode = AreaCode_Exchange.AreaCode
FROM Customer
JOIN AreaCode_Exchange ON
    LEFT(Customer.PhoneNumber, 3) = AreaCode_Exchange.Exchange
WHERE
    Customer.AreaCode IS NULL

回答by HLGEM

You wanted some real-life examples. My company had a cursor that took over 40 minutes to process 30,000 records (and there were times when I needed to update over 200,000 records). It took 45 second to do the same task without the cursor. In another case I removed a cursor and sent the processing time from over 24 hours to less than a minute. One was an insert using the values clause instead of a select and the other was an update that used variables instead of a join. A good rule of thumb is that if it is an insert, update, or delete, you should look for a set-based way to perform the task.

你想要一些现实生活中的例子。我的公司有一个游标需要 40 多分钟才能处理 30,000 条记录(有时我需要更新超过 200,000 条记录)。在没有光标的情况下完成相同的任务需要 45 秒。在另一种情况下,我删除了一个游标并将处理时间从超过 24 小时缩短到不到一分钟。一个是使用 values 子句而不是选择的插入,另一个是使用变量而不是连接的更新。一个好的经验法则是,如果是插入、更新或删除,您应该寻找基于集合的方式来执行任务。

Cursors have their uses (or the code wouldn't be their in the first place), but they should be extremely rare when querying a relational database (Except Oracle which is optimized to use them). One place where they can be faster is when doing calculations based on the value of the preceeding record (running totals). BUt even that should be tested.

游标有它们的用途(或者代码一开始就不是它们的用途),但是在查询关系数据库时它们应该是非常罕见的(Oracle 除外,它经过优化以使用它们)。它们可以更快的一个地方是根据前一记录(运行总计)的值进行计算。但即使这样也应该进行测试。

Another limited case of using a cursor is to do some batch processing. If you are trying to do too much at once in set-based fashion it can lock the table to other users. If you havea truly large set, it may be best to break it up into smaller set-based inserts, updates or deletes that will not hold the lock too long and then run through the sets using a cursor.

使用游标的另一个有限情况是进行一些批处理。如果您试图以基于集合的方式一次做太多事情,它可能会将表锁定给其他用户。如果你有一个真正大的集合,最好将它分解成较小的基于集合的插入、更新或删除,这些插入、更新或删除不会长时间保持锁定,然后使用游标运行这些集合。

A third use of a cursor is to run system stored procs through a group of input values. SInce this is limited to a generally small set and no one should mess with the system procs, this is an acceptable thing for an adminstrator to do. I do not recommend doing the same thing with a user created stored proc in order to process a large batch and to re-use code. It is better to write a set-based version that will be a better performer as performance should trump code reuse in most cases.

游标的第三种用途是通过一组输入值运行系统存储过程。由于这仅限于一个通常很小的集合并且没有人应该弄乱系统过程,因此对于管理员来说这是可以接受的事情。我不建议对用户创建的存储过程做同样的事情,以便处理大批量和重用代码。最好编写一个基于集合的版本,因为在大多数情况下,性能应该胜过代码重用。

回答by kemiller2002

I think the real answer is, like all approaches in programming, that it depends on which one is better. Generally, a set based language is going to be more efficient, because that is what it was designed to do. There are two places where a cursor is at an advantage:

我认为真正的答案是,就像编程中的所有方法一样,这取决于哪一种更好。一般来说,基于集合的语言会更有效率,因为这就是它的设计目的。游标有两个优势:

  1. You are updating a large data set in a database where locking rows is not acceptable (during production hours maybe). A set based update has a possibility of locking a table for several seconds (or minutes), where a cursor (if written correctly) does not. The cursor can meander through the rows updating one at a time and you don't have to worry about affecting anything else.

  2. The advantage to using SQL is that the bulk of the work for optimization is handled by the database engine in most circumstances. With the enterprise class db engines the designers have gone to painstaking lengths to make sure the system is efficient at handling data. The drawback is that SQL is a set based language. You have to be able to define a set of data to use it. Although this sounds easy, in some circumstances it is not. A query can be so complex that the internal optimizers in the engine can't effectively create an execution path, and guess what happens... your super powerful box with 32 processors uses a single thread to execute the query because it doesn't know how to do anything else, so you waste processor time on the database server which generally there is only one of as opposed to multiple application servers (so back to reason 1, you run into resource contentions with other things needing to run on the database server). With a row based language (C#, PHP, JAVA etc.), you have more control as to what happens. You can retrieve a data set and force it to execute the way you want it to. (Separate the data set out to run on multiple threads etc). Most of the time, it still isn't going to be efficient as running it on the database engine, because it will still have to access the engine to update the row, but when you have to do 1000+ calculations to update a row (and lets say you have a million rows), a database server can start to have problems.

  1. 您正在更新数据库中不可接受锁定行的大型数据集(可能在生产时间)。基于集合的更新可能会将表锁定几秒钟(或几分钟),而游标(如果写入正确)则不会。游标可以在行中蜿蜒而行,一次更新一个,您不必担心会影响其他任何内容。

  2. 使用 SQL 的优点是大部分优化工作在大多数情况下由数据库引擎处理。使用企业级数据库引擎,设计人员煞费苦心地确保系统处理数据的效率。缺点是 SQL 是一种基于集合的语言。您必须能够定义一组数据才能使用它。尽管这听起来很容易,但在某些情况下却并非如此。查询可能非常复杂,以至于引擎中的内部优化器无法有效地创建执行路径,然后猜猜会发生什么……您拥有 32 个处理器的超级强大的机器使用单个线程来执行查询,因为它不知道怎么做别的,因此,您会在数据库服务器上浪费处理器时间,而数据库服务器通常只有一个而不是多个应用程序服务器(因此回到原因 1,您会遇到需要在数据库服务器上运行的其他事物的资源争用)。使用基于行的语言(C#、PHP、JAVA 等),您可以更好地控制发生的情况。您可以检索数据集并强制它按照您希望的方式执行。(将数据集分开以在多个线程上运行等)。大多数情况下,它仍然不会像在数据库引擎上运行那样高效,因为它仍然需要访问引擎来更新行,但是当您必须进行 1000 多次计算来更新行时(假设您有一百万行),数据库服务器可能会开始出现问题。您会遇到需要在数据库服务器上运行的其他内容的资源争用)。使用基于行的语言(C#、PHP、JAVA 等),您可以更好地控制发生的情况。您可以检索数据集并强制它按照您希望的方式执行。(将数据集分开以在多个线程上运行等)。大多数情况下,它仍然不会像在数据库引擎上运行那样高效,因为它仍然需要访问引擎来更新行,但是当您必须进行 1000 多次计算来更新行时(假设您有一百万行),数据库服务器可能会开始出现问题。您会遇到需要在数据库服务器上运行的其他内容的资源争用)。使用基于行的语言(C#、PHP、JAVA 等),您可以更好地控制发生的情况。您可以检索数据集并强制它按照您希望的方式执行。(将数据集分开以在多个线程上运行等)。大多数情况下,它仍然不会像在数据库引擎上运行那样高效,因为它仍然需要访问引擎来更新行,但是当您必须进行 1000 多次计算来更新行时(假设您有一百万行),数据库服务器可能会开始出现问题。(将数据集分开以在多个线程上运行等)。大多数时候,它仍然不会像在数据库引擎上运行那样高效,因为它仍然需要访问引擎来更新行,但是当你必须进行 1000 多次计算来更新行时(假设您有一百万行),数据库服务器可能会开始出现问题。(将数据集分开以在多个线程上运行等)。大多数情况下,它仍然不会像在数据库引擎上运行那样高效,因为它仍然需要访问引擎来更新行,但是当您必须进行 1000 多次计算来更新行时(假设您有一百万行),数据库服务器可能会开始出现问题。

回答by dkretz

I think it comes down to using the database is was designed to be used. Relational database servers are specifically developed and optimized to respond best to questions expressed in set logic.

我认为这归结为使用数据库是为了使用而设计的。关系数据库服务器经过专门开发和优化,以最好地响应集合逻辑中表达的问题。

Functionally, the penalty for cursors will vary hugely from product to product. Some (most?) rdbmss are built at least partially on top of isam engines. If the question is appropriate, and the veneer thin enough, it might in fact be as efficient to use a cursor. But that's one of the things you should become intimately familiar with, in terms of your brand of dbms, before trying it.

在功能上,游标的惩罚会因产品而异。一些(大多数?)rdbmss 至少部分构建在 isam 引擎之上。如果问题是合适的,并且单板足够薄,实际上使用游标可能同样有效。但在尝试之前,就您的 dbms 品牌而言,这是您应该非常熟悉的事情之一。

回答by Cervo

As has been said, the database is optimized for set operations. Literally engineers sat down and debugged/tuned that database for long periods of time. The chances of you out optimizing them are pretty slim. There are all sorts of fun tricks you can play with if you have a set of data to work with like batching disk reads/writes together, caching, multi-threading. Also some operations have a high overhead cost but if you do it to a bunch of data at once the cost per piece of data is low. If you are only working one row at a time, a lot of these methods and operations just can't happen.

如前所述,数据库针对集合操作进行了优化。从字面上看,工程师坐下来调试/调整该数据库很长一段时间。你优化它们的机会非常渺茫。如果您有一组要处理的数据,例如将磁盘读取/写入批处理、缓存、多线程,则可以使用各种有趣的技巧。此外,某些操作的开销成本很高,但如果您一次对一堆数据执行此操作,则每条数据的成本很低。如果您一次只处理一行,那么很多这些方法和操作就无法发生。

For example, just look at the way the database joins. By looking at explain plans you can see several ways of doing joins. Most likely with a cursor you go row by row in one table and then select values you need from another table. Basically it's like a nested loop only without the tightness of the loop (which is most likely compiled into machine language and super optimized). SQL Server on its own has a whole bunch of ways of joining. If the rows are sorted, it will use some type of merge algorithm, if one table is small, it may turn one table into a hash lookup table and do the join by performing O(1) lookups from one table into the lookup table. There are a number of join strategies that many DBMS have that will beat you looking up values from one table in a cursor.

比如看数据库的join方式。通过查看解释计划,您可以看到几种执行联接的方法。最有可能使用游标在一个表中逐行移动,然后从另一个表中选择您需要的值。基本上它就像一个嵌套循环,只是没有循环的紧密性(这很可能被编译成机器语言并进行了超级优化)。SQL Server 本身有很多加入方式。如果行被排序,它将使用某种类型的合并算法,如果一个表很小,它可能会将一个表变成一个哈希查找表,并通过从一个表到查找表执行 O(1) 查找来进行连接。许多 DBMS 都有许多连接策略,它们可以击败您从游标中的一个表中查找值。

Just look at the example of creating a hash lookup table. To build the table is probably m operations if you are joining two tables one of length n and one of length m where m is the smaller table. Each lookup should be constant time, so that is n operations. so basically the efficiency of a hash join is around m (setup) + n (lookups). If you do it yourself and assuming no lookups/indexes, then for each of the n rows you will have to search m records (on average it equates to m/2 searches). So basically the level of operations goes from m + n (joining a bunch of records at once) to m * n / 2 (doing lookups through a cursor). Also the operations are simplifications. Depending upon the cursor type, fetching each row of a cursor may be the same as doing another select from the first table.

只需看一下创建哈希查找表的示例。如果要连接两个长度为 n 的表和长度为 m 的表,其中 m 是较小的表,则构建表可能是 m 次操作。每次查找都应该是常数时间,所以是 n 次操作。所以基本上散列连接的效率大约是 m(设置)+ n(查找)。如果您自己进行并假设没有查找/索引,那么对于 n 行中的每一行,您将不得不搜索 m 条记录(平均而言,它相当于 m/2 次搜索)。所以基本上操作级别从 m + n (一次加入一堆记录)到 m * n / 2 (通过游标进行查找)。操作也是简化。根据游标类型,获取游标的每一行可能与从第一个表中进行另一个选择相同。

Locks also kill you. If you have cursors on a table you are locking up rows (in SQL server this is less severe for static and forward_only cursors...but the majority of cursor code I see just opens a cursor without specifying any of these options). If you do the operation in a set, the rows will still be locked up but for a lesser amount of time. Also the optimizer can see what you are doing and it may decide it is more efficient to lock the whole table instead of a bunch of rows or pages. But if you go line by line the optimizer has no idea.

锁也会杀了你。如果您在表上有游标,您将锁定行(在 SQL Server 中,这对于静态游标和 forward_only 游标来说不太严重……但是我看到的大多数游标代码只是打开一个游标而不指定任何这些选项)。如果您在一组中执行操作,行仍将被锁定,但时间较短。优化器也可以看到你在做什么,它可能会决定锁定整个表而不是一堆行或页更有效。但是如果你一行一行地进行,优化器就不知道了。

The other thing is I have heard that in Oracle's case it is super optimized to do cursor operations so it's nowhere near the same penalty for set based operations versus cursors in Oracle as it is in SQL Server. I'm not an Oracle expert so I can't say for sure. But more than one Oracle person has told me that cursors are way more efficient in Oracle. So if you sacrificed your firstborn son for Oracle you may not have to worry about cursors, consult your local highly paid Oracle DBA :)

另一件事是我听说在 Oracle 的情况下,它对游标操作进行了超级优化,因此对于基于集合的操作与 Oracle 中的游标相比,它远不及 SQL Server 中的相同惩罚。我不是 Oracle 专家,所以我不能肯定。但是不止一位 Oracle 人员告诉我,游标在 Oracle 中的效率更高。因此,如果您为 Oracle 牺牲了您的长子,您可能不必担心游标,请咨询您当地的高薪 Oracle DBA :)

回答by Cervo

The REAL answer is go get one of E.F. Codd's books and brush up on relational algebra. Then get a good book on Big O notation. After nearly two decades in IT this is, IMHO, one of the big tragedies of the modern MIS or CS degree: Very few actually study computation. You know...the "compute" part of "computer"? Structured Query Language (and all its supersets) is merely a practical application of relational algebra. Yes, the RDBMS have optimized memory management and read/write but the same could be said for procedural languages. As I read it, the original question is not about the IDE, the software, but rather about the efficiency of one method of computation vs. another.

真正的答案是去读一本EF Codd的书并复习关系代数。然后找一本关于Big O 符号的好书。在 IT 工作近 20 年后,恕我直言,这是现代 MIS 或 CS 学位的一大悲剧:实际上很少有人研究计算。你知道……“计算机”的“计算”部分吗?结构化查询语言(及其所有超集)只是关系代数的实际应用。是的,RDBMS 优化了内存管理和读/写,但对于过程语言也可以这样说。当我读到它时,最初的问题不是关于 IDE、软件,而是关于一种计算方法与另一种计算方法的效率。

Even a quick familiarization with Big O notation will begin to shed light on why, when dealing with sets of data, iteration is more expensive than a declarative statement.

即使快速熟悉 Big O 符号也会开始阐明为什么在处理数据集时迭代比声明性语句更昂贵。

回答by J. Polfer

Simply put, in most cases, it's faster/easier to let the database do it for you.

简而言之,在大多数情况下,让数据库为您完成它会更快/更容易。

The database's purpose in life is to store/retrieve/manipulate data in set formats and to be really fast. Your VB.NET/ASP.NET code is likely nowhere near as fast as a dedicated database engine. Leveraging this is a wise use of resources.

数据库在生活中的目的是以集合格式存储/检索/操作数据并且非常快。您的 VB.NET/ASP.NET 代码可能远不如专用数据库引擎那么快。利用这一点是对资源的明智利用。

回答by Anders Eurenius

The idea behind preferring to do the work in queries is that the database engine can optimize by reformulating it. That's also why you'd want to run EXPLAIN on your query, to see what the db is actuallydoing. (e.g. taking advantage of indices, table sizes and sometimes even knowledge about the distributions of values in columns.)

更喜欢在查询中完成工作背后的想法是数据库引擎可以通过重新制定它来优化。这也是为什么您要对查询运行 EXPLAIN 以查看数据库实际在做什么。(例如,利用索引、表大小,有时甚至了解列中值的分布。)

That said, to get good performance in your actual concrete case, you may have to bend or break rules.

也就是说,要在实际的具体案例中获得良好的性能,您可能不得不弯曲或违反规则。

Oh, another reason might be constraints: Incrementing a unique column by one might be okay if constraints are checked after allthe updates, but generates a collision if done one-by-one.

哦,另一个原因可能是约束:如果在所有更新后检查约束,则将唯一列递增一个可能没问题,但如果逐一完成,则会产生冲突。