即使 x 远高于所选行，使用 SELECT TOP x 的 SQL 巨大性能差异

Question

提问by Ray

I'm selecting some rows from a table valued function but have found an inexplicable massive performance difference by putting SELECT TOP in the query.

我正在从表值函数中选择一些行，但通过在查询中放置 SELECT TOP 发现了莫名其妙的巨大性能差异。

SELECT   col1, col2, col3 etc
FROM     dbo.some_table_function
WHERE    col1 = @parameter
--ORDER BY col1

is taking upwards of 5 or 6 mins to complete.

需要 5 或 6 分钟以上才能完成。

However

然而

SELECT   TOP 6000 col1, col2, col3 etc
FROM     dbo.some_table_function
WHERE    col1 = @parameter
--ORDER BY col1

completes in about 4 or 5 seconds.

在大约 4 或 5 秒内完成。

This wouldn't surprise me if the returned set of data were huge, but the particular query involved returns ~5000 rowsout of 200,000.

如果返回的数据集很大，这不会让我感到惊讶，但是所涉及的特定查询会返回200,000行中的约 5000 行。

So in both cases, the whole of the table is processed, as SQL Server continues to the end in search of 6000 rows which it will never get to. Why the massive difference then? Is this something to do with the way SQL Server allocates space in anticipation of the result set size (the TOP 6000 thereby giving it a low requirement which is more easily allocated in memory)? Has anyone else witnessed something like this?

因此，在这两种情况下，整个表都会被处理，因为 SQL Server 会继续搜索它永远不会到达的 6000 行。为什么会有巨大的差异呢？这是否与 SQL Server 分配空间以预期结果集大小的方式有关（TOP 6000 从而使其要求较低，更容易在内存中分配）？有没有其他人目睹过这样的事情？

Thanks

谢谢

Answer 1

回答by Quassnoi

Table valued functions can have a non-linear execution time.

表值函数可以具有非线性执行时间。

Let's consider function equivalent for this query:

让我们考虑这个查询的等效函数：

SELECT  (
        SELECT  SUM(mi.value)
        FROM    mytable mi
        WHERE   mi.id <= mo.id
        )
FROM    mytable mo
ORDER BY
        mo.value

This query (that calculates the running SUM) is fast at the beginning and slow at the end, since on each row from moit should sum all the preceding values which requires rewinding the rowsource.

这个查询（计算 running SUM）在开始时很快，在结束时很慢，因为它的每一行都mo应该对需要回滚行源的所有前面的值求和。

Time taken to calculate SUMfor each row increases as the row numbers increase.

SUM随着行号的增加，计算每一行所用的时间也会增加。

If you make mytablelarge enough (say, 100,000rows, as in your example) and run this query you will see that it takes considerable time.

如果您制作mytable足够大（例如，100,000行，如您的示例中所示）并运行此查询，您将看到它需要相当长的时间。

However, if you apply TOP 5000to this query you will see that it completes much faster than 1/20of the time required for the full table.

但是，如果您申请TOP 5000此查询，您将看到它完成的速度比1/20完整表所需的时间快得多。

Most probably, something similar happens in your case too.

最有可能的是，在您的情况下也会发生类似的事情。

To say something more definitely, I need to see the function definition.

更确切地说，我需要查看函数定义。

Update:

更新：

SQL Servercan push predicates into the function.

SQL Server可以将谓词推入函数中。

For instance, I just created this TVF:

例如，我刚刚创建了这个TVF：

CREATE FUNCTION fn_test()
RETURNS TABLE
AS
RETURN  (
        SELECT  *
        FROM    master
        );

These queries:

这些查询：

SELECT  *
FROM    fn_test()
WHERE   name = @name

SELECT  TOP 1000 *
FROM    fn_test()
WHERE   name = @name

yield different execution plans (the first one uses clustered scan, the second one uses an index seek with a TOP)

产生不同的执行计划（第一个使用聚集扫描，第二个使用带有的索引查找TOP）

Answer 2

回答by gbn

Your TOP has no ORDER BY, so it's simply the same as SET ROWCOUNT 6000 first. An ORDER BY would require all rows to be evaluated first, and it's would take a lot longer.

你的 TOP 没有 ORDER BY，所以它和 SET ROWCOUNT 6000 first 是一样的。ORDER BY 将要求首先评估所有行，并且需要更长的时间。

If dbo.some_table_functionis a inline table valued udf, then it's simply a macro that's expanded so it returns the first 6000 rows as mentioned in no particular order.

如果dbo.some_table_function是值 udf 的内联表，那么它只是一个扩展的宏，因此它返回前 6000 行，没有特定的顺序。

If the udf is multi valued, then it's a black box and will always pull in the full dataset before filtering. I don't think this is happening.

如果 udf 是多值的，那么它就是一个黑匣子，并且在过滤之前总是会拉入完整的数据集。我不认为这正在发生。

Not directly related, but another SO question on TVFs

不直接相关，但另一个关于 TVF 的 SO 问题

Answer 3

回答by Stefan Carlsson

I had the same problem, a simple query joining five tables returning 1000 rows took two minutes to complete. When I added "TOP 10000" to it it completed in less than one second. It turned out that the clustered index on one of the tables was heavily fragmented.

我遇到了同样的问题，连接五个表的简单查询需要两分钟才能完成，返回 1000 行。当我将“TOP 10000”添加到它时，它在不到一秒钟的时间内完成。事实证明，其中一张表上的聚集索引严重碎片化。

After rebuilding the index the query now completes in less than a second.

重建索引后，查询现在可以在不到一秒的时间内完成。

Answer 4

回答by Sven Olausson

I think Quassnois' suggestion seems very plausible. By adding TOP 6000 you are implicitly giving the optimizer a hint that a fairly small subset of the 200,000 rows are going to be returned. The optimizer then uses an index seek instead of an clustered index scan or table scan.

我认为 Quassnois 的建议似乎很有道理。通过添加 TOP 6000，您暗示优化器将返回 200,000 行中相当小的子集。然后优化器使用索引查找而不是聚集索引扫描或表扫描。

Another possible explanation could caching, as Jim davis suggests. This is fairly easy to rule out by running the queries again. Try running the one with TOP 6000 first.

另一种可能的解释是缓存，正如 Jim davis 所建议的那样。通过再次运行查询很容易排除这种情况。尝试先运行 TOP 6000。

Answer 5

回答by ericp

It's not necessarily true that the whole table is processed if col1 has an index.

如果 col1 有索引，则不一定要处理整个表。

The SQL optimization will choose whether or not to use an index. Perhaps your "TOP" is forcing it to use the index.

SQL 优化将选择是否使用索引。也许您的“TOP”迫使它使用索引。

If you are using the MSSQL Query Analyzer (The name escapes me) hit Ctrl-K. This will show the execution plan for the query instead of executing it. Mousing over the icons will show the IO/CPU usage, I believe.

如果您使用的是 MSSQL 查询分析器（这个名字让我不明白），请按 Ctrl-K。这将显示查询的执行计划而不是执行它。我相信，将鼠标悬停在图标上会显示 IO/CPU 使用率。

I bet one is using an index seek, while the other isn't.

我敢打赌，一个正在使用索引查找，而另一个则没有。

If you have a generic client: SET SHOWPLAN_ALL ON; GO select ...; go

如果您有通用客户端：SET SHOWPLAN_ALL ON; 去选择...; 走

see http://msdn.microsoft.com/en-us/library/ms187735.aspxfor details.

有关详细信息，请参阅http://msdn.microsoft.com/en-us/library/ms187735.aspx。

Answer 6

回答by Jim Davis

You may be running into something as simple as caching here - perhaps (for whatever reason) the "TOP" query is cached? Using an index that the other isn't?

您可能会遇到像缓存这样简单的事情——也许（出于某种原因）“TOP”查询被缓存了？使用另一个不是的索引？

In any case the best way to quench your curiosity is to examine the full execution plan for both queries. You can do this right in SQL Management Console and it'll tell you EXACTLY what operations are being completed and how long each is predicted to take.

在任何情况下，消除好奇心的最佳方法是检查两个查询的完整执行计划。您可以在 SQL 管理控制台中正确执行此操作，它会准确地告诉您正在完成哪些操作以及每个操作预计需要多长时间。

All SQL implementations are quirky in their own way - SQL Server's no exception. These kind of "whaaaaaa?!" moments are pretty common. ;^)

所有 SQL 实现都有自己的奇特之处——SQL Server 也不例外。这些“哇哇哇？！” 时刻很常见。;^)

即使 x 远高于所选行，使用 SELECT TOP x 的 SQL 巨大性能差异

提问by Ray

回答by Quassnoi

回答by gbn

回答by Stefan Carlsson

回答by Sven Olausson

回答by ericp

回答by Jim Davis

相关推荐

最近更新

标签

即使 x 远高于所选行，使用 SELECT TOP x 的 SQL 巨大性能差异

提问by Ray

回答by Quassnoi

回答by gbn

回答by Stefan Carlsson

回答by Sven Olausson

回答by ericp

回答by Jim Davis

相关推荐

SQL 如何在 Oracle 中创建唯一索引但忽略空值？

SQL MSSQL 正则表达式

SQL 如何在 Access 中使用子查询执行更新查询？

SQL CREATE DATABASE 或 ALTER DATABASE 失败，因为产生的累积数据库大小将超过每个数据库 10240 MB 的许可限制

相关推荐

最近更新

标签