即使 x 远高于所选行,使用 SELECT TOP x 的 SQL 巨大性能差异
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/1393508/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
SQL massive performance difference using SELECT TOP x even when x is much higher than selected rows
提问by Ray
I'm selecting some rows from a table valued function but have found an inexplicable massive performance difference by putting SELECT TOP in the query.
我正在从表值函数中选择一些行,但通过在查询中放置 SELECT TOP 发现了莫名其妙的巨大性能差异。
SELECT col1, col2, col3 etc
FROM dbo.some_table_function
WHERE col1 = @parameter
--ORDER BY col1
is taking upwards of 5 or 6 mins to complete.
需要 5 或 6 分钟以上才能完成。
However
然而
SELECT TOP 6000 col1, col2, col3 etc
FROM dbo.some_table_function
WHERE col1 = @parameter
--ORDER BY col1
completes in about 4 or 5 seconds.
在大约 4 或 5 秒内完成。
This wouldn't surprise me if the returned set of data were huge, but the particular query involved returns ~5000 rowsout of 200,000.
如果返回的数据集很大,这不会让我感到惊讶,但是所涉及的特定查询会返回200,000行中的约 5000 行。
So in both cases, the whole of the table is processed, as SQL Server continues to the end in search of 6000 rows which it will never get to. Why the massive difference then? Is this something to do with the way SQL Server allocates space in anticipation of the result set size (the TOP 6000 thereby giving it a low requirement which is more easily allocated in memory)? Has anyone else witnessed something like this?
因此,在这两种情况下,整个表都会被处理,因为 SQL Server 会继续搜索它永远不会到达的 6000 行。为什么会有巨大的差异呢?这是否与 SQL Server 分配空间以预期结果集大小的方式有关(TOP 6000 从而使其要求较低,更容易在内存中分配)?有没有其他人目睹过这样的事情?
Thanks
谢谢
回答by Quassnoi
Table valued functions can have a non-linear execution time.
表值函数可以具有非线性执行时间。
Let's consider function equivalent for this query:
让我们考虑这个查询的等效函数:
SELECT (
SELECT SUM(mi.value)
FROM mytable mi
WHERE mi.id <= mo.id
)
FROM mytable mo
ORDER BY
mo.value
This query (that calculates the running SUM
) is fast at the beginning and slow at the end, since on each row from mo
it should sum all the preceding values which requires rewinding the rowsource.
这个查询(计算 running SUM
)在开始时很快,在结束时很慢,因为它的每一行都mo
应该对需要回滚行源的所有前面的值求和。
Time taken to calculate SUM
for each row increases as the row numbers increase.
SUM
随着行号的增加,计算每一行所用的时间也会增加。
If you make mytable
large enough (say, 100,000
rows, as in your example) and run this query you will see that it takes considerable time.
如果您制作mytable
足够大(例如,100,000
行,如您的示例中所示)并运行此查询,您将看到它需要相当长的时间。
However, if you apply TOP 5000
to this query you will see that it completes much faster than 1/20
of the time required for the full table.
但是,如果您申请TOP 5000
此查询,您将看到它完成的速度比1/20
完整表所需的时间快得多。
Most probably, something similar happens in your case too.
最有可能的是,在您的情况下也会发生类似的事情。
To say something more definitely, I need to see the function definition.
更确切地说,我需要查看函数定义。
Update:
更新:
SQL Server
can push predicates into the function.
SQL Server
可以将谓词推入函数中。
For instance, I just created this TVF
:
例如,我刚刚创建了这个TVF
:
CREATE FUNCTION fn_test()
RETURNS TABLE
AS
RETURN (
SELECT *
FROM master
);
These queries:
这些查询:
SELECT *
FROM fn_test()
WHERE name = @name
SELECT TOP 1000 *
FROM fn_test()
WHERE name = @name
yield different execution plans (the first one uses clustered scan, the second one uses an index seek with a TOP
)
产生不同的执行计划(第一个使用聚集扫描,第二个使用带有 的索引查找TOP
)
回答by gbn
Your TOP has no ORDER BY, so it's simply the same as SET ROWCOUNT 6000 first. An ORDER BY would require all rows to be evaluated first, and it's would take a lot longer.
你的 TOP 没有 ORDER BY,所以它和 SET ROWCOUNT 6000 first 是一样的。ORDER BY 将要求首先评估所有行,并且需要更长的时间。
If dbo.some_table_function
is a inline table valued udf, then it's simply a macro that's expanded so it returns the first 6000 rows as mentioned in no particular order.
如果dbo.some_table_function
是值 udf 的内联表,那么它只是一个扩展的宏,因此它返回前 6000 行,没有特定的顺序。
If the udf is multi valued, then it's a black box and will always pull in the full dataset before filtering. I don't think this is happening.
如果 udf 是多值的,那么它就是一个黑匣子,并且在过滤之前总是会拉入完整的数据集。我不认为这正在发生。
Not directly related, but another SO question on TVFs
不直接相关,但另一个关于 TVF 的 SO 问题
回答by Stefan Carlsson
I had the same problem, a simple query joining five tables returning 1000 rows took two minutes to complete. When I added "TOP 10000" to it it completed in less than one second. It turned out that the clustered index on one of the tables was heavily fragmented.
我遇到了同样的问题,连接五个表的简单查询需要两分钟才能完成,返回 1000 行。当我将“TOP 10000”添加到它时,它在不到一秒钟的时间内完成。事实证明,其中一张表上的聚集索引严重碎片化。
After rebuilding the index the query now completes in less than a second.
重建索引后,查询现在可以在不到一秒的时间内完成。
回答by Sven Olausson
I think Quassnois' suggestion seems very plausible. By adding TOP 6000 you are implicitly giving the optimizer a hint that a fairly small subset of the 200,000 rows are going to be returned. The optimizer then uses an index seek instead of an clustered index scan or table scan.
我认为 Quassnois 的建议似乎很有道理。通过添加 TOP 6000,您暗示优化器将返回 200,000 行中相当小的子集。然后优化器使用索引查找而不是聚集索引扫描或表扫描。
Another possible explanation could caching, as Jim davis suggests. This is fairly easy to rule out by running the queries again. Try running the one with TOP 6000 first.
另一种可能的解释是缓存,正如 Jim davis 所建议的那样。通过再次运行查询很容易排除这种情况。尝试先运行 TOP 6000。
回答by ericp
It's not necessarily true that the whole table is processed if col1 has an index.
如果 col1 有索引,则不一定要处理整个表。
The SQL optimization will choose whether or not to use an index. Perhaps your "TOP" is forcing it to use the index.
SQL 优化将选择是否使用索引。也许您的“TOP”迫使它使用索引。
If you are using the MSSQL Query Analyzer (The name escapes me) hit Ctrl-K. This will show the execution plan for the query instead of executing it. Mousing over the icons will show the IO/CPU usage, I believe.
如果您使用的是 MSSQL 查询分析器(这个名字让我不明白),请按 Ctrl-K。这将显示查询的执行计划而不是执行它。我相信,将鼠标悬停在图标上会显示 IO/CPU 使用率。
I bet one is using an index seek, while the other isn't.
我敢打赌,一个正在使用索引查找,而另一个则没有。
If you have a generic client: SET SHOWPLAN_ALL ON; GO select ...; go
如果您有通用客户端:SET SHOWPLAN_ALL ON; 去选择...; 走
see http://msdn.microsoft.com/en-us/library/ms187735.aspxfor details.
有关详细信息,请参阅http://msdn.microsoft.com/en-us/library/ms187735.aspx。
回答by Jim Davis
You may be running into something as simple as caching here - perhaps (for whatever reason) the "TOP" query is cached? Using an index that the other isn't?
您可能会遇到像缓存这样简单的事情——也许(出于某种原因)“TOP”查询被缓存了?使用另一个不是的索引?
In any case the best way to quench your curiosity is to examine the full execution plan for both queries. You can do this right in SQL Management Console and it'll tell you EXACTLY what operations are being completed and how long each is predicted to take.
在任何情况下,消除好奇心的最佳方法是检查两个查询的完整执行计划。您可以在 SQL 管理控制台中正确执行此操作,它会准确地告诉您正在完成哪些操作以及每个操作预计需要多长时间。
All SQL implementations are quirky in their own way - SQL Server's no exception. These kind of "whaaaaaa?!" moments are pretty common. ;^)
所有 SQL 实现都有自己的奇特之处——SQL Server 也不例外。这些“哇哇哇?!” 时刻很常见。;^)