Is SQL IN bad for performance?
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it under the same license, but you must cite the original URL and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/1013797/
Asked by Victor Rodrigues
I have a query doing something like:
SELECT FieldX, FieldY FROM A
WHERE FieldW IN (108, 109, 113, 138, 146, 160,
307, 314, 370, 371, 441, 454, 457, 458, 479, 480,
485, 488, 490, 492, 519, 523, 525, 534, 539, 543,
546, 547, 550, 564, 573, 629, 642, 643, 649, 650,
651, 694, 698, 699, 761, 762, 768, 772, 773, 774,
775, 778, 784, 843, 844, 848, 851, 852, 853, 854,
855, 856, 857, 858, 859, 860, 861, 862, 863, 864,
865, 868, 869, 871, 872, 873, 891)
Having an IN clause with so many options, is it bad for query performance? I'm experiencing many timeouts in my application, and I believe it could be a source of this kind of problem. Can I optimize the query without removing the numbers, using any good SQL Hint?
EDIT:
@KM these are keys in a different table. This is a forum application. Briefly: the C# code gets all forums from the database and stores them in the app cache. Before it calls the procedure that gets the threads for these forums and this user, it filters the "all forums" collection, applying permissions and some business logic. The timeout happens in the database, not in the application itself. Doing all this logic in the query would require a lot of inner joins, and I'm not 100% sure I can do all of it inside the procedure.
I'm using SQL Server 2000
Answered by LBushkin
There are several considerations when writing a query using the IN operator that can have an effect on performance.
First, IN clauses are generally internally rewritten by most databases to use the OR logical connective. So col IN ('a','b','c') is rewritten to: (col = 'a') OR (col = 'b') OR (col = 'c'). The execution plan for both queries will likely be equivalent, assuming that you have an index on col.
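A minimal sketch of that equivalence, using Python's built-in sqlite3 module as a stand-in for the real server. The table contents and the index name ix_a_w are made up for illustration, and SQLite's planner is not SQL Server's, so this only shows that the two spellings select the same rows.

```python
import sqlite3

# SQLite stands in for the real server; table A's rows are invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (FieldW INTEGER, FieldX TEXT, FieldY TEXT)")
conn.executemany(
    "INSERT INTO A VALUES (?, ?, ?)",
    [(108, "x1", "y1"), (109, "x2", "y2"), (200, "x3", "y3"), (113, "x4", "y4")],
)
conn.execute("CREATE INDEX ix_a_w ON A (FieldW)")  # hypothetical index name

# The IN spelling.
in_rows = conn.execute(
    "SELECT FieldX, FieldY FROM A "
    "WHERE FieldW IN (108, 109, 113) ORDER BY FieldX"
).fetchall()

# The same predicate written as a chain of ORs.
or_rows = conn.execute(
    "SELECT FieldX, FieldY FROM A "
    "WHERE FieldW = 108 OR FieldW = 109 OR FieldW = 113 ORDER BY FieldX"
).fetchall()

# Both forms select the same rows; whether the plans match depends on the engine.
print(in_rows == or_rows)
```

To compare plans rather than results on a real server, look at the execution plan of each form there.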
Second, when using either IN or OR with a variable number of arguments, you are causing the database to have to re-parse the query and rebuild an execution plan each time the arguments change. Building the execution plan for a query can be an expensive step. Most databases cache the execution plans for the queries they run using the EXACT query text as a key. If you execute a similar query but with different argument values in the predicate, you will most likely cause the database to spend a significant amount of time parsing and building execution plans. This is why bind variables are strongly recommended as a way to ensure optimal query performance.
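As a rough illustration of bind variables, here is a sketch in Python with sqlite3 standing in for the real database. The helper select_by_ids and the sample rows are hypothetical. The point is only that the statement text stays identical when the values change; it still changes when the number of values changes, so this does not fully solve the variable-length case.

```python
import sqlite3

def select_by_ids(conn, ids):
    """Run the IN query with one bind placeholder per id (ids must be non-empty)."""
    placeholders = ", ".join("?" for _ in ids)
    sql = (
        "SELECT FieldX, FieldY FROM A "
        f"WHERE FieldW IN ({placeholders}) ORDER BY FieldX"
    )
    # The values travel separately from the statement text.
    return sql, conn.execute(sql, list(ids)).fetchall()

# Invented sample data for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (FieldW INTEGER, FieldX TEXT, FieldY TEXT)")
conn.executemany(
    "INSERT INTO A VALUES (?, ?, ?)",
    [(108, "x1", "y1"), (109, "x2", "y2"), (200, "x3", "y3"), (113, "x4", "y4")],
)

sql_a, rows_a = select_by_ids(conn, [108, 109])
sql_b, rows_b = select_by_ids(conn, [113, 200])

# Same number of values, different values: the statement text is identical.
print(sql_a == sql_b)
```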
Third, many databases have a limit on the complexity of queries they can execute. One of those limits is the number of logical connectives that can be included in the predicate. In your case, a few dozen values are unlikely to reach the built-in limit of the database, but if you expect to pass hundreds or thousands of values to an IN clause, it can definitely happen. In that case the database will simply cancel the query request.
Fourth, queries that include IN and OR in the predicate cannot always be optimally rewritten in a parallel environment. There are various cases where parallel server optimizations do not get applied; MSDN has a decent introduction to optimizing queries for parallelism. Generally though, queries that use the UNION ALL operator are trivially parallelizable in most databases, and are preferred to logical connectives (like OR and IN) when possible.
Answered by tekBlues
If you have a good index on FieldW, using that IN is perfectly right.
I have just tested and SQL 2000 does a Clustered Index Scan when using the IN.
Answered by Quassnoi
You can try creating a temporary table, inserting your values into it, and using that table in the IN predicate instead.
AFAIK, SQL Server 2000 cannot build a hash table of the set of constants, which deprives the optimizer of the possibility of using a HASH SEMI JOIN.
This will help only if you don't have an index on FieldW (which you should have).
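The temporary-table idea can be sketched as follows, again with Python's sqlite3 as a stand-in. The table name wanted_ids and the sample rows are hypothetical; on SQL Server a local temporary table would be named with a leading #. Whether this beats a literal IN list has to be measured on the real server.

```python
import sqlite3

# Invented sample data for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (FieldW INTEGER, FieldX TEXT, FieldY TEXT)")
conn.executemany(
    "INSERT INTO A VALUES (?, ?, ?)",
    [(108, "x1", "y1"), (109, "x2", "y2"), (200, "x3", "y3"), (113, "x4", "y4")],
)
wanted = [108, 109, 113]

# Load the id list into a temporary table; the primary key keeps ids unique.
conn.execute("CREATE TEMP TABLE wanted_ids (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO wanted_ids (id) VALUES (?)", [(i,) for i in wanted])

# Variant 1: keep the IN predicate, but feed it from the table.
in_subquery = conn.execute(
    "SELECT FieldX, FieldY FROM A "
    "WHERE FieldW IN (SELECT id FROM wanted_ids) ORDER BY FieldX"
).fetchall()

# Variant 2: join to the table; equivalent here because the ids are unique.
joined = conn.execute(
    "SELECT A.FieldX, A.FieldY FROM A "
    "JOIN wanted_ids w ON w.id = A.FieldW ORDER BY A.FieldX"
).fetchall()

print(in_subquery == joined)
```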
You can also try to include your FieldX and FieldY columns in the index:
CREATE INDEX ix_a_wxy ON a (FieldW, FieldX, FieldY)
so that the query can be served from the index alone, without touching the table.
SQL Server 2000 lacks the INCLUDE option for CREATE INDEX, so the extra columns have to go into the index key. This may degrade DML performance a little, but should improve the query performance.
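To see what "served from the index alone" looks like, here is a sketch that asks SQLite (not SQL Server) for its query plan after creating the same composite index. The extra column Other is an assumption, added so that the table holds more than the index does. The plan wording is SQLite-specific and will not match SQL Server's execution plan output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "Other" is a hypothetical extra column, so the index is narrower than the table.
conn.execute(
    "CREATE TABLE A (FieldW INTEGER, FieldX TEXT, FieldY TEXT, Other TEXT)"
)
conn.execute("CREATE INDEX ix_a_wxy ON A (FieldW, FieldX, FieldY)")

# Ask the planner how it would run the query; the last column is the description.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT FieldX, FieldY FROM A WHERE FieldW IN (108, 109, 113)"
).fetchall()
plan_text = " | ".join(row[-1] for row in plan)

print(plan_text)
```

If the plan mentions a covering index, every column the query needs was found in the index. On SQL Server, the equivalent check is an index seek with no bookmark lookup in the execution plan.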
Update:
From your execution plan I see that you need a composite index on (SettingsID, SectionID).
SQL Server 2000 can indeed build a hash table out of a constant list (and does so), but a Hash Semi Join will most probably be less efficient than a Nested Loop for this query.
And just a side note: if you need to know the count of rows satisfying the WHERE condition, don't use COUNT(column), use COUNT(*) instead.
A COUNT(column) does not count the rows for which the column value is NULL.
This means that, first, you can get results you didn't expect, and, second, the optimizer will need to do an extra Key Lookup / Bookmark Lookup if your column is not covered by an index that serves the WHERE condition.
Since ThreadId seems to be a CLUSTERED PRIMARY KEY, it's all right for this very query, but try to avoid it in general.
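The difference between the two counts is easy to reproduce. A small sketch with sqlite3, where the table T and its rows are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (id INTEGER PRIMARY KEY, col TEXT)")
# Two of the four rows have NULL in col.
conn.executemany(
    "INSERT INTO T (id, col) VALUES (?, ?)",
    [(1, "a"), (2, None), (3, "c"), (4, None)],
)

# COUNT(*) counts rows; COUNT(col) counts rows where col is not NULL.
count_star, count_col = conn.execute(
    "SELECT COUNT(*), COUNT(col) FROM T WHERE id <= 4"
).fetchone()

print(count_star, count_col)  # prints "4 2"
```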
Answered by dkretz
There are better ways to code it, but I doubt it's the cause of your timeouts, especially if it's only a SELECT. You should be able to determine that by looking at your query traces though. But recoding this would be optimization by guessing, and an unlikely guess at that.
Let's start with a query plan for the query that is actually timing out. Do you know for sure which query it is?
Answered by Steve Broberg
Depending on your data distribution, additional predicates in your WHERE clause may improve performance. For example, if the set of ids is small relative to the total number in the table, and you know that the ids are relatively close together (perhaps they will usually be recent additions, and therefore clustered at the high end of the range), you could try including the predicate "AND FieldW BETWEEN 108 AND 891" (after determining the min and max id in your set in the C# code). It may be that doing a range scan on that column (if indexed) works faster than what is currently being used.
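A sketch of how the bounds could be derived from the id list in application code, written in Python with sqlite3 rather than the asker's C# and SQL Server. The helper select_with_range and the sample rows are hypothetical. The BETWEEN predicate is redundant by construction, so the result is the same as with IN alone; only the plan may differ.

```python
import sqlite3

def select_with_range(conn, ids):
    """Run the IN query with an extra BETWEEN bound derived from the same ids."""
    ids = sorted(set(ids))  # assumes a non-empty list
    placeholders = ", ".join("?" for _ in ids)
    sql = (
        "SELECT FieldX, FieldY FROM A "
        f"WHERE FieldW IN ({placeholders}) "
        "AND FieldW BETWEEN ? AND ? "
        "ORDER BY FieldX"
    )
    # After sorting, the first and last ids are the min and max of the set.
    return conn.execute(sql, ids + [ids[0], ids[-1]]).fetchall()

# Invented sample data for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (FieldW INTEGER, FieldX TEXT, FieldY TEXT)")
conn.executemany(
    "INSERT INTO A VALUES (?, ?, ?)",
    [(50, "x0", "y0"), (108, "x1", "y1"), (500, "x2", "y2"),
     (891, "x3", "y3"), (900, "x4", "y4")],
)

ranged_rows = select_with_range(conn, [891, 108, 113])
plain_rows = conn.execute(
    "SELECT FieldX, FieldY FROM A "
    "WHERE FieldW IN (108, 113, 891) ORDER BY FieldX"
).fetchall()

print(ranged_rows == plain_rows)
```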
Answered by Remus Rusanu
IN is exactly the same thing as writing a big list of ORs. And OR often makes queries unSARGable, so your indexes may be ignored and the plan goes for a full scan.
Answered by infocyde
Here is your answer...
http://www.4guysfromrolla.com/webtech/031004-1.shtml
Basically, you want to create a function that will split a string and populate a temp table with the split contents. Then you can join to that temp table and manipulate your data. The above explains things pretty well. I use this technique a lot.
In your specific case, join to the temp table instead of using an IN clause; in my experience it is much faster.
Answered by Bryan Migliorisi
Typically the IN clause is harmful to performance, but what is "bad" depends on the application, data, database size, etc. You need to test your own app to see what is best.
Answered by Tommi
Basically what that where clause does is "FieldW = 108 OR FieldW = 109 OR FieldW = 113...". Sometimes you can get better performance by doing multiple selects, and combining them with union. For example:
SELECT FieldX, FieldY FROM A WHERE FieldW = 108
UNION ALL
SELECT FieldX, FieldY FROM A WHERE FieldW = 109
But of course that is impractical when you're comparing to so many values.
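If you want to measure the UNION ALL form anyway, the statement does not have to be typed by hand. A sketch that generates it from an id list, using Python and sqlite3 as stand-ins; the helper build_union_all and the sample rows are hypothetical, and real engines cap the number of branches a single statement may have.

```python
import sqlite3

def build_union_all(ids):
    """Return (sql, params): one single-value SELECT per distinct id."""
    distinct = sorted(set(ids))  # a duplicated id would return its rows twice
    branch = "SELECT FieldX, FieldY FROM A WHERE FieldW = ?"
    return "\nUNION ALL\n".join(branch for _ in distinct), distinct

# Invented sample data for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (FieldW INTEGER, FieldX TEXT, FieldY TEXT)")
conn.executemany(
    "INSERT INTO A VALUES (?, ?, ?)",
    [(108, "x1", "y1"), (109, "x2", "y2"), (113, "x4", "y4")],
)

sql, params = build_union_all([109, 108, 109])
union_rows = sorted(conn.execute(sql, params).fetchall())
in_rows = sorted(
    conn.execute(
        "SELECT FieldX, FieldY FROM A WHERE FieldW IN (108, 109)"
    ).fetchall()
)

# Same rows either way; only timing on the real server says which is faster.
print(union_rows == in_rows)
```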
Another option might be to insert those values into a temporary table and then join the A table to that temp table.
Answered by Eric
The size of your table will determine the speed when using this statement. If it's not a very large table, this statement isn't what is affecting your performance.