Is WHERE ID IN (1, 2, 3, 4, 5, ...) the most efficient?
Original question: http://stackoverflow.com/questions/1522119/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Asked by Jan Zich
I know that this topic has been beaten to death, but it seems that many articles on the Internet are looking for the most elegant way to solve it instead of the most efficient one. Here is the problem. We are building an application where one of the common database queries will involve manipulation (SELECTs and UPDATEs) based on a user-supplied list of IDs. The table in question is expected to have hundreds of thousands of rows, and the user-provided lists of IDs can potentially be unbounded, but they will most likely be in the tens or hundreds (we may limit them for performance reasons later).
If my understanding of how databases work in general is correct, the most efficient approach is to simply use the WHERE ID IN (1, 2, 3, 4, 5, ...) construct and build queries dynamically. The core of the problem is that the input lists of IDs will be really arbitrary, and so no matter how clever the database is or how cleverly we implement it, we always have a random subset of integers to start with, and so eventually every approach has to internally boil down to something like WHERE ID IN (1, 2, 3, 4, 5, ...) anyway.
One can find many approaches all over the web. For instance, one involves declaring a table variable, passing the list of IDs to a stored procedure as a comma-delimited string, splitting it in the stored procedure, inserting the IDs into the table variable and joining the master table on it, i.e. something like this:
-- 1. Table variable to hold the IDs:
DECLARE @IDS TABLE (ID int);
-- 2. Split the given string of IDs and insert each ID into @IDS.
-- Omitted for brevity.
-- 3. Join the main table to @IDS (a table variable needs an alias
--    before its columns can be referenced):
SELECT MyTable.ID, MyTable.SomeColumn
FROM MyTable INNER JOIN @IDS AS ids ON MyTable.ID = ids.ID;
Putting the problems with string manipulation aside, I think what essentially happens in this case is that in the third step the SQL Server says: “Thank you, that's nice, but I just need a list of the IDs”, and it scans the table variable @IDS, and then does n seeks in MyTable, where n is the number of IDs. I've done some elementary performance evaluations and inspected the query plan, and it seems that this is what happens. So the table variable, the string concatenation and splitting, and all the extra INSERTs are for nothing.
Am I correct? Or am I missing anything? Is there really some clever and more efficient way? Basically, what I'm saying is that the SQL Server has to do n index seeks no matter what, and formulating the query as WHERE ID IN (1, 2, 3, 4, 5, ...) is the most straightforward way to ask for it.
Accepted answer by Joel Coehoorn
Well, it depends on what's really going on. How is the user choosing these IDs?
Also, it's not just efficiency; there's also security and correctness to worry about. When and how does the user tell the database about their ID choices? How do you incorporate them into the query?
It might be much better to put the selected IDs into a separate table that you can join against (or use a WHERE EXISTS against).
I'll give you that you're not likely to do much better performance-wise than IN (1,2,3..n) for a small (user-generated) n. But you need to think about how you generate that query. Are you going to use dynamic SQL? If so, how will you secure it from injection? Will the server be able to cache the execution plan?
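One way to approach the injection question is to generate only the placeholder list dynamically and bind the IDs as parameters. What follows is a minimal sketch, not the answerer's code: it uses Python with the standard-library sqlite3 module as a stand-in for SQL Server, the helper name select_by_ids is hypothetical, and MyTable and SomeColumn are borrowed from the question. Note that each distinct list length still produces a different statement text, so plan caching may remain an issue on some servers.

```python
import sqlite3

def select_by_ids(conn, ids):
    """Return (ID, SomeColumn) rows for the given IDs using bound parameters."""
    ids = [int(i) for i in ids]   # anything that is not an integer raises ValueError
    if not ids:
        return []                 # an empty IN () is rejected by many databases
    # Only the "?" markers are interpolated into the SQL text, never the values.
    placeholders = ", ".join("?" for _ in ids)
    sql = (
        "SELECT ID, SomeColumn FROM MyTable "
        f"WHERE ID IN ({placeholders}) ORDER BY ID"
    )
    return conn.execute(sql, ids).fetchall()

# Hypothetical in-memory table standing in for the real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (ID INTEGER PRIMARY KEY, SomeColumn TEXT)")
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?)",
    [(i, f"row {i}") for i in range(1, 101)],
)

rows = select_by_ids(conn, [3, 1, 2])
print(rows)
```

Because the values never become part of the SQL string, a malicious "ID" fails the int() conversion instead of reaching the database.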
Also, using an extra table is often just easier. Say you're building a shopping cart for an eCommerce site. Rather than worrying about keeping track of the cart on the client side or in a session, it's likely better to update the ShoppingCart table every time the user makes a selection. This also avoids the whole problem of how to safely set the parameter value for your query, because you're only making one change at a time.
Don't forget the old adage (with apologies to Benjamin Franklin):
He who would trade correctness for performance deserves neither
Answer by Dean J
Be careful; on many databases, IN (...) is limited to a fixed number of things in the IN clause. For example, I think it's 1000 in Oracle. That's big, but possibly worth knowing.
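If such a limit applies, one common workaround is to split the ID list into batches and run one query per batch. A minimal sketch in Python follows; the helper name chunked is hypothetical, and the default of 1000 is simply the figure recalled above for Oracle, so check the actual limit of your database before relying on it.

```python
def chunked(ids, size=1000):
    """Split ids into consecutive lists of at most `size` items each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    ids = list(ids)
    return [ids[start:start + size] for start in range(0, len(ids), size)]

# 2501 hypothetical IDs are split into three batches.
batches = chunked(range(1, 2502), size=1000)
print([len(batch) for batch in batches])
```

Each batch can then be sent as its own IN (...) query and the results concatenated on the client.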
Answer by Rodrigo
The IN clause does not guarantee an INDEX SEEK. I faced this problem before when using SQL Mobile edition on a Pocket PC with very little memory. Replacing IN (list) with a list of OR clauses sped up my query by approximately 400%.
Another approach is to have a temp table that stores the IDs and join it against the target table, but if this operation is used often, a permanent/indexed table can help the optimizer.
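The temp-table variant can be sketched as follows. This is only an illustration under assumptions: it uses Python's sqlite3 module rather than SQL Server, the temp table name WantedIds and the helper select_via_temp_table are hypothetical, and MyTable is the table from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (ID INTEGER PRIMARY KEY, SomeColumn TEXT)")
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?)",
    [(i, f"row {i}") for i in range(1, 1001)],
)

def select_via_temp_table(conn, ids):
    """Load the IDs into an indexed temp table, then join against it."""
    # The primary key gives the temp table an index and rejects duplicates.
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS WantedIds (ID INTEGER PRIMARY KEY)"
    )
    conn.execute("DELETE FROM WantedIds")   # reuse the table between calls
    conn.executemany(
        "INSERT OR IGNORE INTO WantedIds (ID) VALUES (?)",
        [(int(i),) for i in ids],
    )
    return conn.execute(
        "SELECT t.ID, t.SomeColumn FROM MyTable AS t "
        "JOIN WantedIds AS w ON w.ID = t.ID ORDER BY t.ID"
    ).fetchall()

# The duplicate 5 and the unknown ID 2000 are both handled by the join.
rows = select_via_temp_table(conn, [5, 999, 5, 2000])
print(rows)
```

The extra round trips to fill the table are the cost of this approach; whether they pay off depends on the list size and the engine.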
Answer by van
For me, IN (...) is not the preferred option for many reasons, including the limit on the number of parameters.
Following up on a note from Jan Zich regarding the performance of various temp-table implementations, here are some numbers from the SQL execution plan:
- XML solution: 99% time - xml parsing
- comma-separated procedure using a UDF from CodeProject: 50% temp table scan, 50% index seek. One can argue whether this is the most optimal implementation of string parsing, but I did not want to create one myself (I will happily test another one).
- CLR UDF to split string: 98% - index seek.
Here is the code for CLR UDF:
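using System;
using System.Collections;
using Microsoft.SqlServer.Server;

public class SplitString
{
    // Table-valued CLR function: returns one row per comma-separated item.
    // TableDefinition (added here) describes the returned column for deployment.
    [SqlFunction(FillRowMethodName = "FillRow", TableDefinition = "ID int")]
    public static IEnumerable InitMethod(String inputString)
    {
        return inputString.Split(',');
    }

    // Called once per item returned by InitMethod to fill the ID column.
    public static void FillRow(Object obj, out int ID)
    {
        string strID = (string)obj;
        ID = Int32.Parse(strID);
    }
}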
So I will have to agree with Jan that the XML solution is not efficient. Therefore, if a comma-separated list is to be passed as a filter, a simple CLR UDF seems to be optimal in terms of performance.
I tested a search for 1K records in a table of 200K rows.
Answer by gbn
A table var has issues: using a temp table with index has benefits for statistics.
A table var is assumed to always have one row, whereas a temp table has stats the optimiser can use.
Parsing a CSV is easy: see the related questions on Stack Overflow.
Answer by Jonathan Leffler
Once upon a long time ago, I found that on the particular DBMS I was working with, the IN list was more efficient up to some threshold (which was, IIRC, something like 30-70), and after that, it was more efficient to use a temp table to hold the list of values and join with the temp table. (The DBMS made creating temp tables very easy, but even with the overhead of creating and populating the temp table, the queries ran faster overall.) This was with up-to-date statistics on the main data tables (but it also helped to update the statistics for the temp table too).
There is likely to be a similar effect in modern DBMS; the threshold level may well have changed (I am talking about depressingly close to twenty years ago), but you need to do your measurements and consider your strategy or strategies. Note that optimizers have improved since then - they may be able to make sensible use of bigger IN lists, or automatically convert an IN list into an anonymous temp table. But measurement will be key.
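A measurement along these lines could look like the sketch below. It is an assumption-laden illustration, not a benchmark of SQL Server: it runs against an in-memory SQLite database through Python's sqlite3 module, the names via_in_list, via_temp_table, timed, and WantedIds are hypothetical, and the absolute numbers it prints say nothing about other engines. The point is only the shape of the experiment: same ID list, both strategies, several list sizes.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (ID INTEGER PRIMARY KEY, SomeColumn TEXT)")
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?)",
    [(i, f"row {i}") for i in range(1, 20001)],
)
conn.execute("CREATE TEMP TABLE WantedIds (ID INTEGER PRIMARY KEY)")

def via_in_list(ids):
    marks = ", ".join("?" for _ in ids)
    sql = f"SELECT ID FROM MyTable WHERE ID IN ({marks}) ORDER BY ID"
    return conn.execute(sql, ids).fetchall()

def via_temp_table(ids):
    # Populating the temp table is part of the measured cost.
    conn.execute("DELETE FROM WantedIds")
    conn.executemany("INSERT INTO WantedIds VALUES (?)", [(i,) for i in ids])
    return conn.execute(
        "SELECT t.ID FROM MyTable AS t "
        "JOIN WantedIds AS w ON w.ID = t.ID ORDER BY t.ID"
    ).fetchall()

def timed(fn, ids, repeats=20):
    """Return the last result and the mean wall-clock seconds per call."""
    start = time.perf_counter()
    for _ in range(repeats):
        result = fn(ids)
    return result, (time.perf_counter() - start) / repeats

timings = {}
for n in (10, 50, 200):
    ids = list(range(1, 20001, 20000 // n))[:n]   # n evenly spread IDs
    in_rows, in_secs = timed(via_in_list, ids)
    tmp_rows, tmp_secs = timed(via_temp_table, ids)
    assert in_rows == tmp_rows                    # both must agree on the rows
    timings[n] = (in_secs, tmp_secs)
    print(f"n={n}: IN list {in_secs:.6f}s, temp table {tmp_secs:.6f}s")
```

Any threshold found this way is specific to the engine, data, and hardware used for the test.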
Answer by George Filippakos
In SQL Server 2008 or later you should be looking to use table-valued parameters.
2008 makes it simple to pass a comma-separated list to SQL Server using this method.
Here is an excellent source of information and performance tests on the subject:
Here is a great tutorial:
Answer by Stuart Ainsworth
Essentially, I would agree with your observation; SQL Server's optimizer will ultimately pick the best plan for analyzing a list of values and it will typically equate to the same plan, regardless of whether or not you are using
WHERE IN
or
WHERE EXISTS
or
JOIN someholdingtable ON ...
Obviously, there are other factors which influence plan choice (like covering indexes, etc). The reason that people have various methods for passing in this list of values to a stored procedure is that before SQL 2008, there really was no simple way of passing in multiple values. You could do a list of parameters (WHERE IN (@param1, @param2)...), or you could parse a string (the method you show above). As of SQL 2008, you can also pass table variables around, but the overall result is the same.
So yes, it doesn't matter how you get the list of variables to the query; however, there are other factors which may have some effect on the performance of said query once you get the list of variables in there.
Answer by van
To answer the question directly, there is no way to pass a (dynamic) list of arguments to an SQL Server 2005 procedure. Therefore, what most people do in these cases is pass a comma-delimited list of identifiers, which I did as well.
Since SQL Server 2005, though, I prefer passing an XML string, which is also very easy to create on the client side (C#, Python, another SQL SP) and has been "native" to work with since 2005:
CREATE PROCEDURE myProc(@MyXmlAsSTR NVARCHAR(MAX)) AS BEGIN
DECLARE @x XML
SELECT @x = CONVERT(XML, @MyXmlAsSTR)
Then you can join your base query directly with the XML select as (not tested):
SELECT t.*
FROM myTable t
INNER JOIN @x.nodes('/ROOT/ROW') AS R(x)
ON t.ID = x.value('@ID', 'INTEGER')
when passing <ROOT><ROW ID="1"/><ROW ID="2"/></ROOT>. Just remember that XML is CaSe-SensiTiv.
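On the client side, a string of that shape could be produced as in the following sketch. It assumes Python's standard xml.etree.ElementTree module; the helper name ids_to_xml is hypothetical, while the ROOT/ROW/ID names match the XPath used in the query above, including their upper case.

```python
import xml.etree.ElementTree as ET

def ids_to_xml(ids):
    """Serialize integer IDs as <ROOT><ROW ID="..."/>...</ROOT>."""
    root = ET.Element("ROOT")
    for value in ids:
        # int() rejects non-numeric input before it reaches the XML.
        ET.SubElement(root, "ROW", ID=str(int(value)))
    return ET.tostring(root, encoding="unicode")

payload = ids_to_xml([1, 2])
print(payload)
```

The resulting string would then be passed as the @MyXmlAsSTR parameter of the procedure.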
Answer by yfeldblum
select t.*
from (
select id = 35 union all
select id = 87 union all
select id = 445 union all
...
select id = 33643
) ids
join my_table t on t.id = ids.id
If the set of ids to search on is small, this may improve performance by permitting the query engine to do an index seek. If the optimizer judges that a table scan would be faster than, say, one hundred index seeks, then the optimizer will so instruct the query engine.
Note that query engines tend to treat
select t.*
from my_table t
where t.id in (35, 87, 445, ..., 33643)
as equivalent to
select t.*
from my_table t
where t.id = 35 or t.id = 87 or t.id = 445 or ... or t.id = 33643
and note that query engines tend not to be able to perform index seeks on queries with disjunctive search criteria. As an example, Google AppEngine datastore will not execute a query with disjunctive search criteria at all, because it will only execute queries for which it knows how to perform an index seek.
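Whether a given engine treats the IN form and the OR form alike is something you can check rather than assume. The sketch below does so for SQLite only, via Python's sqlite3 module and SQLite's EXPLAIN QUERY PLAN statement; it is an assumption-based illustration, the table contents are made up, and the plan text differs between SQLite versions, so no particular plan output is promised here. On SQL Server you would inspect the execution plan instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [(i, f"row {i}") for i in range(1, 1001)],
)

in_sql = "SELECT t.id FROM my_table AS t WHERE t.id IN (35, 87, 445) ORDER BY t.id"
or_sql = (
    "SELECT t.id FROM my_table AS t "
    "WHERE t.id = 35 OR t.id = 87 OR t.id = 445 ORDER BY t.id"
)

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the readable step.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

in_rows = conn.execute(in_sql).fetchall()
or_rows = conn.execute(or_sql).fetchall()
in_plan, or_plan = plan(in_sql), plan(or_sql)

print("IN plan:", in_plan)
print("OR plan:", or_plan)
```

Comparing the two printed plans shows whether this particular engine scans or seeks for each formulation.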