SQL Joins vs. SQL Subqueries (Performance)?
Source: http://stackoverflow.com/questions/3856164/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by Vishal
Suppose I have a join query something like this:
Select E.Id,E.Name from Employee E join Dept D on E.DeptId=D.Id
and a subquery something like this:
Select E.Id,E.Name from Employee E Where E.DeptId in (Select Id from Dept)
In terms of performance, which of the two queries would be faster, and why?
Also, is there a time when I should prefer one over the other?
Sorry if this is too trivial or has been asked before, but I am confused about it. It would also be great if you could suggest tools I should use to measure the performance of the two queries. Thanks a lot!
Accepted answer by JNK
I would EXPECT the first query to be quicker, mainly because it uses an equality comparison and an explicit JOIN. In my experience IN is a very slow operator, since SQL normally evaluates it as a series of WHERE clauses separated by "OR" (WHERE x=Y OR x=Z OR...).
As with ALL THINGS SQL, though, your mileage may vary. The speed will depend a lot on indexes (do you have indexes on both ID columns? That will help a lot...), among other things.
The only REAL way to tell with 100% certainty which is faster is to turn on performance tracking (IO Statistics is especially useful) and run them both. Make sure to clear your cache between runs!
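As a rough illustration of "run them both and compare", here is a minimal sketch that builds the question's two tables in an in-memory SQLite database through Python's `sqlite3` module and times each query. The table sizes, the index name, the `time_query` helper, and the use of SQLite in place of SQL Server are all assumptions made for the example; the timings it prints say nothing about how any other engine would behave.

```python
import sqlite3
import time

# Assumption: an in-memory SQLite database stands in for the real server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Dept (Id INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Employee (Id INTEGER PRIMARY KEY, Name TEXT, DeptId INTEGER);
    CREATE INDEX IX_Employee_DeptId ON Employee (DeptId);
""")
# Hypothetical sample data: 50 departments, 20,000 employees,
# some of whom point at a DeptId that does not exist in Dept.
conn.executemany("INSERT INTO Dept VALUES (?, ?)",
                 [(d, f"dept{d}") for d in range(1, 51)])
conn.executemany("INSERT INTO Employee VALUES (?, ?, ?)",
                 [(e, f"emp{e}", e % 60 + 1) for e in range(1, 20001)])

JOIN_SQL = "SELECT E.Id, E.Name FROM Employee E JOIN Dept D ON E.DeptId = D.Id"
IN_SQL = ("SELECT E.Id, E.Name FROM Employee E "
          "WHERE E.DeptId IN (SELECT Id FROM Dept)")


def time_query(sql, repeats=5):
    """Return (rows, best wall-clock seconds over `repeats` runs)."""
    best = float("inf")
    rows = []
    for _ in range(repeats):
        start = time.perf_counter()
        rows = conn.execute(sql).fetchall()
        best = min(best, time.perf_counter() - start)
    return rows, best


join_rows, join_secs = time_query(JOIN_SQL)
in_rows, in_secs = time_query(IN_SQL)
print(f"JOIN: {len(join_rows)} rows, best of 5: {join_secs:.4f}s")
print(f"IN:   {len(in_rows)} rows, best of 5: {in_secs:.4f}s")
```

Taking the best of several runs only reduces noise; it does not reproduce the cold-cache condition the answer asks for, which has to be set up on the server itself.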
Answer by linuxatico
Well, I believe it's an "oldie but goldie" question. The answer is: "It depends!". Performance is such a delicate subject that it would be silly to say: "Never use subqueries, always join". In the following links, you'll find some basic best practices that I have found to be very helpful:
- Optimizing Subqueries
- Optimizing Subqueries with Semijoin Transformations
- Rewriting Subqueries as Joins
I have a table with 50,000 rows, and the result I was looking for had 739 rows.
My query at first was this:
SELECT p.id,
       p.fixedId,
       p.azienda_id,
       p.categoria_id,
       p.linea,
       p.tipo,
       p.nome
FROM prodotto p
WHERE p.azienda_id = 2699 AND p.anno = (
    SELECT MAX(p2.anno)
    FROM prodotto p2
    WHERE p2.fixedId = p.fixedId
)
and it took 7.9s to execute.
My final query is this:
SELECT p.id,
       p.fixedId,
       p.azienda_id,
       p.categoria_id,
       p.linea,
       p.tipo,
       p.nome
FROM prodotto p
WHERE p.azienda_id = 2699 AND (p.fixedId, p.anno) IN
(
    SELECT p2.fixedId, MAX(p2.anno)
    FROM prodotto p2
    WHERE p.azienda_id = p2.azienda_id
    GROUP BY p2.fixedId
)
and it took 0.0256s.
Good SQL, good.
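To make the shape of that rewrite easier to try out, here is a small sketch that runs both forms against an in-memory SQLite table through Python's `sqlite3` module. The four sample rows, the trimmed column list, and the choice of SQLite (which needs version 3.15 or later for the `(a, b) IN (...)` row-value syntax) are assumptions for illustration only, and it makes no claim about the timings reported above. Note that the two forms agree here because each `fixedId` belongs to a single `azienda_id`; the first takes the maximum `anno` over all rows with that `fixedId`, the second only within the same `azienda_id`, so they could differ if a `fixedId` were shared.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE prodotto (
        id INTEGER PRIMARY KEY, fixedId INTEGER, azienda_id INTEGER,
        anno INTEGER, nome TEXT)
""")
# Hypothetical rows: (id, fixedId, azienda_id, anno, nome)
conn.executemany("INSERT INTO prodotto VALUES (?, ?, ?, ?, ?)", [
    (1, 10, 2699, 2012, "a-2012"),
    (2, 10, 2699, 2013, "a-2013"),
    (3, 11, 2699, 2011, "b-2011"),
    (4, 12, 3000, 2014, "c-2014"),
])

# First form: a scalar subquery evaluated against each candidate row.
CORRELATED = """
    SELECT p.id FROM prodotto p
    WHERE p.azienda_id = 2699
      AND p.anno = (SELECT MAX(p2.anno) FROM prodotto p2
                    WHERE p2.fixedId = p.fixedId)
    ORDER BY p.id
"""
# Second form: one grouped subquery, matched on the (fixedId, anno) pair.
GROUPED = """
    SELECT p.id FROM prodotto p
    WHERE p.azienda_id = 2699
      AND (p.fixedId, p.anno) IN (
          SELECT p2.fixedId, MAX(p2.anno) FROM prodotto p2
          WHERE p2.azienda_id = p.azienda_id
          GROUP BY p2.fixedId)
    ORDER BY p.id
"""
correlated_ids = [row[0] for row in conn.execute(CORRELATED)]
grouped_ids = [row[0] for row in conn.execute(GROUPED)]
print(correlated_ids, grouped_ids)  # both lists hold the latest row per fixedId
```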
Answer by HLGEM
Start by looking at the execution plans to see the differences in how SQL Server will interpret them. You can also use Profiler to actually run the queries multiple times and measure the difference.
I would not expect these to be so horribly different. Where you can get real, large performance gains from using joins instead of subqueries is when you use correlated subqueries.
EXISTS is often better than either of these two, and when you are talking about left joins where you want all records that are not in the left-joined table, NOT EXISTS is often a much better choice.
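For readers who have not seen the two "rows with no match" patterns side by side, here is a minimal sketch using the question's Employee and Dept tables in an in-memory SQLite database via Python's `sqlite3`. The sample rows and the use of SQLite are assumptions; the sketch only shows that the patterns return the same rows on this data and does not measure which one is faster.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Dept (Id INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Employee (Id INTEGER PRIMARY KEY, Name TEXT, DeptId INTEGER);
    INSERT INTO Dept VALUES (1, 'Sales'), (2, 'IT');
    -- Hypothetical rows: Cid points at a missing department, Dee has none.
    INSERT INTO Employee VALUES
        (1, 'Ann', 1), (2, 'Bob', 2), (3, 'Cid', 9), (4, 'Dee', NULL);
""")

# Employees that have a matching department.
EXISTS_SQL = """
    SELECT E.Id FROM Employee E
    WHERE EXISTS (SELECT 1 FROM Dept D WHERE D.Id = E.DeptId)
    ORDER BY E.Id
"""
# Employees without a matching department, written with NOT EXISTS.
NOT_EXISTS_SQL = """
    SELECT E.Id FROM Employee E
    WHERE NOT EXISTS (SELECT 1 FROM Dept D WHERE D.Id = E.DeptId)
    ORDER BY E.Id
"""
# The same question written as a left join that keeps only unmatched rows.
LEFT_JOIN_SQL = """
    SELECT E.Id FROM Employee E
    LEFT JOIN Dept D ON D.Id = E.DeptId
    WHERE D.Id IS NULL
    ORDER BY E.Id
"""
matched = [row[0] for row in conn.execute(EXISTS_SQL)]
unmatched_not_exists = [row[0] for row in conn.execute(NOT_EXISTS_SQL)]
unmatched_left_join = [row[0] for row in conn.execute(LEFT_JOIN_SQL)]
print(matched, unmatched_not_exists, unmatched_left_join)
```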
Answer by JP Emvia
Performance depends on the amount of data you are executing on...
If there is less data, around 20k rows, JOIN works better.
If the data is more like 100k+ rows, then IN works better.
If you do not need the data from the other table, IN is good, but it is always better to go for EXISTS.
I tested all these criteria, and the tables have proper indexes.
Answer by onedaywhen
The two queries may not be semantically equivalent. If an employee works for more than one department (possible in the enterprise I work for; admittedly, this would imply your table is not fully normalized) then the first query would return duplicate rows whereas the second query would not. To make the queries equivalent in this case, the DISTINCT keyword would have to be added to the SELECT clause, which may have an impact on performance.
Note there is a design rule of thumb that states a table should model an entity/class or a relationship between entities/classes but not both. Therefore, I suggest you create a third table, say OrgChart, to model the relationship between employees and departments.
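The duplicate-row effect is easy to see with such a relationship table. The sketch below is an illustration under assumptions: the `OrgChart` column names (`EmployeeId`, `DeptId`), the sample rows, and the use of an in-memory SQLite database through Python's `sqlite3` are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employee (Id INTEGER PRIMARY KEY, Name TEXT);
    -- Hypothetical relationship table: one row per (employee, department).
    CREATE TABLE OrgChart (EmployeeId INTEGER, DeptId INTEGER);
    INSERT INTO Employee VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cid');
    -- Ann works for two departments, Bob for one, Cid for none.
    INSERT INTO OrgChart VALUES (1, 1), (1, 2), (2, 1);
""")

JOIN_SQL = """
    SELECT E.Id, E.Name FROM Employee E
    JOIN OrgChart O ON O.EmployeeId = E.Id
    ORDER BY E.Id
"""
IN_SQL = """
    SELECT E.Id, E.Name FROM Employee E
    WHERE E.Id IN (SELECT EmployeeId FROM OrgChart)
    ORDER BY E.Id
"""
DISTINCT_JOIN_SQL = """
    SELECT DISTINCT E.Id, E.Name FROM Employee E
    JOIN OrgChart O ON O.EmployeeId = E.Id
    ORDER BY E.Id
"""
join_rows = conn.execute(JOIN_SQL).fetchall()            # Ann appears twice
in_rows = conn.execute(IN_SQL).fetchall()                # one row per employee
distinct_rows = conn.execute(DISTINCT_JOIN_SQL).fetchall()
print(join_rows)
print(in_rows)
print(distinct_rows)
```

The join returns one row per matching relationship, while IN only asks whether at least one match exists, which is why DISTINCT is needed to make the two agree.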
Answer by Arvin Amir
I know this is an old post, but I think this is a very important topic, especially nowadays when we have 10M+ records and talk about terabytes of data.
I will also weigh in with the following observations. I have about 45M records in my table ([data]), and about 300 records in my [cats] table. I have extensive indexing for all of the queries I am about to talk about.
Consider Example 1:
UPDATE d set category = c.categoryname
FROM [data] d
JOIN [cats] c on c.id = d.catid
versus Example 2:
UPDATE d set category = (SELECT TOP(1) c.categoryname FROM [cats] c where c.id = d.catid)
FROM [data] d
Example 1 took about 23 mins to run. Example 2 took around 5 mins.
So I would conclude that the sub-query is much faster in this case. Of course, keep in mind that I am using M.2 SSD drives capable of I/O at 1 GB/sec (that's bytes, not bits), so my indexes are really fast too. This may affect the speeds in your circumstances as well.
If it's a one-off data cleansing, it is probably best to just let it run and finish. Before I run the big query, I use TOP(10000), see how long it takes, and scale that up to the full number of records.
If you are optimizing production databases, I would strongly suggest pre-processing data, i.e. use triggers or a job broker to update records asynchronously, so that real-time access retrieves static data.
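The "time a sample, then scale up" estimate can be sketched as follows. This is only an illustration under assumptions: it uses an in-memory SQLite database through Python's `sqlite3` instead of SQL Server, a hypothetical 200,000-row `data` table instead of 45M rows, a `WHERE id <= ?` filter in place of `TOP(10000)`, and it assumes the cost grows roughly in proportion to the number of rows, which may not hold once data no longer fits in memory.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cats (id INTEGER PRIMARY KEY, categoryname TEXT);
    CREATE TABLE data (id INTEGER PRIMARY KEY, catid INTEGER, category TEXT);
""")
TOTAL_ROWS = 200_000   # hypothetical; the answer's table had about 45M rows
SAMPLE_ROWS = 10_000   # mirrors the TOP(10000) trial run

conn.executemany("INSERT INTO cats VALUES (?, ?)",
                 [(c, f"cat{c}") for c in range(1, 301)])
conn.executemany("INSERT INTO data (id, catid) VALUES (?, ?)",
                 ((i, i % 300 + 1) for i in range(1, TOTAL_ROWS + 1)))

start = time.perf_counter()
# Correlated-subquery form of the update, restricted to the sample rows.
conn.execute("""
    UPDATE data
    SET category = (SELECT c.categoryname FROM cats c WHERE c.id = data.catid)
    WHERE id <= ?
""", (SAMPLE_ROWS,))
sample_secs = time.perf_counter() - start

# Linear extrapolation from the sample to the whole table.
estimated_total_secs = sample_secs * TOTAL_ROWS / SAMPLE_ROWS
updated = conn.execute(
    "SELECT COUNT(*) FROM data WHERE category IS NOT NULL").fetchone()[0]
print(f"sample: {sample_secs:.3f}s for {updated} rows; "
      f"estimated full run: {estimated_total_secs:.1f}s")
```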
Answer by Lucero
The performance should be the same; it's much more important to have the correct indexes and clustering applied on your tables (there are some good resources on that topic).
(Edited to reflect the updated question)
Answer by Snekse
You can use an Explain Plan to get an objective answer.
For your problem, an Exists filter would probably perform the fastest.
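As one concrete way to look at plans, the sketch below asks SQLite for its plan for the join, IN, and EXISTS forms of the question's query, using `EXPLAIN QUERY PLAN` through Python's `sqlite3` module. The schema, the index name, and the `query_plan` helper are assumptions for the example, and other engines expose plans through their own commands and will choose different plans, so treat the output as specific to SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Dept (Id INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Employee (Id INTEGER PRIMARY KEY, Name TEXT, DeptId INTEGER);
    CREATE INDEX IX_Employee_DeptId ON Employee (DeptId);
""")

QUERIES = {
    "join": "SELECT E.Id, E.Name FROM Employee E "
            "JOIN Dept D ON E.DeptId = D.Id",
    "in": "SELECT E.Id, E.Name FROM Employee E "
          "WHERE E.DeptId IN (SELECT Id FROM Dept)",
    "exists": "SELECT E.Id, E.Name FROM Employee E "
              "WHERE EXISTS (SELECT 1 FROM Dept D WHERE D.Id = E.DeptId)",
}


def query_plan(sql):
    """Return the plan description strings SQLite reports for `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]  # the last column holds the description


plans = {name: query_plan(sql) for name, sql in QUERIES.items()}
for name, steps in plans.items():
    print(name)
    for step in steps:
        print("   ", step)
```

Comparing the printed steps (scans versus index searches, and the order in which tables are visited) is usually more informative than a single timing.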