MySQL 加入与子查询
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/2577174/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Join vs. sub-query
提问by Your Common Sense
I am an old-school MySQL user and have always preferred JOIN
over sub-query. But nowadays everyone uses sub-query, and I hate it; I don't know why.
我是一个老派的 MySQL 用户,并且总是JOIN
比子查询更喜欢。但是现在每个人都使用子查询,我讨厌它;我不知道为什么。
I lack the theoretical knowledge to judge for myself if there is any difference. Is a sub-query as good as a JOIN
and therefore is there nothing to worry about?
我缺乏理论知识来判断是否有任何差异。子查询是否与 a 一样好JOIN
,因此没有什么可担心的?
采纳答案by simhumileco
Taken from the MySQL manual(13.2.10.11 Rewriting Subqueries as Joins):
摘自 MySQL 手册(13.2.10.11 Rewriting Subqueries as Joins):
A LEFT [OUTER] JOIN can be faster than an equivalent subquery because the server might be able to optimize it better—a fact that is not specific to MySQL Server alone.
LEFT [OUTER] JOIN 可以比等效的子查询更快,因为服务器可能能够更好地优化它——这一事实不仅仅针对 MySQL 服务器。
So subqueries can be slower than LEFT [OUTER] JOIN
, but in my opinion their strength is slightly higher readability.
因此子查询可能比 慢LEFT [OUTER] JOIN
,但在我看来,它们的优势在于可读性稍高。
回答by Marcelo Cantos
Sub-queries are the logically correct way to solve problems of the form, "Get facts from A, conditional on facts from B". In such instances, it makes more logical sense to stick B in a sub-query than to do a join. It is also safer, in a practical sense, since you don't have to be cautious about getting duplicated facts from A due to multiple matches against B.
子查询是解决“从 A 获取事实,以来自 B 的事实为条件”形式的问题的逻辑正确方法。在这种情况下,将 B 放在子查询中比执行连接更合乎逻辑。从实际意义上讲,它也更安全,因为您不必因与 B 的多次匹配而对从 A 获取重复的事实保持谨慎。
Practically speaking, however, the answer usually comes down to performance. Some optimisers suck lemons when given a join vs a sub-query, and some suck lemons the other way, and this is optimiser-specific, DBMS-version-specific and query-specific.
然而,实际上,答案通常归结为性能。一些优化器在给定连接与子查询时会吃柠檬,而另一些优化器则会以另一种方式吸食柠檬,这是特定于优化器、特定于 DBMS 版本和特定于查询的。
Historically, explicit joins usually win, hence the established wisdom that joins are better, but optimisers are getting better all the time, and so I prefer to write queries first in a logically coherent way, and then restructure if performance constraints warrant this.
从历史上看,显式连接通常会获胜,因此连接的既定智慧更好,但优化器一直在变得更好,因此我更喜欢首先以逻辑连贯的方式编写查询,然后在性能限制允许的情况下进行重组。
回答by Kronass
In most cases JOIN
s are faster than sub-queries and it is very rare for a sub-query to be faster.
在大多数情况下,JOIN
s 比子查询更快,并且子查询更快是非常罕见的。
In JOIN
s RDBMS can create an execution plan that is better for your query and can predict what data should be loaded to be processed and save time, unlike the sub-query where it will run all the queries and load all their data to do the processing.
在JOIN
s RDBMS 中可以创建一个更适合您的查询的执行计划,并且可以预测应该加载哪些数据进行处理并节省时间,这与子查询不同,它将运行所有查询并加载所有数据进行处理.
The good thing in sub-queries is that they are more readable than JOIN
s: that's why most new SQL people prefer them; it is the easy way; but when it comes to performance, JOINS are better in most cases even though they are not hard to read too.
子查询的好处是它们比JOIN
s更具可读性:这就是大多数 SQL 新手更喜欢它们的原因;这是最简单的方法;但是在性能方面,JOINS 在大多数情况下更好,即使它们也不难阅读。
回答by Frank Heikens
Use EXPLAIN to see how your database executes the query on your data. There is a huge "it depends" in this answer...
使用 EXPLAIN 查看您的数据库如何对您的数据执行查询。这个答案中有一个巨大的“取决于”......
PostgreSQL can rewrite a subquery to a join or a join to a subquery when it thinks one is faster than the other. It all depends on the data, indexes, correlation, amount of data, query, etc.
当 PostgreSQL 认为一个比另一个更快时,它可以将子查询重写为连接或连接到子查询。这一切都取决于数据、索引、相关性、数据量、查询等。
回答by Trendfischer
In the year 2010 I would have joined the author of this questions and would have strongly voted for JOIN
, but with much more experience (especially in MySQL) I can state: Yes subqueries can be better. I've read multiple answers here; some stated subqueries are faster, but it lacked a good explanation. I hope I can provide one with this (very) late answer:
在 2010 年,我会加入这个问题的作者,并且会强烈投票支持JOIN
,但是有了更多的经验(尤其是在 MySQL 中),我可以声明:是的,子查询可以更好。我在这里阅读了多个答案;一些声明的子查询更快,但缺乏很好的解释。我希望我能提供一个(非常)迟到的答案:
First of all, let me say the most important: There are different forms of sub-queries
首先说最重要的:子查询有不同的形式
And the second important statement: Size matters
第二个重要声明:尺寸很重要
If you use sub-queries, you should be awareof how the DB-Server executes the sub-query. Especially if the sub-query is evaluated once or for every row!On the other side, a modern DB-Server is able to optimize a lot. In some cases a subquery helps optimizing a query, but a newer version of the DB-Server might make the optimization obsolete.
如果您使用子查询,您应该了解DB-Server 如何执行子查询。特别是如果子查询被评估一次或每一行!另一方面,现代 DB-Server 能够进行很多优化。在某些情况下,子查询有助于优化查询,但更新版本的 DB-Server 可能会使优化过时。
Sub-queries in Select-Fields
Select-Fields 中的子查询
SELECT moo, (SELECT roger FROM wilco WHERE moo = me) AS bar FROM foo
Be aware that a sub-query is executed for every resulting row from foo
.
Avoid this if possible; it may drastically slow down your query on huge datasets. However, if the sub-query has no reference to foo
it can be optimized by the DB-server as static content and could be evaluated only once.
请注意,对来自 的每个结果行执行子查询foo
。
如果可能,请避免这种情况;它可能会大大减慢您对大型数据集的查询速度。但是,如果子查询没有引用foo
它,则 DB 服务器可以将其优化为静态内容,并且只能评估一次。
Sub-queries in the Where-statement
Where语句中的子查询
SELECT moo FROM foo WHERE bar = (SELECT roger FROM wilco WHERE moo = me)
If you are lucky, the DB optimizes this internally into a JOIN
. If not, your query will become very, very slow on huge datasets because it will execute the sub-query for every row in foo
, not just the results like in the select-type.
如果幸运的话,DB 会在内部将其优化为JOIN
. 如果没有,您的查询在庞大的数据集上将变得非常非常缓慢,因为它将为 中的每一行执行子查询foo
,而不仅仅是像选择类型中的结果。
Sub-queries in the Join-statement
Join语句中的子查询
SELECT moo, bar
FROM foo
LEFT JOIN (
SELECT MIN(bar), me FROM wilco GROUP BY me
) ON moo = me
This is interesting. We combine JOIN
with a sub-query. And here we get the real strength of sub-queries. Imagine a dataset with millions of rows in wilco
but only a few distinct me
. Instead of joining against a huge table, we have now a smaller temporary table to join against. This can result in much faster queries depending on database size. You can have the same effect with CREATE TEMPORARY TABLE ...
and INSERT INTO ... SELECT ...
, which might provide better readability on very complex queries (but can lock datasets in a repeatable read isolation level).
这很有趣。我们结合JOIN
子查询。在这里,我们得到了子查询的真正优势。想象一个包含数百万行wilco
但只有几个不同的数据集me
。我们现在有一个较小的临时表要加入,而不是加入一个巨大的表。这可能会导致查询速度更快,具体取决于数据库大小。您可以使用CREATE TEMPORARY TABLE ...
and获得相同的效果INSERT INTO ... SELECT ...
,这可能为非常复杂的查询提供更好的可读性(但可以在可重复读取隔离级别锁定数据集)。
Nested sub-queries
嵌套子查询
SELECT moo, bar
FROM (
SELECT moo, CONCAT(roger, wilco) AS bar
FROM foo
GROUP BY moo
HAVING bar LIKE 'SpaceQ%'
) AS temp_foo
ORDER BY bar
You can nest sub-queries in multiple levels. This can help on huge datasets if you have to group or sort the results. Usually the DB-Server creates a temporary table for this, but sometimes you do not need sorting on the whole table, only on the resultset. This might provide much better performance depending on the size of the table.
您可以在多个级别嵌套子查询。如果您必须对结果进行分组或排序,这可以帮助处理庞大的数据集。通常 DB-Server 会为此创建一个临时表,但有时您不需要对整个表进行排序,只需要对结果集进行排序。这可能会提供更好的性能,具体取决于表的大小。
Conclusion
结论
Sub-queries are no replacement for a JOIN
and you should not use them like this (although possible). In my humble opinion, the correct use of a sub-query is the use as a quick replacement of CREATE TEMPORARY TABLE ...
. A good sub-query reduces a dataset in a way you cannot accomplish in an ON
statement of a JOIN
. If a sub-query has one of the keywords GROUP BY
or DISTINCT
and is preferably not situated in the select fields or the where statement, then it might improve performance a lot.
子查询不能替代 aJOIN
并且您不应该像这样使用它们(尽管可能)。在我看来,子查询的正确使用是作为CREATE TEMPORARY TABLE ...
. 一个良好的子查询减少你不能在一个实现方式的数据集ON
的声明JOIN
。如果子查询具有关键字GROUP BY
or之一,DISTINCT
并且最好不在选择字段或 where 语句中,那么它可能会大大提高性能。
回答by Unreason
First of all, to compare the two first you should distinguish queries with subqueries to:
首先,要先比较两者,您应该将查询与子查询区分开来:
- a class of subqueries that always have corresponding equivalent query written with joins
- a class of subqueries that can not be rewritten using joins
- 一类子查询,总是有对应的等价查询,用连接写
- 一类不能使用连接重写的子查询
For the first classof queries a good RDBMS will see joins and subqueries as equivalent and will produce same query plans.
对于第一类查询,一个好的 RDBMS 会将连接和子查询视为等效,并将生成相同的查询计划。
These days even mysql does that.
如今,即使是 mysql 也是如此。
Still, sometimes it does not, but this does not mean that joins will always win - I had cases when using subqueries in mysql improved performance. (For example if there is something preventing mysql planner to correctly estimate the cost and if the planner doesn't see the join-variant and subquery-variant as same then subqueries can outperform the joins by forcing a certain path).
尽管如此,有时它不会,但这并不意味着连接总是会获胜 - 我在 mysql 中使用子查询时遇到了一些情况来提高性能。(例如,如果有什么阻止 mysql 规划器正确估计成本,并且如果规划器没有看到 join-variant 和 subquery-variant 相同,那么子查询可以通过强制某个路径来优于连接)。
Conclusion is that you should test your queries for both join and subquery variants if you want to be sure which one will perform better.
结论是,如果您想确定哪一个会表现得更好,您应该针对连接和子查询变体测试您的查询。
For the second classthe comparison makes no sense as those queries can not be rewritten using joins and in these cases subqueries are natural way to do the required tasks and you should not discriminate against them.
对于第二类,比较没有意义,因为这些查询不能使用连接重写,在这些情况下,子查询是完成所需任务的自然方式,您不应该歧视它们。
回答by pkaramol
I think what has been under-emphasized in the cited answers is the issue of duplicatesand problematic results that may arise from specific (use) cases.
我认为引用的答案中没有强调的是重复问题和特定(使用)案例可能产生的有问题的结果。
(although Marcelo Cantos does mention it)
(虽然马塞洛·坎托斯确实提到过)
I will cite the example from Stanford's Lagunita courses on SQL.
我将引用斯坦福大学关于 SQL 的 Lagunita 课程中的示例。
Student Table
学生桌
+------+--------+------+--------+
| sID | sName | GPA | sizeHS |
+------+--------+------+--------+
| 123 | Amy | 3.9 | 1000 |
| 234 | Bob | 3.6 | 1500 |
| 345 | Craig | 3.5 | 500 |
| 456 | Doris | 3.9 | 1000 |
| 567 | Edward | 2.9 | 2000 |
| 678 | Fay | 3.8 | 200 |
| 789 | Gary | 3.4 | 800 |
| 987 | Helen | 3.7 | 800 |
| 876 | Irene | 3.9 | 400 |
| 765 | Jay | 2.9 | 1500 |
| 654 | Amy | 3.9 | 1000 |
| 543 | Craig | 3.4 | 2000 |
+------+--------+------+--------+
Apply Table
申请表
(applications made to specific universities and majors)
(针对特定大学和专业的申请)
+------+----------+----------------+----------+
| sID | cName | major | decision |
+------+----------+----------------+----------+
| 123 | Stanford | CS | Y |
| 123 | Stanford | EE | N |
| 123 | Berkeley | CS | Y |
| 123 | Cornell | EE | Y |
| 234 | Berkeley | biology | N |
| 345 | MIT | bioengineering | Y |
| 345 | Cornell | bioengineering | N |
| 345 | Cornell | CS | Y |
| 345 | Cornell | EE | N |
| 678 | Stanford | history | Y |
| 987 | Stanford | CS | Y |
| 987 | Berkeley | CS | Y |
| 876 | Stanford | CS | N |
| 876 | MIT | biology | Y |
| 876 | MIT | marine biology | N |
| 765 | Stanford | history | Y |
| 765 | Cornell | history | N |
| 765 | Cornell | psychology | Y |
| 543 | MIT | CS | N |
+------+----------+----------------+----------+
Let's try to find the GPA scores for students that have applied to CS
major (regardless of the university)
让我们试着找出申请CS
专业的学生的GPA分数(不分大学)
Using a subquery:
使用子查询:
select GPA from Student where sID in (select sID from Apply where major = 'CS');
+------+
| GPA |
+------+
| 3.9 |
| 3.5 |
| 3.7 |
| 3.9 |
| 3.4 |
+------+
The average value for this resultset is:
此结果集的平均值为:
select avg(GPA) from Student where sID in (select sID from Apply where major = 'CS');
+--------------------+
| avg(GPA) |
+--------------------+
| 3.6800000000000006 |
+--------------------+
Using a join:
使用连接:
select GPA from Student, Apply where Student.sID = Apply.sID and Apply.major = 'CS';
+------+
| GPA |
+------+
| 3.9 |
| 3.9 |
| 3.5 |
| 3.7 |
| 3.7 |
| 3.9 |
| 3.4 |
+------+
average value for this resultset:
此结果集的平均值:
select avg(GPA) from Student, Apply where Student.sID = Apply.sID and Apply.major = 'CS';
+-------------------+
| avg(GPA) |
+-------------------+
| 3.714285714285714 |
+-------------------+
It is obvious that the second attempt yields misleading results in our use case, given that it counts duplicates for the computation of the average value.
It is also evident that usage of distinct
with the join - based statement will noteliminate the problem, given that it will erroneously keep one out of three occurrences of the 3.9
score. The correct case is to account for TWO (2)occurrences of the 3.9
score given that we actually have TWO (2)students with that score that comply with our query criteria.
很明显,第二次尝试在我们的用例中产生了误导性的结果,因为它计算平均值的重复次数。很明显,使用distinct
基于连接的语句不会消除该问题,因为它会错误地保留三分之一出现的3.9
分数。正确的情况是考虑到分数的两 (2)次出现,3.9
因为我们实际上有两 (2)名学生的分数符合我们的查询标准。
It seems that in some cases a sub-query is the safest way to go, besides any performance issues.
似乎在某些情况下,除了任何性能问题之外,子查询是最安全的方法。
回答by U?ur Gümü?han
MSDN Documentation for SQL Server says
Many Transact-SQL statements that include subqueries can be alternatively formulated as joins. Other questions can be posed only with subqueries. In Transact-SQL, there is usually no performance difference between a statement that includes a subquery and a semantically equivalent version that does not. However, in some cases where existence must be checked, a join yields better performance. Otherwise, the nested query must be processed for each result of the outer query to ensure elimination of duplicates. In such cases, a join approach would yield better results.
许多包含子查询的 Transact-SQL 语句也可以表述为连接。其他问题只能通过子查询提出。在 Transact-SQL 中,包含子查询的语句与不包含子查询的语义等效版本之间通常没有性能差异。但是,在某些必须检查存在性的情况下,连接会产生更好的性能。否则,必须为外部查询的每个结果处理嵌套查询,以确保消除重复项。在这种情况下,连接方法会产生更好的结果。
so if you need something like
所以如果你需要类似的东西
select * from t1 where exists select * from t2 where t2.parent=t1.id
try to use join instead. In other cases, it makes no difference.
尝试使用 join 代替。在其他情况下,它没有区别。
I say: Creating functionsfor subqueries eliminate the problem of cluttter and allows you to implement additional logic to subqueries. So I recommend creating functions for subqueries whenever possible.
我说:为子查询创建函数可以消除混乱的问题,并允许您为子查询实现额外的逻辑。所以我建议尽可能为子查询创建函数。
Clutter in code is a big problem and the industry has been working on avoiding it for decades.
代码混乱是一个大问题,几十年来,业界一直在努力避免它。
回答by Jason
Run on a very large database from an old Mambo CMS:
从旧的 Mambo CMS 在一个非常大的数据库上运行:
SELECT id, alias
FROM
mos_categories
WHERE
id IN (
SELECT
DISTINCT catid
FROM mos_content
);
0 seconds
0 秒
SELECT
DISTINCT mos_content.catid,
mos_categories.alias
FROM
mos_content, mos_categories
WHERE
mos_content.catid = mos_categories.id;
~3 seconds
~3 秒
An EXPLAIN shows that they examine the exact same number of rows, but one takes 3 seconds and one is near instant. Moral of the story? If performance is important (when isn't it?), try it multiple ways and see which one is fastest.
EXPLAIN 显示他们检查的行数完全相同,但一次需要 3 秒,而一次几乎是即时的。故事的道德启示?如果性能很重要(什么时候不重要?),请尝试多种方式,看看哪种方式最快。
And...
和...
SELECT
DISTINCT mos_categories.id,
mos_categories.alias
FROM
mos_content, mos_categories
WHERE
mos_content.catid = mos_categories.id;
0 seconds
0 秒
Again, same results, same number of rows examined. My guess is that DISTINCT mos_content.catid takes far longer to figure out than DISTINCT mos_categories.id does.
同样,相同的结果,相同数量的检查行。我的猜测是 DISTINCT mos_content.catid 比 DISTINCT mos_categories.id 花费的时间长得多。
回答by Vijay Gajera
As per my observation like two cases, if a table has less then 100,000 records then the join will work fast.
根据我对两种情况的观察,如果一个表的记录少于 100,000 条,那么连接将快速运行。
But in the case that a table has more than 100,000 records then a subquery is best result.
但是在表有超过 100,000 条记录的情况下,子查询是最好的结果。
I have one table that has 500,000 records on that I created below query and its result time is like
我有一张有 500,000 条记录的表,我在查询下面创建了它,它的结果时间就像
SELECT *
FROM crv.workorder_details wd
inner join crv.workorder wr on wr.workorder_id = wd.workorder_id;
Result : 13.3 Seconds
结果:13.3 秒
select *
from crv.workorder_details
where workorder_id in (select workorder_id from crv.workorder)
Result : 1.65 Seconds
结果:1.65 秒