Does the order of fields in a WHERE clause affect performance in MySQL?
Notice: this page is based on a popular Stack Overflow question, originally presented as a Chinese-English parallel translation, and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/4035760/
Asked by serg
I have two indexed fields in a table - type and userid (individual indexes, not a composite).
The values of the type field are very limited (let's say only 0 or 1), so 50% of the table's records have the same type. userid values, on the other hand, come from a much larger set, so the number of records with the same userid is small.
Will any of these queries run faster than the other:
select * from table where type=1 and userid=5
select * from table where userid=5 and type=1
Also if both fields were not indexed, would it change the behavior?
Answered by Jim Ferrans
SQL was designed to be a declarative language, not a procedural one. So the query optimizer should not consider the order of the where clause predicates in determining how to apply them.
I'm probably going to waaaay over-simplify the following discussion of an SQL query optimizer. I wrote one years ago, along these lines (it was tons of fun!). If you really want to dig into modern query optimization, see Dan Tow's SQL Tuning, from O'Reilly.
In a simple SQL query optimizer, the SQL statement first gets compiled into a tree of relational algebra operations. These operations each take one or more tables as input and produce another table as output. Scan is a sequential scan that reads a table in from the database. Sort produces a sorted table. Select produces a table whose rows are selected from another table according to some selection condition. Project produces a table with only certain columns of another table. Cross Product takes two tables and produces an output table composed of every conceivable pairing of their rows.
Confusingly, the SQL SELECT clause is compiled into a relational algebra Project, while the WHERE clause turns into a relational algebra Select. The FROM clause turns into one or more Joins, each taking two tables in and producing one table out. There are other relational algebra operations involving set union, intersection, difference, and membership, but let's keep this simple.
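As a rough illustration only (this is not how MySQL is implemented), the operators described above can be sketched as small Python functions over lists of dicts. The function names, column names, and sample rows are all made up for this sketch.

```python
from itertools import product

def scan(table):
    """Scan: read every row of a stored table, in order."""
    for row in table:
        yield dict(row)

def select(rows, condition):
    """Select: keep only rows that satisfy the condition (SQL's WHERE)."""
    return (row for row in rows if condition(row))

def project(rows, columns):
    """Project: keep only the named columns (SQL's SELECT list)."""
    return ({c: row[c] for c in columns} for row in rows)

def cross_product(left, right):
    """Cross Product: pair every left row with every right row."""
    for l, r in product(list(left), list(right)):
        yield {**l, **r}

# Hypothetical sample data; column names are prefixed to avoid clashes.
employee = [
    {"e_id": 1, "e_name": "Ann", "e_dept_id": 10},
    {"e_id": 2, "e_name": "Bob", "e_dept_id": 20},
]
department = [
    {"d_dept_id": 10, "d_name": "Sales"},
    {"d_dept_id": 20, "d_name": "Support"},
]

# Unoptimized tree: Scan both tables, Cross Product, Select, then Project.
pairs = cross_product(scan(employee), scan(department))
matched = select(
    pairs, lambda r: r["e_id"] == 2 and r["e_dept_id"] == r["d_dept_id"]
)
result = list(project(matched, ["e_name", "d_name"]))
```

Here the SQL SELECT list maps to project() and the WHERE clause maps to select(), which mirrors the naming confusion mentioned above.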
This tree really needs to be optimized. For example, if you have:
select E.name, D.name
from Employee E, Department D
where E.id = 123456 and E.dept_id = D.dept_id
with 5,000 employees in 500 departments, executing an unoptimized tree will blindly produce all possible combinations of one Employee and one Department (a Cross Product) and then Select out just the one combination that was needed. The Scan of Employee will produce a 5,000 record table, the Scan of Department will produce a 500 record table, the Cross Product of those two tables will produce a 2,500,000 record table, and the Select on E.id will take that 2,500,000 record table and discard all but one, the record that was wanted.
[Real query processors will try not to materialize all of these intermediate tables in memory of course.]
So the query optimizer walks the tree and applies various optimizations. One is to break up each Select into a chain of Selects, one for each of the original Select's top level conditions, the ones and-ed together. (This is called "conjunctive normal form".) Then the individual smaller Selects are moved around in the tree and merged with other relational algebra operations to form more efficient ones.
In the above example, the optimizer first pushes the Select on E.id = 123456 down below the expensive Cross Product operation. This means the Cross Product just produces 500 rows (one for each combination of that employee and one department). Then the top level Select for E.dept_id = D.dept_id filters out the 499 unwanted rows. Not bad.
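The effect of pushing the Select below the Cross Product can be checked with a small Python sketch. The data is synthetic and sized like the example (5,000 employees, 500 departments); the names and the id 1234 are assumptions made for illustration, and the loops only count how many pairs each plan builds.

```python
from itertools import product

# Hypothetical data sized like the example: 5,000 employees, 500 departments.
employees = [{"id": i, "dept_id": i % 500} for i in range(5000)]
departments = [{"dept_id": d} for d in range(500)]
wanted_id = 1234  # stand-in for E.id = 123456

def unoptimized():
    """Cross Product first, then Select on both conditions."""
    pairs_built = 0
    result = []
    for e, d in product(employees, departments):
        pairs_built += 1
        if e["id"] == wanted_id and e["dept_id"] == d["dept_id"]:
            result.append((e["id"], d["dept_id"]))
    return pairs_built, result

def select_pushed_down():
    """Select on E.id first, so the Cross Product sees one employee."""
    one_employee = [e for e in employees if e["id"] == wanted_id]
    pairs_built = 0
    result = []
    for e, d in product(one_employee, departments):
        pairs_built += 1
        if e["dept_id"] == d["dept_id"]:
            result.append((e["id"], d["dept_id"]))
    return pairs_built, result

slow_pairs, slow_result = unoptimized()
fast_pairs, fast_result = select_pushed_down()
```

Both plans return the same single row, but the first builds 2,500,000 pairs and the second builds 500.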
If there's an index on Employee's id field, then the optimizer can combine the Scan of Employee with the Select on E.id = 123456 to form a fast index Lookup. This means that only one Employee row is read into memory from disk instead of 5,000. Things are looking up.
The final major optimization is to take the Select on E.dept_id = D.dept_id and combine it with the Cross Product. This turns it into a relational algebra Equijoin operation. This doesn't do much by itself. But if there's an index on Department.dept_id, then the lower level sequential Scan of Department feeding the Equijoin can be turned into a very fast index Lookup of our one employee's Department record.
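With both indexes in place, the whole plan reduces to two lookups. A minimal sketch, assuming Python dicts as stand-ins for the indexes on Employee.id and Department.dept_id (real MySQL indexes are B-trees, and all names here are hypothetical):

```python
# Hypothetical tables, with dicts standing in for the two indexes.
employees = [
    {"id": i, "name": f"emp{i}", "dept_id": i % 500} for i in range(5000)
]
departments = [{"dept_id": d, "name": f"dept{d}"} for d in range(500)]

# "Index" on Employee.id and on Department.dept_id.
employee_by_id = {e["id"]: e for e in employees}
department_by_dept_id = {d["dept_id"]: d for d in departments}

def lookup_employee_department(employee_id):
    """Two index lookups replace both Scans and the Cross Product."""
    e = employee_by_id.get(employee_id)
    if e is None:
        return None  # no such employee
    d = department_by_dept_id.get(e["dept_id"])
    if d is None:
        return None  # employee points at a missing department
    return (e["name"], d["name"])

row = lookup_employee_department(1234)
```

Only one Employee row and one Department row are touched, no matter how large the tables are.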
Lesser optimizations involve pushing Project operations down. If the top level of your query just needs E.name and D.name, and the conditions need E.id, E.dept_id, and D.dept_id, then the Scan operations don't have to build intermediate tables with all the other columns, saving space during the query execution. We've turned a horribly slow query into two index lookups and not much else.
Getting more towards the original question, let's say you've got:
select E.name
from Employee E
where E.age > 21 and E.state = 'Delaware'
The unoptimized relational algebra tree, when executed, would Scan in the 5,000 employees and produce, say, the 126 ones in Delaware who are older than 21. The query optimizer also has some rough idea of the values in the database. It might know that the E.state column has the 14 states that the company has locations in, and something about the E.age distributions. So first it sees if either field is indexed. If E.state is, it makes sense to use that index to just pick out the small number of employees the query processor suspects are in Delaware based on its last computed statistics. If only E.age is, the query processor likely decides that it's not worth it, since 96% of all employees are 22 and older. So if E.state is indexed, our query processor breaks the Select and merges the E.state = 'Delaware' with the Scan to turn it into a much more efficient Index Scan.
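The index choice described above depends on statistics, not on the order in which the conditions were written. A toy sketch of that decision, assuming a made-up choose_index helper and an arbitrary 30% cutoff (real optimizers use far more detailed cost models):

```python
def choose_index(conditions, indexed_columns, estimated_rows,
                 table_rows, max_fraction=0.3):
    """Pick the indexed condition expected to match the fewest rows.

    conditions: column names in the order they appear in the WHERE clause.
    indexed_columns: set of columns that have an index.
    estimated_rows: column -> rows the statistics expect to match.
    max_fraction: hypothetical cutoff; an index expected to match more
        than this share of the table is judged not worth using.
    Returns None when a plain sequential Scan looks better.
    """
    candidates = [c for c in conditions if c in indexed_columns]
    if not candidates:
        return None
    best = min(candidates, key=lambda c: estimated_rows[c])
    if estimated_rows[best] > max_fraction * table_rows:
        return None
    return best

# Hypothetical statistics for 5,000 employees:
# 14 states, and 96% of employees aged 22 or older.
estimates = {"state": 5000 // 14, "age": 5000 * 96 // 100}

written_age_first = choose_index(
    ["age", "state"], {"age", "state"}, estimates, 5000)
written_state_first = choose_index(
    ["state", "age"], {"age", "state"}, estimates, 5000)
only_age_indexed = choose_index(
    ["age", "state"], {"age"}, estimates, 5000)
```

In this sketch both orderings pick the state index, and when only age is indexed the helper falls back to a scan.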
Let's say in this example that there are no indexes on E.state and E.age. The combined Select operation takes place after the sequential Scan of Employee. Does it make a difference which condition in the Select is done first? Probably not a great deal. The query processor might leave them in the original order in the SQL statement, or it might be a bit more sophisticated and look at the expected expense. From the statistics, it would again find that the E.state = 'Delaware' condition should be more highly selective, so it would reverse the conditions and do that first, so that the E.age > 21 comparison runs only for the employees in Delaware (a few hundred at most) instead of all 5,000. Or it might realize that string equality comparisons are much more expensive than integer compares and leave the order alone.
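To see how much evaluation order matters without indexes, here is a small counting sketch. The data is synthetic (the state names, age spread, and run_filter helper are assumptions), so the counts differ from the 126 in the example; the point is only that the result is identical while the number of comparisons changes.

```python
# Hypothetical data: 5,000 employees over 14 states, ages 18 to 67.
STATES = ["Delaware"] + [f"State{n}" for n in range(1, 14)]
employees = [
    {"state": STATES[i % 14], "age": 18 + i % 50} for i in range(5000)
]

def run_filter(first, second):
    """Apply two conditions with short-circuiting; count evaluations."""
    counts = {first[0]: 0, second[0]: 0}
    matches = 0
    for e in employees:
        counts[first[0]] += 1
        if not first[1](e):
            continue  # second condition is never evaluated for this row
        counts[second[0]] += 1
        if second[1](e):
            matches += 1
    return matches, counts

age_cond = ("age", lambda e: e["age"] > 21)
state_cond = ("state", lambda e: e["state"] == "Delaware")

age_first = run_filter(age_cond, state_cond)
state_first = run_filter(state_cond, age_cond)
```

With this data, checking state first needs 358 age comparisons, while checking age first needs 4,600 state comparisons. Either way, all 5,000 rows are still scanned, which is why the difference is usually small.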
At any rate, all this is very complex and your syntactic condition order is very unlikely to make a difference. I wouldn't worry about it unless you have a real performance problem and your database vendor uses the condition order as a hint.
Answered by Andomar
Most query optimizers use the order in which conditions appear as a hint. If everything else is equal, they will follow that order.
However, many things can override that:
- the second field has an index and the first has not
- there are statistics to suggest that field 2 is more selective
- the second field is easier to search (varchar(max) vs int)
So (and this is true for all SQL optimization questions) unless you observe a performance issue, it's better to optimize for clarity, not for (imagined) performance.
Answered by Cfreak
It shouldn't in your small example. The query optimizer should do the right thing. You can check for sure by adding explain to the front of the query. MySQL will tell you how it's joining things together and how many rows it needs to search in order to do the join. For example:
explain select * from table where type=1 and userid=5
If they were not indexed it would probably change behavior.