SQL 索引中列的顺序有多重要?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/2292662/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-01 05:25:49  来源:igfitidea点击:

How important is the order of columns in indexes?

sqlsql-serversql-server-2005tsqlindexing

提问by Abe Miessler

I've heard that you should put columns that will be the most selective at the beginning of the index declaration. Example:

我听说您应该将最具选择性的列放在索引声明的开头。例子:

CREATE NONCLUSTERED INDEX MyINDX on Table1
(
   MostSelective,
   SecondMost,
   Least
)

First off, is what I'm saying correct? If so, am i likely to see large differences in performance by rearranging the order of the columns in my index or is it more of a "nice to do" practice?

首先,我说的对吗?如果是这样,通过重新排列索引中列的顺序,我是否可能会看到性能上的巨大差异,还是更像是一种“好做”的做法?

The reason I'm asking is because after putting a query through the DTA it recommended that I create an index that had almost all of the same columns in it as an existing index, just in a different order. I was considering just adding the missing columns to the existing index and calling it good. Thoughts?

我问的原因是因为在通过 DTA 进行查询后,它建议我创建一个索引,该索引中几乎所有的列都与现有索引相同,只是顺序不同。我正在考虑将缺失的列添加到现有索引中并称其为好。想法?

回答by Nick Craver

Look at an index like this:

看看这样的索引:

Cols
  1   2   3
-------------
|   | 1 |   |
| A |---|   |
|   | 2 |   |
|---|---|   |
|   |   |   |
|   | 1 | 9 |
| B |   |   |
|   |---|   |
|   | 2 |   |
|   |---|   |
|   | 3 |   |
|---|---|   |

See how restricting on A first, as your first column eliminates more results than restricting on your second column first? It's easier if you picture how the index must be traversed across, column 1, then column 2, etc...you see that lopping off most of the results in the fist pass makes the 2nd step that much faster.

看看如何限制 A 首先,因为您的第一列消除的结果多于首先限制第二列的结果?如果您想象必须如何遍历索引、第 1 列、然后第 2 列等,则更容易……您会看到在第一次传递中删除大部分结果会使第二步更快。

Another case, if you queried on column 3, the optimizer wouldn't even use the index, because it's not helpful at all in narrowing down the result sets. Anytime you're in a query, narrowing down the number of results to deal with before the next step means better performance.

另一种情况,如果您查询第 3 列,优化器甚至不会使用索引,因为它对缩小结果集毫无帮助。 无论何时进行查询,在下一步之前缩小要处理的结果数量都意味着更好的性能。

Since the index is also stored this way, there's no backtracking across the index to find the first column when you're querying on it.

由于索引也是以这种方式存储的,因此当您查询第一列时,无需回溯索引即可找到第一列。

In short: No, it's not for show, there are real performance benefits.

简而言之:不,这不是为了表演,有真正的性能优势。

回答by Remus Rusanu

The order of columns is critical. Now which order is correct it depends on how you are going to query it. An index can be used to do an exact seek or an range scan. An exact seek is when values for all columns in the index are specified and the query lands exactly on the row is interested in. For seeks the order of columns is irrelevant. A range scan is when only some columns are specified, and in this case when the order becomes important. SQL Server can use an index for a range scan only if the leftmost column is specified, and then only if the next leftmost column is specified, and so on. If you have an index on (A,B,C) it can be used to range scan for A=@a, for A=@a AND B=@bbut notfor B=@b, for C=@cnorB=@b AND C=@c. The case A=@a AND C=@cis mixed one, as in the A=@aportion will use the index, but the C=@cnot (the query will scan all B values for A=@a, will not 'skip' to C=@c). Other database systems have the so called 'skip scan' operator that can take some advantage of inner columns in an index when the outer columns are not specified.

列的顺序很关键。现在哪个顺序是正确的,这取决于您将如何查询它。索引可用于进行精确查找或范围扫描。精确查找是指指定索引中所有列的值并且查询正好位于感兴趣的行上。对于查找,列的顺序无关紧要。范围扫描是指只指定了一些列,在这种情况下,顺序变得很重要。仅当指定了最左边的列,然后仅当指定了下一个最左边的列时,SQL Server 才能使用索引进行范围扫描,依此类推。如果您有 (A,B,C) 上的索引,它可用于范围扫描 for A=@a、 forA=@a AND B=@b不是for B=@b、 forC=@cB=@b AND C=@c。这种情况A=@a AND C=@c是混合的,如A=@a部分将使用索引,但C=@c不使用(查询将扫描 的所有 B 值A=@a,不会“跳过”到C=@c)。其他数据库系统具有所谓的“跳过扫描”运算符,当未指定外部列时,它可以利用索引中的内部列。

With that knowledge in hand you can look at the index definitions again. An index on (MostSelective, SecondMost, Least)will be effective only when MostSelectivecolumn is specified. But that being the most selective, the relevance of the inner columns will quickly degrade. Very often you'll find that a better index is on (MostSelective) include (SecondMost, Least)or on (MostSelective, SecondMost) include (Least). Because the inner columns are less relevant, placing low selectivity columns in such right positions in the index makes them nothing but noise for a seek, so it makes sense to move them out of the intermediate pages and keep them only on the leaf pages, for query coverability purposes. In other words, move them to INCLUDE. This becomes more important as the size of Leastcolumn increases. The idea is that this index can only benefit queries that specify MostSelectiveeither as an exact value or a range, and that column being the most selective it already restricts the candidate rows to great extent.

掌握了这些知识后,您可以再次查看索引定义。(MostSelective, SecondMost, Least)只有在MostSelective指定了 column时,索引才有效。但这是最具选择性的,内部列的相关性将迅速降低。很多时候你会发现更好的索引是 on(MostSelective) include (SecondMost, Least)或 on (MostSelective, SecondMost) include (Least)。由于内部列的相关性较低,将低选择性列放置在索引中的正确位置只会使它们成为搜索的噪音,因此将它们移出中间页面并仅将它们保留在叶页面上是有意义的,因为查询可覆盖性目的。换句话说,将它们移动到 INCLUDE。随着Least列的大小增加,这变得更加重要。这个想法是这个索引只能有利于指定的查询MostSelective无论是作为精确值还是范围,并且该列是最具选择性的,它已经在很大程度上限制了候选行。

On the other hand an index on (Least, SecondMost, MostSelective)may seem a mistake, but it actually quite a powerful index. Because it has the Leastcolumn as its outermost query, it can be used for queries that have to aggregate results on low selectivity columns. Such queries are prevalent in OLAP and analysis data warehouses, and this is exactly where such indexes have a very good case going for them. Such indexes actually make excellent clusteredindexes, exactly because they organize the physical layout on large chunks of related rows (same Leastvalue, which usually indicate some sort of category or type) and they facilitate analysis queries.

另一方面,索引(Least, SecondMost, MostSelective)似乎是一个错误,但它实际上是一个非常强大的索引。因为它将Least列作为其最外层查询,所以它可用于必须在低选择性列上聚合结果的查询。此类查询在 OLAP 和分析数据仓库中很普遍,而这正是此类索引非常适合它们的地方。这样的索引实际上是极好的聚集索引,正是因为它们在相关行的大块(相同Least值,通常表示某种类别或类型)上组织物理布局,并且它们便于分析查询。

So, unfortunately, there is no 'correct' order. You shouldn't follow any cookie cutter recipe but instead analyze the query pattern you are going to use against those tables and decide which index column order is right.

因此,不幸的是,没有“正确”的顺序。您不应该遵循任何千篇一律的方法,而是分析您将针对这些表使用的查询模式,并确定哪个索引列顺序是正确的。

回答by Martin Smith

As Remus says it depends on your workload.

正如 Remus 所说,这取决于您的工作量。

I want to address a misleading aspect of the accepted answer though.

不过,我想解决已接受答案的误导性方面。

For queries that are performing an equality search on all columns in the index there is no significant difference.

对于对索引中的所有列执行相等搜索的查询,没有显着差异。

The below creates two tables and populates them with identical data. The only difference is that one has the keys ordered from most to least selective and the other the reverse.

下面创建了两个表并用相同的数据填充它们。唯一的区别是一个键的顺序是从选择性最多到最少,而另一个则相反。

CREATE TABLE Table1(MostSelective char(800), SecondMost TINYINT, Least  CHAR(1), Filler CHAR(4000) null);
CREATE TABLE Table2(MostSelective char(800), SecondMost TINYINT, Least  CHAR(1), Filler CHAR(4000) null);

CREATE NONCLUSTERED INDEX MyINDX on Table1(MostSelective,SecondMost,Least);
CREATE NONCLUSTERED INDEX MyINDX2 on Table2(Least,SecondMost,MostSelective);

INSERT INTO Table1 (MostSelective, SecondMost, Least)
output inserted.* into Table2
SELECT TOP 26 REPLICATE(CHAR(number + 65),800), number/5, '~'
FROM master..spt_values
WHERE type = 'P' AND number >= 0
ORDER BY number;

Now doing a query against both of the tables...

现在对两个表进行查询...

SELECT *
FROM   Table1
WHERE  MostSelective = REPLICATE('P', 800)
       AND SecondMost = 3
       AND Least = '~';

SELECT *
FROM   Table2
WHERE  MostSelective = REPLICATE('P', 800)
       AND SecondMost = 3
       AND Least = '~'; 

... Both of them use an index fine and both are given the exact same cost.

...他们都使用索引罚款,并且都给出了完全相同的成本。

enter image description here

在此处输入图片说明

The ASCII art in the accepted answer is not in fact how indexes are structured. The index pages for Table1 are represented below (click the image to open in full size).

接受的答案中的 ASCII 艺术实际上并不是索引的结构方式。Table1 的索引页如下所示(单击图像以全尺寸打开)。

enter image description here

在此处输入图片说明

The index pages contain rows containing the whole key (in this case there is actually an additional key column appended for the row identifier as the index was not declared as unique but that can be disregarded further information about this can be found here).

索引页包含包含整个键的行(在这种情况下,实际上为行标识符附加了一个额外的键列,因为索引没有被声明为唯一的,但可以忽略关于此的更多信息可以在这里找到)。

For the query above SQL Server doesn't care about the selectivity of the columns. It does a binary search of the root page and discovers that the Key(PPP...,3,~ )is >=(JJJ...,1,~ )and < (SSS...,3,~ )so it should read page 1:118. It then does a binary search of the key entries on that page and locates the leaf page to travel down to.

对于上面的查询,SQL Server 不关心列的选择性。它对根页面进行二分搜索并发现Key(PPP...,3,~ )>=(JJJ...,1,~ )< (SSS...,3,~ )因此它应该读取 page 1:118。然后它对该页面上的关键条目进行二分搜索并定位要向下移动到的叶页面。

Altering the index in order of selectivity doesn't affect either the expected number of key comparisons from the binary search or the number of pages that need to be navigated to do an index seek. At best it mightmarginally speed up the key comparison itself.

按选择性顺序更改索引不会影响二进制搜索的预期键比较数或需要导航以进行索引查找的页面数。充其量它可能会略微加快关键比较本身。

Sometimes ordering the most selective index first will make sense for other queries in your workload though.

有时,首先对最具选择性的索引进行排序对于工作负载中的其他查询是有意义的。

E.g if the workload contains queries of both the following forms.

例如,如果工作负载包含以下两种形式的查询。

SELECT * ... WHERE  MostSelective = 'P'

SELECT * ...WHERE Least = '~'

The indexes above aren't covering for either of them. MostSelectiveis selective enough to make a plan with a seek and lookups worthwhile but the query against Leastisn't.

以上索引均未涵盖其中任何一个。MostSelective具有足够的选择性,可以制定有价值的搜索和查找计划,但查询Least却不是。

However this scenario (non covering index seek on subset of leading column(s) of a composite index) is only one possible class of query that can be helped by an index. If you never actually search by MostSelectiveon its own or a combination of MostSelective, SecondMostand always search by a combination of all three columns then this theoretical advantage is useless to you.

然而,这种情况(对复合索引的前导列的子集进行非覆盖索引查找)只是索引可以帮助的一类可能的查询。如果您从不实际单独搜索MostSelective或组合MostSelective, SecondMost搜索并且始终按所有三列的组合进行搜索,那么这种理论上的优势对您来说毫无用处。

Conversely queries such as

相反的查询,例如

SELECT MostSelective,
       SecondMost,
       Least
FROM   Table2
WHERE  Least = '~'
ORDER  BY SecondMost,
          MostSelective 

Would be helped by having the reverse order of the commonly prescribed one - as it covers the query, can support a seek and returns rows in the desired order to boot.

使用通常规定的相反顺序会有所帮助 - 因为它涵盖了查询,可以支持查找并以所需的顺序返回行以进行引导。

So this is an often repeated piece of advice but at most it's a heuristic about the potential benefit to otherqueries - and it is no substitute for actually looking at yourworkload.

因此,这是一条经常重复的建议,但至多是对其他查询的潜在好处的启发式方法- 它不能替代实际查看您的工作量。

回答by OMG Ponies

you should put columns that will be the most selective at the beginning of the index declaration.

您应该将最具选择性的列放在索引声明的开头。

Correct. Indexes can be composites - composed of multiple columns - and the order is important because of the leftmost principle. Reason is, that the database checks the list from left to right, and has to find a corresponding column reference matching the order defined. For example, having an index on an address table with columns:

正确的。索引可以是复合的——由多列组成——由于最左原则,顺序很重要。原因是,数据库从左到右检查列表,并且必须找到与定义的顺序匹配的相应列引用。例如,在具有列的地址表上建立索引:

  • Address
  • City
  • State
  • 地址
  • 城市
  • 状态

Any query using the addresscolumn can utilize the index, but if the query only has either cityand/or statereferences - the index can not be used. This is because the leftmost column isn't referenced. Query performance should tell you which is optimal - individual indexes, or multiple composites with different orders. Good read: The Tipping Point, by Kimberley Tripp

任何使用该address列的查询都可以使用索引,但如果查询只有一个city和/或state引用 - 则不能使用索引。这是因为最左边的列没有被引用。查询性能应该告诉您哪个是最佳的 - 单个索引,或具有不同顺序的多个组合。好读物:引爆点,金伯利·特里普着

回答by Rick James

All the other answers are wrong.

所有其他答案都是错误的。

Selectivity of the individual columns in a composite index does notmatter when picking the order.

选择顺序时,复合索引各个列的选择性无关紧要。

Here is the simple thought process: Effectively, an index is the concatenation of the columns involved.

这是一个简单的思考过程: 实际上,索引是所涉及的列的串联。

Giving that rationale, the only difference is comparing two 'strings' that differ earlier versus later in the string. This is a tiny part of the total cost. There is no "first pass / second pass", as mentioned in one Answer.

给出这个基本原理,唯一的区别是比较字符串中较早与较晚不同的两个“字符串”。这只是总成本的一小部分。正如一个答案中提到的那样,没有“第一次通过/第二次通过”。

So, what order should be used?

那么,应该使用什么顺序呢?

  1. Start with column(s) tested with =, in anyorder.
  2. Then tack on one range column.
  1. =任何顺序从用,测试的列开始。
  2. 然后添加一个范围列。

For example, the very-low selectivity column mustcome first in this:

例如,极低选择性列必须排在第一位:

WHERE deleted = 0  AND  the_datetime > NOW() - INTERVAL 7 DAY
INDEX(deleted, the_datetime)

Swapping the order in the index would have it totally ignore deleted.

交换索引中的顺序将使其完全忽略deleted

(There are a lot more rules for ordering the columns.)

(对列进行排序有更多规则。)