SQL 集群与非集群
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/7605707/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Clustered vs Non-Clustered
提问by Craig
My lower level knowledge of SQL (Server 2008) is limited, and is now being challanged by our DBAs. Let me explain (I have mentioned obvious statements in the hope that I am right, but if you see something wrong, please tell me) the scenario:
我对 SQL (Server 2008) 的低级知识是有限的,现在正受到我们的 DBA 的挑战。让我解释一下(我已经提到了明显的陈述,希望我是对的,但如果你看到错误,请告诉我)场景:
We have a table which holds 'Court Orders' for people. When I created the table, (Name: CourtOrder), I created it like:
我们有一张桌子,上面放着人们的“法庭命令”。当我创建表时,(名称:CourtOrder),我像这样创建它:
CREATE TABLE dbo.CourtOrder
(
CourtOrderID INT NOT NULL IDENTITY(1,1), (Primary Key)
PersonId INT NOT NULL,
+ around 20 other fields of different types.
)
I then applied a non-clustered index to the primary key (for efficiency). My reasons is that it is a unique field (primary key), and should be indexed, mainly for selection purposes, as we often Select from table where primary key = ...
然后我对主键应用了一个非聚集索引(为了效率)。我的理由是它是一个唯一的字段(主键),应该被索引,主要是为了选择目的,就像我们经常Select from table where primary key = ...
I then applied a CLUSTERED index on PersonId. The reason was to group orders for a particular person physically, as the vast majority of work is getting orders for a person. So, select from mytable where personId = ...
然后我在 PersonId 上应用了一个 CLUSTERED 索引。原因是在物理上对特定人的订单进行分组,因为绝大多数工作都是为一个人获得订单。所以,select from mytable where personId = ...
I have been pulled up on this now. I have been told that we should put the clustered index on the primary key, and the normal index on the personId. That seems very strange to me. First off, why would you put a clustered index on a unique column? what is it clustering? Surely that's a waste of the clustered index? I'd have believed a normal index would be used on a unique column. Also, clustering the index would mean we can't cluster a different column (One per table, right?).
我现在已经被拉上来了。有人告诉我,我们应该将聚集索引放在主键上,将普通索引放在 personId 上。这对我来说似乎很奇怪。首先,为什么要将聚集索引放在唯一列上?什么是聚类?这肯定是对聚集索引的浪费吗?我曾相信一个普通的索引会被用在一个唯一的列上。此外,对索引进行聚类意味着我们不能对不同的列进行聚类(每个表一个,对吗?)。
The reasoning for me being told I have made a mistake is that they believe putting a clustered index on the PersonId would make inserts slow. For the 5% gain in speed of a select, we would be getting a 95% degradation in speed on inserts and updates. Is that correct and valid?
我被告知我犯了一个错误的原因是他们认为在 PersonId 上放置聚集索引会使插入变慢。对于选择速度提高 5% 的情况,插入和更新的速度将降低 95%。这是正确和有效的吗?
They say that because we cluster the personId, SQL Server has to rearrange data when ever we insert or make a change to the PersonId.
他们说,因为我们对 personId 进行了聚类,所以每当我们插入或更改 PersonId 时,SQL Server 都必须重新排列数据。
So then I have asked, why would SQL have the concept of a CLUSTERED INDEX, if it's so slow? Is it as slow as they're saying? How should I have setup my indexes to achieve optimum performance? I'd have thought SELECT is used more than INSERT... but they say that we're having locking issues on INSERTS...
那么我问,如果 SQL 这么慢,为什么会有集群索引的概念?有他们说的那么慢吗?我应该如何设置索引以实现最佳性能?我原以为 SELECT 比 INSERT 使用得更多……但他们说我们在 INSERTS 上遇到了锁定问题……
Hope someone can help me.
希望可以有人帮帮我。
回答by Adam Robinson
The distinction between a clustered vs. non-clustered index is that the clustered index determines the physical order of the rows in the database. In other words, applying the clustered index to PersonId
means that the rows will be physically sorted by PersonId
in the table, allowing an index search on this to go straight to the row (rather than a non-clustered index, which would direct you to the row's location, adding an extra step).
聚集索引与非聚集索引的区别在于聚集索引决定了数据库中行的物理顺序。换句话说,应用聚集索引PersonId
意味着行将PersonId
在表中物理排序,允许索引搜索直接到行(而不是非聚集索引,它将引导您到行的位置,添加一个额外的步骤)。
That said, it's unusualfor the primary key not to be the clustered index, but not unheard of. The issue with your scenario is actually the opposite of what you're assuming: you want uniquevalues in a clustered index, not duplicates. Because the clustered index determines the physical order of the row, if the index is on a non-unique column, then the server has to add a background value to rows who have a duplicate key value (in your case, any rows with the same PersonId
) so that the combined value (key + background value) is unique.
也就是说,主键不是聚集索引是不寻常的,但并非闻所未闻。您的方案的问题实际上与您假设的相反:您希望聚集索引中的唯一值,而不是重复值。因为聚集索引决定了行的物理顺序,如果索引位于非唯一列上,那么服务器必须为具有重复键值的行添加背景值(在您的情况下,任何具有相同键值的行)PersonId
) 以便组合值(键 + 背景值)是唯一的。
The only thing I would suggest is notusing a surrogate key (your CourtOrderId
) column as the primary key, but instead use a compound primary key of the PersonId
and some other uniquely-identifying column or set of columns. If that's not possible (or not practical), though, then put the clustered index on CourtOrderId
.
我唯一建议不要使用代理键(您的CourtOrderId
)列作为主键,而是使用PersonId
和其他一些唯一标识列或列集的复合主键。但是,如果这不可能(或不切实际),则将聚集索引放在CourtOrderId
.
回答by Darian Miller
I am by no means a SQL Expert...so take this as a developer's view rather than a DBA view..
我绝不是 SQL 专家......所以把它作为开发人员的观点而不是 DBA 的观点......
Inserts on clustered (physically ordered) indexes that aren't in sequential order cause extra work for inserts/updates. Also, if you have many inserts happening at once and they are all occurring in the same location, you end up with contention. Your specific performance varies based on your data and how you access it. The general rule of thumb is to build your clustered index on the most unique narrow value in your table (typically the PK)
在不按顺序排列的聚集(物理排序)索引上的插入会导致插入/更新的额外工作。此外,如果您同时发生许多插入并且它们都发生在同一位置,则最终会发生争用。您的具体表现因您的数据和访问方式而异。一般的经验法则是在表中最独特的窄值(通常是 PK)上构建聚簇索引
I'm assuming your PersonId won't be changing, so Updates don't come into play here. But consider a snapshot of a few rows with PersonId of 1 2 3 3 4 5 6 7 8 8
我假设您的 PersonId 不会改变,因此更新不会在这里发挥作用。但是考虑一个 PersonId 为 1 2 3 3 4 5 6 7 8 8 的几行的快照
Now insert 20 new rows for PersonId of 3. First, since this is not a unique key, the server adds some extra bytes to your value (behind the scenes) to make it unique (which also adds extra space) and then the location where these will reside has to be altered. Compare that to inserting an auto-incrementing PK where the inserts happen at the end. The non technical explanation would likely come down to this: there is less 'leaf-shuffling' work to do if it's naturally progressing higher values at the end of the table versus reworking location of the existing items at that location while inserting your items.
现在为 3 的 PersonId 插入 20 个新行。首先,由于这不是唯一键,服务器会向您的值(在幕后)添加一些额外的字节以使其唯一(这也增加了额外的空间),然后是位置这些将驻留必须改变。将其与插入最后发生的自动递增 PK 进行比较。非技术性解释可能归结为:如果在表格末尾自然地取得更高的值,而不是在插入项目时在该位置重新处理现有项目的位置,则需要做的“叶子洗牌”工作较少。
Now, if you are having issues with Inserts then you are likely inserting a bunch of the same (or similar) PersonId values at once which is causing this extra work in various places throughout the table and the fragmentation is killing you. The downside of switching to the PK being clustered in your case, is if you are having insert issues today on PersonIds that vary in value spread throughout the table, if you switch your clustered index to the PK and all of the inserts now happen in one location then your problem may actually get worse due to increased contention concentration. (On the flip side, if your inserts today are not spread out all over, but are all typically bunched in similar areas, then your problem will likely ease by switching your clustered index away from PersonId to your PK because you'll be minimizing the fragmentation.)
现在,如果您遇到插入问题,那么您可能会一次插入一堆相同(或相似)的 PersonId 值,这会导致在整个表的各个位置进行额外的工作,并且碎片正在杀死您。在您的情况下切换到集群的 PK 的缺点是,如果您今天在 PersonIds 上遇到插入问题,这些问题的值分布在整个表中,如果您将集群索引切换到 PK 并且所有插入现在都发生在一个位置那么您的问题实际上可能会由于争用集中度的增加而变得更糟。(另一方面,如果您今天的插入内容没有分散开来,而是通常都聚集在相似的区域,那么您的问题可能会通过将聚集索引从 PersonId 切换到您的 PK 来缓解,因为您将最大限度地减少碎片化。)
Your performance problems should be analyzed to your unique situation and take these types of answers as general guidelines only. Your best bet is to rely on a DBA that can validate exactly where your problems lie. It sounds like you have resource contention issues that may be beyond a simple index tweak. This could be a symptom of a much larger problem. (Likely design issues...otherwise resource limitations.)
您的性能问题应该根据您的独特情况进行分析,并将这些类型的答案仅作为一般指导。您最好的选择是依靠 DBA,他可以准确地验证您的问题所在。听起来您的资源争用问题可能超出了简单的索引调整。这可能是一个更大问题的征兆。(可能是设计问题……否则资源限制。)
In any case, good luck!
无论如何,祝你好运!
回答by Martin Smith
Some authors do suggest not "wasting" the CI
on an identity
column if there is an alternative that would benefit range queries.
一些作者确实建议不要CI
在identity
列上“浪费” ,如果有一个有利于范围查询的替代方法。
From MSDN Clustered Index Design Guidelinesthe key should be chosen according to the following criteria
从 MSDN Clustered Index Design Guidelines 中,应根据以下标准选择键
- Can be used for frequently used queries.
- Provide a high degree of uniqueness.
- Can be used in range queries.
- 可用于经常使用的查询。
- 提供高度的独特性。
- 可用于范围查询。
Your CourtOrderID
column meets 2
. Your PersonId
meets 1
and 3
. As most rows will end up with the uniqueifier
added anyway you might as well just declare it as unique and use PersonId,CourtOrderID
as this will be the same width but be more useful as the clustered index key is added to all NCIs as the row locator and this will allow them to cover more queries.
您的CourtOrderID
专栏符合2
. 你的PersonId
遇见1
和3
。由于大多数行最终都会以uniqueifier
添加方式结束,因此您最好将其声明为唯一并使用,PersonId,CourtOrderID
因为这将具有相同的宽度但更有用,因为聚集索引键作为行定位器添加到所有 NCI 中,这将允许以涵盖更多查询。
The main issue with using PersonId,CourtOrderID
as the CI is that logical fragmentation will likely ensue (and this particularly affects the range queries you are trying to help) so you would need to monitor fill factor, and fragmentation levels and perform index maintenance more often.
PersonId,CourtOrderID
用作 CI的主要问题是可能会出现逻辑碎片(这尤其会影响您尝试帮助的范围查询),因此您需要监视填充因子和碎片级别并更频繁地执行索引维护。
回答by user2191454
It's explained in the following link: https://msdn.microsoft.com/en-us/ms190457.aspx
以下链接对此进行了解释:https: //msdn.microsoft.com/en-us/ms190457.aspx
Clustered
集群
Clustered indexes sort and store the data rows in the table or viewbased on their key values. These are the columns included in the index definition. There can be only one clustered index per table, because the data rows themselves can be sorted in only one order.
The only time the data rows in a table are stored in sorted order is when the table contains a clustered index. When a table has a clustered index, the table is called a clustered table. If a table has no clustered index, its data rows are stored in an unordered structure called a heap.
聚集索引根据键值对表或视图中的数据行进行排序和存储。这些是索引定义中包含的列。每个表只能有一个聚集索引,因为数据行本身只能按一种顺序排序。
只有当表包含聚集索引时,表中的数据行才会按排序顺序存储。当表具有聚集索引时,该表称为聚集表。如果表没有聚集索引,则其数据行存储在称为堆的无序结构中。
Nonclustered
非聚集
Nonclustered indexes have a structure separate from the data rows. A nonclustered index contains the nonclustered index key values and each key value entry has a pointer to the data row that contains the key value.
The pointer from an index row in a nonclustered index to a data row is called a row locator. The structure of the row locator depends on whether the data pages are stored in a heap or a clustered table. For a heap, a row locator is a pointer to the row. For a clustered table, the row locator is the clustered index key.
You can add nonkey columns to the leaf level of the nonclustered index to by-pass existing index key limits, 900 bytes and 16 key columns, and execute fully covered, indexed, queries.
非聚集索引具有与数据行分离的结构。非聚簇索引包含非聚簇索引键值,每个键值条目都有一个指向包含键值的数据行的指针。
从非聚集索引中的索引行到数据行的指针称为行定位符。行定位器的结构取决于数据页是存储在堆中还是聚簇表中。对于堆,行定位符是指向行的指针。对于聚簇表,行定位符是聚簇索引键。
您可以将非键列添加到非聚集索引的叶级别以绕过现有索引键限制、900 字节和 16 个键列,并执行完全覆盖的索引查询。
回答by toLucky
Some db with some nasty selects, joins in a stored procedure - only diffrence is the index
一些带有一些讨厌的选择的数据库,加入存储过程 - 唯一的区别是索引
INDEXES - clustered vs nonclustered
索引 - 集群与非集群
891 rows
10 sec
NONCLUSTERED
OR
891 rows
14 sec
CLUSTERED