Original source: http://stackoverflow.com/questions/389348/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Should a Composite Primary Key be clustered in SQL Server?
Asked by Neil Barnwell
Consider this example table (assuming SQL Server 2005):
create table product_bill_of_materials
(
    parent_product_id int not null,
    child_product_id int not null,
    quantity int not null
)
I'm considering a composite primary key containing the two product_id columns (I'll definitely want a unique constraint) as opposed to a separate unique ID column. Question is, from a performance point of view, should that primary key be clustered?
Should I also create an index on each ID column so that lookups for the foreign keys are faster? I believe this table is going to get hit much more on reads than writes.
Answered by Tom H
As has already been said by several others, it depends on how you will access the table. Keep in mind though, that any RDBMS out there should be able to use the clustered index for searching by a single column as long as that column appears first. For example, if your clustered index is on (parent_id, child_id) you don't need another separate index on (parent_id).
Your best bet may be a clustered index on (parent_id, child_id), which also happens to be the primary key, with a separate non-clustered index on (child_id).
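As a rough T-SQL sketch of that layout, using the full column names from the question rather than the shorthand parent_id and child_id: the constraint and index names below are made up for illustration, and which index the optimizer actually picks depends on data volume and statistics.

```sql
-- Sketch only: constraint and index names are hypothetical.
CREATE TABLE product_bill_of_materials
(
    parent_product_id int NOT NULL,
    child_product_id  int NOT NULL,
    quantity          int NOT NULL,
    -- Composite primary key that is also the clustered index.
    CONSTRAINT PK_product_bill_of_materials
        PRIMARY KEY CLUSTERED (parent_product_id, child_product_id)
);

-- Separate non-clustered index for lookups by child alone.
CREATE NONCLUSTERED INDEX IX_product_bill_of_materials_child
    ON product_bill_of_materials (child_product_id);

-- Can seek on the clustered index: parent_product_id is its leading column.
-- 42 is an arbitrary sample value.
SELECT child_product_id, quantity
FROM product_bill_of_materials
WHERE parent_product_id = 42;

-- Can seek on the non-clustered index; its rows also carry the clustering
-- key, so parent_product_id is available without touching the base rows.
SELECT parent_product_id
FROM product_bill_of_materials
WHERE child_product_id = 7;
```

No separate index on parent_product_id is created here, since the clustered index already leads with that column.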
Ultimately, indexing should be addressed after you've got an idea of how the database will be accessed. Come up with some standard performance stress tests if you can and then analyze the behavior using a profiling tool (SQL Profiler for SQL Server) and performance tune from there. If you don't have the expertise or knowledge to do that ahead of time, then try for a (hopefully limited) release of the application, collect the performance metrics, and see where you need to improve performance and figure out what indexes will help.
If you do things right, you should be able to capture the "typical" profile of how the database is accessed and you can then rerun that over and over again on a test server as you try various indexing approaches.
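A full replay of a captured workload is beyond a short example, but a lighter per-query check is possible with the session statistics settings in SQL Server. The sketch below assumes the product_bill_of_materials table from the question already exists and holds realistic test data; the sample value 7 is arbitrary.

```sql
-- Report logical/physical reads and CPU/elapsed time for each statement.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Run the query under test; 7 is an arbitrary sample value.
SELECT parent_product_id, quantity
FROM product_bill_of_materials
WHERE child_product_id = 7;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

Running the same statement before and after adding a candidate index, and comparing the logical reads shown in the Messages output, gives a quick indication of whether that index helps this particular query.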
In your case I would probably just put a clustered PK on (parent_id, child_id) to start with and then add the non-clustered index only if I saw a performance problem that would be helped by it.
Answered by dkretz
"What you query on most often" is not necessarily the best reason to choose an index for clustering. What matters most is what you query on to obtain multiple rows. Clustering is the strategy appropriate for making it efficient to obtain multiple rows in the fewest number of disk reads.
The best example is sales history for a customer.
Say you have two indexes on the Sales table: one on the primary key, and one on Customer (and maybe date, but the point applies either way). If you query the table most often on CustomerID, then you'll want all of a customer's Sales records stored together, so that one or two disk reads fetch all of them.
The primary key, OTOH, might be a surrogate key or SalesId, but it is a unique value in any case. If this were clustered, it would be of no benefit compared to a normal unique index.
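The sales-history example could look roughly like this in T-SQL. The answer only names Sales, CustomerID and SalesId; the remaining columns and all constraint and index names are assumptions added for illustration.

```sql
-- Hypothetical schema for the sales-history example.
CREATE TABLE Sales
(
    SalesId    int      NOT NULL,
    CustomerID int      NOT NULL,
    SaleDate   datetime NOT NULL,  -- assumed column
    Amount     money    NOT NULL,  -- assumed column
    -- The key stays unique but is deliberately not the clustered index.
    CONSTRAINT PK_Sales PRIMARY KEY NONCLUSTERED (SalesId)
);

-- Cluster on the columns used to fetch many rows at once.
CREATE CLUSTERED INDEX CIX_Sales_Customer
    ON Sales (CustomerID, SaleDate);

-- Rows for one customer are adjacent in the clustered index,
-- so this reads a small contiguous range. 1001 is a sample value.
SELECT SalesId, SaleDate, Amount
FROM Sales
WHERE CustomerID = 1001;
```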
EDIT: Let's take this particular table for discussion - it will reveal yet more subtleties.
The "natural" primary key is in all likelihood parentid + childid. But in what sequence? Parentid + childid is no more unique than childid + parentid. For clustering purposes, which ordering is more appropriate? One would assume it must be parentid + childid, since we will want to ask: "For a given item, what are its constituents?" But it is not that unlikely that we will want to go the other way and ask: "For a given constituent, of what items is it a component?"
Add in the consideration of "covering indexes", which contain, within the index, all the information needed to satisfy the query. If that's true, then you never need to read the rest of the record; so clustering is of no benefit; just reading the index is sufficient. (BTW, that means two indexes on the same pair of fields, in opposite order; which may be the proper thing to do in cases like this. Or at least a composite index on one, and a single-field index on the other.)
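One way to sketch the "same pair of fields, in opposite order" idea in T-SQL is shown below. It assumes the table from the question already has its primary key on (parent_product_id, child_product_id); the index name is hypothetical, and the INCLUDE clause is an addition of this sketch (available from SQL Server 2005) so that quantity can also be read from the index.

```sql
-- Second index on the same pair, in the opposite column order.
-- INCLUDE (quantity) is an assumption of this sketch: it lets queries that
-- also need quantity be answered from the index alone.
CREATE UNIQUE NONCLUSTERED INDEX UX_product_bill_of_materials_child_parent
    ON product_bill_of_materials (child_product_id, parent_product_id)
    INCLUDE (quantity);

-- "For a given constituent, of what items is it a component?"
-- Every column referenced here is present in the index above.
SELECT parent_product_id, quantity
FROM product_bill_of_materials
WHERE child_product_id = 7;  -- arbitrary sample value
```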
But that still doesn't dictate which should be clustered; in the end, that would probably be determined by which queries will, in fact, need to grab the record for the Quantity field.
Even for such a clear example, in principle it's best to leave deciding about other indexes until you can test them with realistic data (obviously before production); asking here for speculation is pointless. Testing will always give you the proper answer.
Forget worrying about slowing down inserts until you have a problem (which in most cases will never happen) and can test to make sure that giving up useful indexes brings a measurable benefit.
Things still aren't certain, though, because junction tables like this one are also frequently involved in lots of other types of queries. So I'd just pick one and test as needed, as the application gels and data volume for testing becomes available.
BTW, I'd expect it to end up with a PK on parentid + childid and a non-unique index on childid, with the first of these clustered. If you prefer a surrogate PK, then you'll still want a unique index on parentid + childid, clustered. Clustering the surrogate key is very unlikely to be optimal.
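The surrogate-key variant might be sketched like this. The table name, the bill_of_materials_id column and all constraint and index names are hypothetical; a separate table name is used so the sketch does not clash with the table from the question.

```sql
-- Variant with a surrogate key; all names here are hypothetical.
CREATE TABLE product_bill_of_materials_surrogate
(
    bill_of_materials_id int IDENTITY(1, 1) NOT NULL,
    parent_product_id    int NOT NULL,
    child_product_id     int NOT NULL,
    quantity             int NOT NULL,
    -- The surrogate PK is explicitly non-clustered.
    CONSTRAINT PK_product_bill_of_materials_surrogate
        PRIMARY KEY NONCLUSTERED (bill_of_materials_id)
);

-- The natural key stays unique and carries the clustering.
CREATE UNIQUE CLUSTERED INDEX UCX_bom_surrogate_parent_child
    ON product_bill_of_materials_surrogate (parent_product_id, child_product_id);

-- Non-unique index for lookups by child.
CREATE NONCLUSTERED INDEX IX_bom_surrogate_child
    ON product_bill_of_materials_surrogate (child_product_id);
```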
Answered by Mitchel Sellers
The real question here is: what will you be querying on the most? If you will be looking for both values all the time, then the clustered index should be on the pair. If you are going to query more heavily on one or the other, you would want the clustered index on that specific one.
Answered by ScottStonehouse
Since you say "I'm considering a composite primary key" - there still might be time to change your mind. I've used many composite keys and I keep finding reasons to wish I hadn't. Maybe others will disagree with me.
I agree with Mitchel's answer: the clustered index goes on whatever you will query on most often.
Answered by Eric Sabine
I'd like to zero in on your last statement: "I believe this table is going to get hit much more on reads than writes." If this is the case, then you may want to go index-heavy. The reason we don't go index-heavy on everything is that you pay performance penalties for updates and inserts to the table. When we have tables that serve more reads than writes, we pay the price for the indexes.
As for what to cluster, you should think about how the table will best be used. If your table is subject to a lot of range queries (WHERE col1 BETWEEN a AND b), then cluster the table so that the rows those queries read are already stored in order on disk. In SQL Server we sometimes get the clustered index for free with the PK, and we forget to ask what is best to cluster on to begin with.
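For the table in the question, a range query of that kind and a quick check of what is currently clustered might look like the sketch below. It assumes product_bill_of_materials exists; the range bounds are arbitrary sample values. In SQL Server a PRIMARY KEY constraint becomes the clustered index by default when the table has no clustered index yet and NONCLUSTERED is not specified, which is the "for free" case.

```sql
-- Range query: benefits when the clustered index leads with parent_product_id,
-- because the matching rows are then stored next to each other.
SELECT parent_product_id, child_product_id, quantity
FROM product_bill_of_materials
WHERE parent_product_id BETWEEN 100 AND 200
ORDER BY parent_product_id, child_product_id;

-- List the table's indexes to see which one (if any) is CLUSTERED
-- and whether it came from the primary key.
SELECT name, type_desc, is_primary_key
FROM sys.indexes
WHERE object_id = OBJECT_ID('product_bill_of_materials');
```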
As for the FK constraints on the table, since you said more reads than writes, this may be acceptable. If this were a table with a lot of inserts, each FK constraint would require validation against the parent table, and that might not give you the performance you desire.
Great question.