Original source: http://stackoverflow.com/questions/120731/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
SQL Server - Partitioned Tables vs. Clustered Index?
Asked by David Kreps
Let's assume you have one massive table with three columns as shown below:
[id] INT NOT NULL,
[date] SMALLDATETIME NOT NULL,
[sales] FLOAT NULL
Also assume you are limited to one physical disk and one filegroup (PRIMARY). You expect this table to hold sales for 10,000,000+ ids, across hundreds of dates (easily 1B+ records).
As with many data warehousing scenarios, the data will typically grow sequentially by date (i.e., each time you perform a data load, you will be inserting new dates, and maybe updating some of the more recent dates of data). For analytic purposes, the data will often be queried and aggregated for a random set of ~10,000 ids which will be specified via a join with another table. Often, these queries don't specify date ranges, or specify very wide date ranges, which leads me to my question: What is the best way to index / partition this table?
I have thought about this for a while, but am stuck with conflicting solutions:
Option #1: As data will be loaded sequentially by date, define the clustered index (and primary key) as [date], [id]. Also create a "sliding window" partitioning function / scheme on date, allowing rapid movement of new data in / out of the table. Potentially create a non-clustered index on id to help with querying.
Expected Outcome #1: This setup will be very fast for data loading purposes, but sub-optimal when it comes to analytic reads because, in a worst-case scenario (no limiting by dates, unlucky with the set of ids queried), 100% of the data pages may be read.
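Option #1 can be sketched roughly as follows. This is a minimal illustration, not a tuned design: the object names (pf_SalesDate, ps_SalesDate, dbo.Sales, dbo.Sales_Staging), the monthly boundaries, and the dates are all assumptions made up for the example. GO is the batch separator used by SSMS and sqlcmd.

```sql
-- Hypothetical names and monthly boundaries, chosen only for illustration.
CREATE PARTITION FUNCTION pf_SalesDate (SMALLDATETIME)
AS RANGE RIGHT FOR VALUES ('20080101', '20080201', '20080301');

-- The question allows a single filegroup, so every partition maps to PRIMARY.
CREATE PARTITION SCHEME ps_SalesDate
AS PARTITION pf_SalesDate ALL TO ([PRIMARY]);

CREATE TABLE dbo.Sales
(
    [id]    INT           NOT NULL,
    [date]  SMALLDATETIME NOT NULL,
    [sales] FLOAT         NULL,
    CONSTRAINT PK_Sales PRIMARY KEY CLUSTERED ([date], [id])
) ON ps_SalesDate ([date]);

-- No ON clause, so the index is partitioned the same way as the table.
CREATE NONCLUSTERED INDEX IX_Sales_id ON dbo.Sales ([id]);

-- Staging table for the next month: same columns and indexes, same filegroup,
-- plus a CHECK constraint proving its rows fit the target partition.
CREATE TABLE dbo.Sales_Staging
(
    [id]    INT           NOT NULL,
    [date]  SMALLDATETIME NOT NULL,
    [sales] FLOAT         NULL,
    CONSTRAINT PK_Sales_Staging PRIMARY KEY CLUSTERED ([date], [id]),
    CONSTRAINT CK_Sales_Staging_date CHECK ([date] >= '20080401')
) ON [PRIMARY];

CREATE NONCLUSTERED INDEX IX_Sales_Staging_id ON dbo.Sales_Staging ([id]);
GO

-- Slide the window forward: add a boundary, then switch the loaded staging
-- table into the new (empty) partition as a metadata-only operation.
ALTER PARTITION SCHEME ps_SalesDate NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pf_SalesDate() SPLIT RANGE ('20080401');
GO

DECLARE @target INT = $PARTITION.pf_SalesDate('20080401');
ALTER TABLE dbo.Sales_Staging SWITCH TO dbo.Sales PARTITION @target;
```

Removing the oldest data would follow the mirror-image pattern: switch the oldest partition out to an empty table, then MERGE RANGE on its boundary.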
Option #2: As the data will be queried for only a small subset of ids at a time, define the clustered index (and primary key) as [id], [date]. Do not bother to create a partitioned table.
Expected Outcome #2: I expect a huge performance hit when loading data, as we can no longer quickly limit by date. I expect a huge performance benefit for my analytic queries, as this layout will minimize the number of data pages read.
Option #3: Clustered index (and primary key) as follows: [id], [date]; "sliding window" partition function / scheme on date.
Expected Outcome #3: Not sure what to expect. Given that the first column in the clustered index is [id] and thus (it is my understanding) the data is arranged by ID, I would expect good performance from my analytic queries. However, the data is partitioned by date, which is contrary to the definition of the clustered index (but still aligned, as date is part of the index). I haven't found much documentation that speaks to this scenario and what, if any, performance benefits I may get from this, which brings me to my final, bonus question:
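For reference, Option #3 might look like the sketch below. All names and boundary dates are hypothetical. In a layout like this, each partition holds its own B-tree ordered by ([id], [date]), so the rows for a given id are contiguous only within a partition, and a query with no date filter has to visit every partition.

```sql
-- Hypothetical names and boundary dates, for illustration only.
CREATE PARTITION FUNCTION pf_Option3Date (SMALLDATETIME)
AS RANGE RIGHT FOR VALUES ('20080101', '20080201', '20080301');

CREATE PARTITION SCHEME ps_Option3Date
AS PARTITION pf_Option3Date ALL TO ([PRIMARY]);

-- The key leads with [id], but the table is split by [date].
-- A partitioned unique index must contain the partitioning column,
-- which ([id], [date]) does.
CREATE TABLE dbo.SalesOption3
(
    [id]    INT           NOT NULL,
    [date]  SMALLDATETIME NOT NULL,
    [sales] FLOAT         NULL,
    CONSTRAINT PK_SalesOption3 PRIMARY KEY CLUSTERED ([id], [date])
) ON ps_Option3Date ([date]);

-- Row counts per partition of the clustered index (index_id = 1).
SELECT partition_number, rows
FROM sys.partitions
WHERE object_id = OBJECT_ID('dbo.SalesOption3')
  AND index_id = 1
ORDER BY partition_number;
```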
If I am creating a table on one filegroup on one disk, with a clustered index on one column, is there any benefit (besides partition switching when loading the data) that comes from defining a partition on the same column?
Answered by Amy B
This table is awesomely narrow. If the real table will be this narrow, you should be happy to have table scans instead of index->lookups.
I would do this:
CREATE TABLE Narrow
(
[id] INT NOT NULL,
[date] SMALLDATETIME NOT NULL,
[sales] FLOAT NULL,
PRIMARY KEY(id, date) --EDIT, just noticed your id is not unique.
)
CREATE INDEX CoveringNarrow ON Narrow(date, id, sales)
This handles point queries with seeks and wide-range queries with limited scans against date criteria and id criteria. There is no per-record lookup from index. Yes, I've doubled the write time (and space used) but that's fine, imo.
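As a rough illustration of the two access patterns, here are two queries against the Narrow table defined above. The id value and the date range are made up, and which index the optimizer actually chooses depends on statistics and data distribution.

```sql
-- Point query: one id over a narrow date range.
-- The primary key (id, date) allows a seek straight to these rows.
SELECT [date], [sales]
FROM Narrow
WHERE [id] = 42
  AND [date] >= '20080101' AND [date] < '20080201';

-- Wide query: every id within a date range.
-- CoveringNarrow (date, id, sales) allows a range scan with no lookups,
-- because the index carries every column the query needs.
SELECT [id], SUM([sales]) AS total_sales
FROM Narrow
WHERE [date] >= '20080101' AND [date] < '20080201'
GROUP BY [id];
```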
If there's some need for a specific piece of data (and that need is demonstrated by profiling!!), I'd create a clustered view (an indexed view, in SQL Server's terminology) targeting that section of the table.
CREATE VIEW Narrow200801
AS
SELECT * FROM Narrow WHERE '2008-01-01' <= [date] AND [date] < '2008-02-01'
--There is some command that I don't have at my finger tips to make this a clustered view.
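The missing step is most likely a unique clustered index on a schema-bound view, which is how SQL Server materializes an indexed view. A minimal sketch, assuming the Narrow table lives in the dbo schema; the view and index names are hypothetical, and the view is given a new name so it does not collide with the one above:

```sql
-- CREATE VIEW must be the first statement in its batch.
-- SCHEMABINDING requires two-part table names and an explicit column list.
CREATE VIEW dbo.Narrow200801Indexed
WITH SCHEMABINDING
AS
SELECT [id], [date], [sales]
FROM dbo.Narrow
-- Indexed views need deterministic expressions, so the date literals are
-- converted with an explicit style (112 = yyyymmdd).
WHERE [date] >= CONVERT(SMALLDATETIME, '20080101', 112)
  AND [date] <  CONVERT(SMALLDATETIME, '20080201', 112);
GO

-- The unique clustered index is what materializes the view.
CREATE UNIQUE CLUSTERED INDEX CIX_Narrow200801Indexed
ON dbo.Narrow200801Indexed ([id], [date]);
```

Creating the index also requires certain session settings (for example QUOTED_IDENTIFIER and ANSI_NULLS set to ON). Automatic use of the view by the optimizer depends on the edition; outside Enterprise edition, queries generally have to reference the view by name with the NOEXPAND hint.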
Clustered views can be used in queries by name, or the optimizer will choose to use the clustered views when the FROM and WHERE clause are appropriate. For example, this query will use the clustered view. Note that the base table is referred to in the query.
SELECT SUM(sales) FROM Narrow WHERE '2008-01-01' <= [date] AND [date] < '2008-02-01'
Just as an index lets you make specific columns conveniently accessible, a clustered view lets you make specific rows conveniently accessible.
Answered by ConcernedOfTunbridgeWells
A clustered index will give you performance benefits for queries when localising the I/O. Date is a traditional partitioning strategy as many D/W queries look at movements by date.
A rule of thumb for a partitioned table suggests that partitions should be around 10m rows in size.
It would be somewhat unusual to see much performance gain from a clustered index on a diverse analytic workload. The query optimiser can use a technique called 'Index Intersection' to select rows without even hitting the fact table. See here for a post I did on another question that explains this in more depth with some links. A clustered index may or may not participate in the index intersection, so you may find that it gains you relatively little on a general query workload.
You may find circumstances in loading where clustered indexes give you some gain, particularly if you have derived calculations (such as Earned Premium) that are computed within the ETL process. In this case you may get some benefits. If you have a specific query that you know will be executed all the time it might make sense to use clustered indexes for this. Options #2 and #3 are only going to significantly benefit you if you expect this type of query to be the overwhelming majority of the work done by the application.
For a flexible system, a simple date range partition with an index on the ID (and date, if the partitions hold a range) would probably get you as good a performance as any. You might get some benefit from clustering the index in limited circumstances. You might also get some mileage from building a cube over the data and ensuring that the aggregations are set up correctly for this query.
Answered by GateKiller
I would do the following:
- Non-Clustered Index on [Id]
- Clustered Index on [Date]
- Convert the [sales] datatype to numeric instead of float
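The three suggestions above could be written out roughly as follows. The table and index names are hypothetical, and NUMERIC(18, 2) is an assumed precision and scale that would need to match the real sales data.

```sql
-- Hypothetical table name; NUMERIC(18, 2) is an assumption, not from the question.
CREATE TABLE dbo.SalesByDate
(
    [id]    INT            NOT NULL,
    [date]  SMALLDATETIME  NOT NULL,
    [sales] NUMERIC(18, 2) NULL   -- exact numeric instead of FLOAT
);

-- Clustered index on [date]: new loads append at the end of the index.
CREATE CLUSTERED INDEX CIX_SalesByDate_date ON dbo.SalesByDate ([date]);

-- Non-clustered index on [id] to support lookups by id.
CREATE NONCLUSTERED INDEX IX_SalesByDate_id ON dbo.SalesByDate ([id]);
```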
Answered by Biri
If you are using the partitions in the select statements, then you can gain some speed.
If you are not using it, only using "standard" selects, then you have no benefit.
On your original problem: I would recommend option #1, with the non-clustered index on id included.
Answered by Thomas Wagner
Partition the table by date. Several horizontal partitions will be more performant than one large table with that many rows.
Answered by Mladen
A clustered index on the date column isn't good if rows will be inserted faster than the datetime resolution of 3.33 ms. If they are, you'll get two keys with the same value, and the index will need an internal uniquifier, which increases its size.
I'd go with option #2.