Oracle: Index over a column with only 5 distinct values - Worth it?
Original question: http://stackoverflow.com/questions/1879626/
Note: this page is derived from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the original URL, and attribute it to the original authors on Stack Overflow.
Asked by svrist
I have a table that could grow to as many as 5.000.000 rows. One of the columns in this table is used alone in queries, but there are only 5 possible values for this column. Currently I have 10.000 rows, and according to the explain plan it makes no sense to use my index on that column.
Will it ever make sense, or shouldn't I bother with an index?
Edit: These are the two explain plans at the moment. Without the index: http://img706.imageshack.us/img706/1903/noindex.png vs. with the index forced via a hint: http://img692.imageshack.us/img692/8205/indexp.png
Accepted answer by David Aldridge
It depends on a couple of things.
Firstly, the distribution of values. If you only have five distinct values but one of them accounts for 99.9999% of rows in the table, then obviously you would not want the optimiser to use the index for that value, but you might want it to use it for the others. In some cases like this it's worth using a function-based index to ensure that you only index the values of interest and not the ones that are just taking up space.
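A minimal sketch of the function-based index idea, under stated assumptions: the table name mytable, the index name mytable_rare_freq_ix, and the choice of 3 as the dominant value are all hypothetical and do not come from the question. It relies on the fact that an Oracle B-tree index stores no entry when the whole key is NULL, so mapping the common value to NULL keeps it out of the index.

```sql
-- Hypothetical names: mytable, frequencyid, mytable_rare_freq_ix.
-- Assumption: 3 is the dominant value that is not worth indexing.
CREATE INDEX mytable_rare_freq_ix
  ON mytable (CASE WHEN frequencyid <> 3 THEN frequencyid END);

-- The query has to repeat the indexed expression for the index to be considered.
SELECT *
  FROM mytable
 WHERE CASE WHEN frequencyid <> 3 THEN frequencyid END = 1;
```

The trade-off is that every query that should use this index has to be written against the expression, not against the bare column.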
Secondly, are there queries that can be answered using that index without accessing the table?
Note that it's not just the percentage of rows that will be accessed that matters, but the number of blocks of the table that will need to be accessed. For example, if you have a table of 1000 blocks with 30 rows per block on average, and one column has 30 distinct values (each one being present in 1000 rows), then the number of blocks that need to be visited to read every row for a single value varies between 1000/30 ≈ 34 (worth using an index) and 1000 (not worth using an index), depending on how the rows are distributed. This is expressed by the clustering factor of the index -- if its value is close to the number of rows in the table then the index is less likely to be used, and if it's close to the number of blocks then it's more likely to be used.
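To see where an existing index falls between those two extremes, the clustering factor can be read from the data dictionary next to the table's block and row counts. This is only a sketch: MYTABLE is a hypothetical table name, and the figures are meaningful only if optimizer statistics have been gathered recently.

```sql
-- Hypothetical table name MYTABLE; statistics are assumed to be up to date.
SELECT i.index_name,
       i.clustering_factor,   -- near t.blocks: well clustered; near t.num_rows: scattered
       t.blocks,
       t.num_rows
  FROM user_indexes i
  JOIN user_tables  t ON t.table_name = i.table_name
 WHERE i.table_name = 'MYTABLE';
```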
Also, you might look at index compression to see if that saves you space.
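As an illustration only, key compression on a single-column, non-unique index could be tried as below. The index and table names are hypothetical. Whether it actually saves space depends on the data, so the segment size should be compared against an uncompressed build of the same index.

```sql
-- Hypothetical index and table names.
CREATE INDEX mytable_freq_ix ON mytable (frequencyid) COMPRESS 1;

-- Compare the segment size against an uncompressed build of the same index.
SELECT segment_name, bytes, blocks
  FROM user_segments
 WHERE segment_name = 'MYTABLE_FREQ_IX';
```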
Be careful with bitmap indexes -- they are not friendly to systems where they are subject to modification by multiple sessions at the same time (e.g. two people inserting rows into the indexed table at the same time).
A more effective strategy, if you do want to improve the efficiency of queries with predicates on these five values, is to use partitioning. This is partly because of partition pruning in the query, but also because of the improvement in statistics available to the optimiser: when it knows that only one partition will be accessed, it can use partition-level statistics instead of global statistics.
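A sketch of what list partitioning on the five values might look like. The table definition is hypothetical (only frequencyid is implied by the thread; id and othercolumn are assumptions), and partitioning is a separately licensed option of Oracle Enterprise Edition, so it may not be available on every installation.

```sql
-- Hypothetical table; column names other than frequencyid are assumptions.
CREATE TABLE mytable (
  id           NUMBER        NOT NULL,
  frequencyid  NUMBER(1)     NOT NULL,
  othercolumn  VARCHAR2(100)
)
PARTITION BY LIST (frequencyid) (
  PARTITION p_freq1 VALUES (1),
  PARTITION p_freq2 VALUES (2),
  PARTITION p_freq3 VALUES (3),
  PARTITION p_freq4 VALUES (4),
  PARTITION p_freq5 VALUES (5)
);

-- A predicate on the partition key lets the optimiser prune to one partition.
SELECT COUNT(*) FROM mytable WHERE frequencyid = 2;
```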
Answer by Quassnoi
The index will be useful in the following cases:
When you search for infrequent FREQUENCYID values. Like, only 10 of your 10,000,000 rows have FREQUENCYID = 1 and you search for it.
When you do not use other columns except FREQUENCYID in your queries. This query: SELECT FREQUENCYID, COUNT(*) FROM mytable GROUP BY FREQUENCYID will benefit from the index (actually, INDEX FAST FULL SCAN along with HASH AGGREGATE will most probably be used).
When your table rows are large and all columns you use in the query are indexed. This way, all indexes will be joined instead of making a FULL TABLE SCAN. Say, this query: SELECT FREQUENCYID, OTHERCOLUMN FROM mytable WHERE FREQUENCYID = 2 can be performed by joining the values from the indexes on FREQUENCYID and OTHERCOLUMN on ROWID.
Answer by Egor Rogov
Possibly the easiest way is not to guess but to actually try.
But it seems to me that you're comparing execution plans in order to find the best approach. That's not reliable. The optimizer may not have appropriate information to choose the best plan (for example, if you have a nonuniform distribution of values and no histogram on the column). Also, looking at "cost" in the explain plan makes no sense.
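If a missing histogram is the concern, one could be gathered with DBMS_STATS along the following lines. This is a sketch under assumptions: the table name MYTABLE is hypothetical, and the bucket count of 254 is simply a common choice, not something the thread prescribes.

```sql
-- Hypothetical table name; run in SQL*Plus or any PL/SQL-capable client.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => USER,
    tabname    => 'MYTABLE',
    method_opt => 'FOR COLUMNS frequencyid SIZE 254',
    cascade    => TRUE   -- also gather statistics on the table's indexes
  );
END;
/

-- Check whether a histogram now exists on the column.
SELECT column_name, num_distinct, histogram
  FROM user_tab_col_statistics
 WHERE table_name = 'MYTABLE'
   AND column_name = 'FREQUENCYID';
```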
A better way is to compare logical I/Os. Run SQL*Plus, issue set autotrace traceonly, then run your query (with and without the index) and compare the "consistent gets" numbers. The lower, the better.
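The comparison described above might look like this in a SQL*Plus session. The table, column, and index names are hypothetical, and the hints are just one way to obtain the "with" and "without" variants. Autotrace also needs the appropriate privileges (typically the PLUSTRACE role) to show statistics.

```sql
-- SQL*Plus session; table, column, and index names are hypothetical.
SET AUTOTRACE TRACEONLY

-- Variant 1: force a full table scan.
SELECT /*+ FULL(t) */ *
  FROM mytable t
 WHERE t.frequencyid = 2;

-- Variant 2: force the index on frequencyid.
SELECT /*+ INDEX(t mytable_freq_ix) */ *
  FROM mytable t
 WHERE t.frequencyid = 2;

SET AUTOTRACE OFF
```

After each statement, SQL*Plus prints the plan and a statistics section. The "consistent gets" line in that section is the figure to compare between the two variants.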
On the importance of LIOs (logical I/Os), see the article by Cary Millsap.
Answer by Adriaan Stander
If it is going to increase in size as you mention ("up to 5.000.000 rows"), I would recommend creating an index.
Answer by Jeffrey Kemp
Test it with typical queries and see which way is faster.
You may find that a Full Table Scan is faster on average than an Index Range Scan + Table Access by Rowid - in which case Oracle got it right.
On the other hand, perhaps the data is distributed in such a way that, for most of your queries, it's better to use the index - in which case you'll probably want to add the INDEX hint.