Disclaimer: this page is a Chinese-English translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/5973850/

Time: 2020-09-01 10:30:09  Source: igfitidea

Can I optimize a SELECT DISTINCT x FROM hugeTable query by creating an index on column x?

sql sql-server-2005 tsql indexing query-optimization

Asked by polygenelubricants

I have a huge table in which some column x has far fewer distinct values than there are rows (fewer by orders of magnitude).

I need to do a query like SELECT DISTINCT x FROM hugeTable, and I want to do this relatively fast.

I did something like CREATE INDEX hugeTable_by_x ON hugeTable(x), but for some reason, even though the output is small, the query is not as fast as I expected. The query plan shows that 97% of the time is spent on an Index Scan of hugeTable_by_x, with an estimated number of rows equal to the size of the entire table. This is followed by, among other things, a Hash Match operation.

Since I created an index on column x, can I not expect this query to run very quickly?

Note that I'm using Microsoft SQL Server 2005.

Answered by Remus Rusanu

This is likely not a problem of indexing, but one of data design. Normalization, to be precise. The fact that you need to query distinct values of a field, and are even willing to add an index for it, is a strong indicator that the field should be normalized into a separate table with a (small) join key. Then the distinct values will be available immediately by scanning the much smaller lookup table.
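As a rough illustration of the normalization described above, here is a minimal sketch. The names dbo.xLookup, x_id, and FK_hugeTable_xLookup, the dbo schema, and the VARCHAR(100) type are all assumptions made for the example, not details from the question.

```sql
-- Hypothetical lookup table; names and the column type are assumptions.
CREATE TABLE dbo.xLookup (
    x_id INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    x    VARCHAR(100) NOT NULL UNIQUE
);

-- One-time population; this still scans hugeTable once.
INSERT INTO dbo.xLookup (x)
SELECT DISTINCT x
FROM dbo.hugeTable
WHERE x IS NOT NULL;

-- hugeTable would then carry the small key instead of the value itself.
ALTER TABLE dbo.hugeTable ADD x_id INT NULL;
GO

UPDATE h
SET h.x_id = l.x_id
FROM dbo.hugeTable AS h
JOIN dbo.xLookup AS l ON l.x = h.x;

ALTER TABLE dbo.hugeTable
    ADD CONSTRAINT FK_hugeTable_xLookup
    FOREIGN KEY (x_id) REFERENCES dbo.xLookup (x_id);

-- The distinct values now come from the small table.
SELECT x FROM dbo.xLookup;
```

One caveat of this design: the lookup table can keep values that no row in hugeTable references any more, so the final query returns the known values rather than strictly the values currently in use.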

Update
As a workaround, you can create an indexed view on an aggregate by the 'distinct' field. COUNT_BIG is an aggregate that is allowed in indexed views:

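-- Assumes hugeTable lives in the dbo schema; schemabinding requires two-part names.
create view dbo.vwDistinct
with schemabinding
as select x, count_big(*) as row_count  -- every view column needs a name
from dbo.hugeTable
group by x;
go

-- The first index on a view must be unique and clustered.
create unique clustered index cdxDistinct on dbo.vwDistinct(x);

select x from dbo.vwDistinct with (noexpand);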

Answered by Martin Smith

SQL Server does not implement any facility to seek directly to the next distinct value in an index, skipping duplicates along the way.

If you have many duplicates then you may be able to use a recursive CTE to simulate this. The technique comes from the article "Super-fast DISTINCT using a recursive CTE". For example:

with recursivecte as (
  select min(t.x) as x
  from hugetable t
  union all
  select ranked.x
  from (
    select t.x,
           row_number() over (order by t.x) as rnk
    from hugetable t
    join recursivecte r
      on r.x < t.x
  ) ranked
  where ranked.rnk = 1
)
select *
from recursivecte
option (maxrecursion 0)

Answered by Denis de Bernardy

If you know the values in advance and there is an index on column x (or if each value is likely to appear quickly on a sequential scan of the whole table), it is much faster to query each one individually:

-- [values] stands for whatever table or list holds the known candidate values
select vals.x
from [values] as vals (x)
where exists (select 1 from hugeTable where hugeTable.x = vals.x);

Proceeding using exists() will do as many index lookups as there are valid values.
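For SQL Server 2005 specifically, the candidate list could be supplied through a table variable. The following is only a sketch: @vals, its sample contents, the dbo schema, and the VARCHAR(100) type are assumptions for illustration, and the optimizer is not guaranteed to pick index seeks.

```sql
-- @vals and its contents are hypothetical; the column type is assumed.
DECLARE @vals TABLE (x VARCHAR(100) NOT NULL PRIMARY KEY);

INSERT INTO @vals (x) SELECT 'red';
INSERT INTO @vals (x) SELECT 'green';
INSERT INTO @vals (x) SELECT 'blue';

-- Ideally one probe of hugeTable_by_x per candidate value,
-- instead of a scan over every row of hugeTable.
SELECT v.x
FROM @vals AS v
WHERE EXISTS (SELECT 1 FROM dbo.hugeTable AS h WHERE h.x = v.x);
```

Only the candidates that actually occur in hugeTable are returned, so the result matches SELECT DISTINCT x as long as the candidate list is complete.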

The way you've written it (which is correct if the values are not known in advance), the query engine will need to read the whole table and hash aggregate the mess to extract the values. (Which makes the index useless.)

Answered by crokusek

No. But there are some workarounds (excluding normalization):

Once the index is in place, it's possible to implement in SQL what the optimizer could be doing automatically:

https://stackoverflow.com/a/29286754/538763 (multiple workarounds cited)

Other answers say you can normalize, which would solve your issue. But even once the data is normalized, SQL Server still likes to perform a scan to find the max() within each group. Workarounds:

https://dba.stackexchange.com/questions/48848/efficiently-query-max-over-multiple-ranges?rq=1
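One commonly used pattern for the max-within-group case is a TOP (1) probe per group through CROSS APPLY. This is a sketch only, and not necessarily what the linked posts describe. It assumes a hypothetical lookup table dbo.xLookup(x), a hypothetical column y, and an index on hugeTable(x, y).

```sql
-- Hypothetical schema: dbo.xLookup(x) lists the distinct values,
-- dbo.hugeTable has a column y and an index on (x, y).
SELECT l.x, m.max_y
FROM dbo.xLookup AS l
CROSS APPLY (
    -- With an index on (x, y) this can be answered by a single seek per x.
    SELECT TOP (1) h.y AS max_y
    FROM dbo.hugeTable AS h
    WHERE h.x = l.x
    ORDER BY h.y DESC
) AS m;
```

Values of x with no matching rows in hugeTable are dropped by CROSS APPLY; OUTER APPLY would keep them with a NULL max_y.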

Answered by Srinivasarao Kotipatruni

If your column x has low cardinality, creating a local bitmap index would increase the performance many fold.

Answered by Jerad Rose

When doing a SELECT DISTINCT on an indexed field, an index scan makes sense, as execution still has to scan each value in the index for the entire table (assuming no WHERE clause, as seems to be the case in your example).

Indexes usually have more of an impact on WHERE conditions, JOINs, and ORDER BY clauses.
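To illustrate the contrast, a query with a range predicate on x gives the optimizer a chance to seek into hugeTable_by_x and read only part of the index. In this sketch the variable names, the bounds, the dbo schema, and the VARCHAR(100) type are assumptions.

```sql
-- Placeholder bounds; the type of x is assumed.
DECLARE @lo VARCHAR(100), @hi VARCHAR(100);
SET @lo = 'a';
SET @hi = 'b';

-- With a predicate on x, only the matching slice of the index needs to be read.
SELECT DISTINCT x
FROM dbo.hugeTable
WHERE x >= @lo AND x < @hi;
```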

Answered by Markus Winand

As per your description of the execution plan, I would believe it's the best possible execution.

The Index Scan reads the entire index as stored (not in index order), and the Hash Match then removes the duplicates.
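To check this on your own data, you could compare the default plan with one that aggregates in index order. The sketch below assumes the table is in the dbo schema. Either way the whole index is read, so a large speed-up should not be expected.

```sql
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Default plan choice (the question reports Index Scan + Hash Match).
SELECT DISTINCT x FROM dbo.hugeTable;

-- Ask for an order-based aggregate instead of a hash aggregate.
SELECT DISTINCT x FROM dbo.hugeTable OPTION (ORDER GROUP);

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```

The logical reads reported for both statements show how much of the index each plan touches.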

There might be other ways around your problem. In SQL Server, indexed views come to mind. However, they might slow down writes on that table considerably.

Answered by Oded

Possibly. Though it is not guaranteed - it entirely depends on the query.

I suggest reading this article by Gail Shaw (part 1 and part 2).
