postgresql 用于“不相等”搜索的 SQL 索引
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/2864267/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
SQL indexes for "not equal" searches
提问by bortzmeyer
The SQL index allows to find quickly a string which matches my query. Now, I have to search in a big table the strings which do notmatch. Of course, the normal index does not help and I have to do a slow sequential scan:
SQL 索引允许快速找到与我的查询匹配的字符串。现在,我必须在一个大表中搜索不匹配的字符串。当然,普通的索引也无济于事,我不得不做一个缓慢的顺序扫描:
essais=> \d phone_idx
Index "public.phone_idx"
Column | Type
--------+------
phone | text
btree, for table "public.phonespersons"
essais=> EXPLAIN SELECT person FROM PhonesPersons WHERE phone = '+33 1234567';
QUERY PLAN
-------------------------------------------------------------------------------
Index Scan using phone_idx on phonespersons (cost=0.00..8.41 rows=1 width=4)
Index Cond: (phone = '+33 1234567'::text)
(2 rows)
essais=> EXPLAIN SELECT person FROM PhonesPersons WHERE phone != '+33 1234567';
QUERY PLAN
----------------------------------------------------------------------
Seq Scan on phonespersons (cost=0.00..18621.00 rows=999999 width=4)
Filter: (phone <> '+33 1234567'::text)
(2 rows)
I understand (see Mark Byers' very good explanations) that PostgreSQL can decide not to use an index when it sees that a sequential scan would be faster (for instance if almost all the tuples match). But, here, "not equal" searches are really slower.
我理解(参见 Mark Byers 的非常好的解释)当 PostgreSQL 看到顺序扫描会更快时(例如,如果几乎所有元组都匹配),它可以决定不使用索引。但是,在这里,“不等于”搜索真的很慢。
Any way to make these "is not equal to" searches faster?
有什么方法可以使这些“不等于”搜索更快?
Here is another example, to address Mark Byers' excellent remarks. The index is used for the '=' query (which returns the vast majority of tuples) but not for the '!=' query:
这是另一个例子,以解决 Mark Byers 的精彩评论。索引用于 '=' 查询(返回绝大多数元组),但不用于 '!=' 查询:
essais=> \d tld_idx
Index "public.tld_idx"
Column | Type
-----------------+------
pg_expression_1 | text
btree, for table "public.emailspersons"
essais=> EXPLAIN ANALYZE SELECT person FROM EmailsPersons WHERE tld(email) = 'fr';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Index Scan using tld_idx on emailspersons (cost=0.25..4010.79 rows=97033 width=4) (actual time=0.137..261.123 rows=97110 loops=1)
Index Cond: (tld(email) = 'fr'::text)
Total runtime: 444.800 ms
(3 rows)
essais=> EXPLAIN ANALYZE SELECT person FROM EmailsPersons WHERE tld(email) != 'fr';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Seq Scan on emailspersons (cost=0.00..27129.00 rows=2967 width=4) (actual time=1.004..1031.224 rows=2890 loops=1)
Filter: (tld(email) <> 'fr'::text)
Total runtime: 1037.278 ms
(3 rows)
DBMS is PostgreSQL 8.3 (but I can upgrade to 8.4).
DBMS 是 PostgreSQL 8.3(但我可以升级到 8.4)。
采纳答案by araqnid
Possibly it would help to write:
可能它会有助于写:
SELECT person FROM PhonesPersons WHERE phone < '+33 1234567'
UNION ALL
SELECT person FROM PhonesPersons WHERE phone > '+33 1234567'
or simply
或者干脆
SELECT person FROM PhonesPersons WHERE phone > '+33 1234567'
OR phone < '+33 1234567'
PostgreSQL should be able to determine that the selectivity of the range operation is very high and to consider using an index for it.
PostgreSQL 应该能够确定范围操作的选择性非常高,并考虑为其使用索引。
I don't think it can use an index directly to satisfy a not-equals predicate, although it would be nice if it could try re-writing the not-equals as above (if it helps) during planning. If it works, suggest it to the developers ;)
我不认为它可以直接使用索引来满足不等于谓词,尽管如果它可以在计划期间尝试重写上面的不等于(如果有帮助的话)会很好。如果可行,请将其推荐给开发人员;)
Rationale: searching an index for all values not equal to a certain one requires scanning the full index. By contrast, searching for all elements less than a certain key means finding the greatest non-matching item in the tree and scanning backwards. Similarly, searching for all elements greater than a certain key in the opposite direction. These operations are easy to fulfill using b-tree structures. Also, the statistics that PostgreSQL collects should be able to point out that "+33 1234567" is a known frequent value: by removing the frequency of those and nulls from 1, we have the proportion of rows left to select: the histogram bounds will indicate whether those are skewed to one side or not. But if the exclusion of nulls and that frequent value pushes the proportion of rows remaining low enough (Istr about 20%), an index scan should be appropriate. Check the stats for the column in pg_stats to see what proportion it's actually calculated.
基本原理:为不等于某个值的所有值搜索索引需要扫描完整索引。相比之下,搜索小于某个键的所有元素意味着找到树中最大的不匹配项并向后扫描。同理,反方向搜索所有大于某个key的元素。使用 b 树结构很容易实现这些操作。另外,PostgreSQL 收集的统计数据应该可以指出“+33 1234567”是一个已知的频繁值:通过从 1 中去除那些和空值的频率,我们有剩余行的比例可供选择:直方图边界将指示那些是否偏向一侧。但是如果排除空值和那个频繁值使得行的比例保持足够低(Istr 大约 20%),索引扫描应该是合适的。检查 pg_stats 中列的统计数据,看看它实际计算的比例是多少。
Update: I tried this on a local table with a vaguely similar distribution, and both forms of the above produced something other than a plain seq scan. The latter (using "OR") was a bitmap scan that may actually devolve to just being a seq scan if the bias towards your common value is particularly extreme... although the planner can see that, I don't think it will automatically rewrite to an "Append(Index Scan,Index Scan)" internally. Turning "enable_bitmapscan" off just made it revert to a seq scan.
更新:我在一个分布模糊的本地表上尝试了这个,上面的两种形式都产生了不同于普通 seq 扫描的东西。后者(使用“OR”)是位图扫描,如果对您的共同价值的偏见特别极端,它实际上可能会转变为仅作为 seq 扫描......虽然计划者可以看到这一点,但我认为它不会自动在内部重写为“Append(Index Scan,Index Scan)”。关闭“enable_bitmapscan”只会让它恢复到 seq 扫描。
PS: indexing a text column and using the inequality operators can be an issue, if your database location is not C. You may need to add an extra index that uses text_pattern_ops or varchar_pattern_ops; this is similar to the problem of indexing for column LIKE 'prefix%'
predicates.
PS:索引文本列并使用不等式运算符可能是一个问题,如果您的数据库位置不是 C。您可能需要添加一个使用 text_pattern_ops 或 varchar_pattern_ops 的额外索引;这类似于column LIKE 'prefix%'
谓词索引问题。
Alternative: you could create a partial index:
替代方案:您可以创建部分索引:
CREATE INDEX PhonesPersonsOthers ON PhonesPersons(phone) WHERE phone <> '+33 1234567'
this will make the <>
-using select statement just scan through that partial index: since it excludes most of the entries in the table, it should be small.
这将使<>
-using select 语句只扫描该部分索引:因为它排除了表中的大部分条目,所以它应该很小。
回答by Mark Byers
The database is able use the index for this query, but it chooses not to because it would be slower. Update: This is not quite right: you have to rewrite the query slightly. See Araqnid's answer.
数据库能够使用该查询的索引,但它选择不使用,因为它会更慢。更新:这不太正确:您必须稍微重写查询。参见 Araqnid 的回答。
Your where clause selects almost all rows in your table (rows = 999999). The database can see that a table scan would be faster in this case and therefore ignores the index. It is faster because the column person
is not in your index so it would have to make two lookups for each row, once in the index to check the WHERE clause, and then again in the main table to fetch the column person
.
您的 where 子句选择表中的几乎所有行(行 = 999999)。数据库可以看到在这种情况下表扫描会更快,因此会忽略索引。它更快,因为该列person
不在您的索引中,因此必须对每一行进行两次查找,一次在索引中检查 WHERE 子句,然后再次在主表中获取该列person
。
If you had a different type of data where there were most values were foo
and just a few were bar
and you said WHERE col <> 'foo'
then it probably would use the index.
如果您有不同类型的数据,其中有大多数值foo
,只有少数值,bar
并且您WHERE col <> 'foo'
说它可能会使用索引。
Any way to make these "is not equal to" searches faster?
有什么方法可以使这些“不等于”搜索更快?
Any query that selects almost 1 million rows is going to be slow. Try adding a limit clause.
任何选择近 100 万行的查询都会很慢。尝试添加限制条款。