PostgreSQL:创建索引以快速区分 NULL 和非 NULL 值
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/31966218/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
PostgreSQL: Create an index to quickly distinguish NULL from non-NULL values
提问by Adam Matan
Consider a SQL query with the following WHERE
predicate:
考虑具有以下WHERE
谓词的 SQL 查询:
...
WHERE name IS NOT NULL
...
Where name
is a textual field in PostgreSQL.
name
PostgreSQL 中的文本字段在哪里。
No other query checks any textual property of this value, just whether it is NULL
or not. Therefore, a full btree indexseems like an overkill, even though it supports this distinction:
没有其他查询会检查该值的任何文本属性,只是检查它是否NULL
存在。因此,完整的 btree 索引似乎有点矫枉过正,尽管它支持这种区别:
Also, an IS NULL or IS NOT NULL condition on an index column can be used with a B-tree index.
此外,索引列上的 IS NULL 或 IS NOT NULL 条件可以与 B 树索引一起使用。
What's the right PostgreSQL index to quickly distinguish NULL
s from non-NULL
s?
什么是快速区分NULL
s 和非NULL
s的正确 PostgreSQL 索引?
采纳答案by jpmc26
I'm interpreting you claim that it's "overkill" in two ways: in terms of complexity (using a B-Tree instead of just a list) and space/performance.
我正在解释你声称它在两个方面“矫枉过正”:在复杂性(使用 B 树而不仅仅是列表)和空间/性能方面。
For complexity, it's not overkill. A B-Tree index is preferable because deletesfrom it will be faster than some kind of "unordered" index (for lack of a better term). (An unordered index would require a full index scan just to delete.) In light of that fact, any gains from an unordered index would be usually be outweighed by the detriments, so the development effort isn't justified.
对于复杂性,这并不过分。B-Tree 索引更可取,因为从中删除将比某种“无序”索引(因为缺少更好的术语)更快。(无序索引需要完整索引扫描才能删除。)鉴于这一事实,从无序索引中获得的任何好处通常会被损害所抵消,因此开发工作是不合理的。
For space and performance, though, if you want a highly selective index for efficiency, you can include a WHERE
clause on an index, as noted in the fine manual:
但是,对于空间和性能,如果您想要一个高度选择性的索引以提高效率,您可以WHERE
在索引上包含一个子句,如精美手册中所述:
CREATE INDEX ON my_table (name) WHERE name IS NOT NULL;
Note that you'll only see benefits from this index if it can allow PostgreSQL to ignore a largeamount of rows when executing your query. E.g., if 99% of the rows have name IS NOT NULL
, the index isn't buying you anything over just letting a full table scan happen; in fact, it would be less efficient (as @CraigRingernotes) since it would require extra disk reads. If however, only 1% of rows have name IS NOT NULL
, then this represents huge savings as PostgreSQL can ignore most of the table for your query. If your table is very large, even eliminating 50% of the rows might be worth it. This is a tuning problem, and whether the index is valuable is going to depend heavily on the size and distribution of the data.
请注意,只有当 PostgreSQL在执行查询时可以忽略大量行时,您才会看到此索引的好处。例如,如果 99% 的行都有name IS NOT NULL
,那么索引不会比让全表扫描发生更重要;事实上,它的效率会降低(如@CraigRinger指出的那样),因为它需要额外的磁盘读取。然而,如果只有 1% 的行有name IS NOT NULL
,那么这代表了巨大的节省,因为 PostgreSQL 可以在查询时忽略大部分表。如果您的表非常大,即使删除 50% 的行也是值得的。这是一个调优问题,索引是否有价值将在很大程度上取决于数据的大小和分布。
Additionally, there is very little gain in terms of space if you still need another index for the name IS NULL
rows. See Craig Ringer's answerfor details.
此外,如果您仍然需要为name IS NULL
行创建另一个索引,那么空间方面的收益将非常小。有关详细信息,请参阅Craig Ringer 的回答。
回答by Craig Ringer
You could use an expression index, but you shouldn't. Keep it simple, and use a plain b-tree.
您可以使用表达式索引,但您不应该使用。保持简单,并使用普通的 b 树。
An expression index can be created on colname IS NOT NULL
:
可以在以下位置创建表达式索引colname IS NOT NULL
:
test=> CREATE TABLE blah(name text);
CREATE TABLE
test=> CREATE INDEX name_notnull ON blah((name IS NOT NULL));
CREATE INDEX
test=> INSERT INTO blah(name) VALUES ('a'),('b'),(NULL);
INSERT 0 3
test=> SET enable_seqscan = off;
SET
craig=> SELECT * FROM blah WHERE name IS NOT NULL;
name
------
a
b
(2 rows)
test=> EXPLAIN SELECT * FROM blah WHERE name IS NOT NULL;
QUERY PLAN
-----------------------------------------------------------------------------
Bitmap Heap Scan on blah (cost=9.39..25.94 rows=1303 width=32)
Filter: (name IS NOT NULL)
-> Bitmap Index Scan on name_notnull (cost=0.00..9.06 rows=655 width=0)
Index Cond: ((name IS NOT NULL) = true)
(4 rows)
test=> SET enable_bitmapscan = off;
SET
test=> EXPLAIN SELECT * FROM blah WHERE name IS NOT NULL;
QUERY PLAN
------------------------------------------------------------------------------
Index Scan using name_notnull on blah (cost=0.15..55.62 rows=1303 width=32)
Index Cond: ((name IS NOT NULL) = true)
Filter: (name IS NOT NULL)
(3 rows)
... but Pg doesn't realise that it's also usable for IS NULL
:
...但 Pg 没有意识到它也可用于IS NULL
:
test=> EXPLAIN SELECT * FROM blah WHERE name IS NULL;
QUERY PLAN
-------------------------------------------------------------------------
Seq Scan on blah (cost=10000000000.00..10000000023.10 rows=7 width=32)
Filter: (name IS NULL)
(2 rows)
and even transforms NOT (name IS NOT NULL)
into name IS NULL
, which is usually what you want.
甚至转换NOT (name IS NOT NULL)
为name IS NULL
,这通常是您想要的。
test=> EXPLAIN SELECT * FROM blah WHERE NOT (name IS NOT NULL);
QUERY PLAN
-------------------------------------------------------------------------
Seq Scan on blah (cost=10000000000.00..10000000023.10 rows=7 width=32)
Filter: (name IS NULL)
(2 rows)
so you're actually better off with two disjoint expression indexes, one on the null and one on the non-null set.
所以实际上你最好使用两个不相交的表达式索引,一个在空集上,一个在非空集上。
test=> DROP INDEX name_notnull ;
DROP INDEX
test=> CREATE INDEX name_notnull ON blah((name IS NOT NULL)) WHERE (name IS NOT NULL);
CREATE INDEX
test=> EXPLAIN SELECT * FROM blah WHERE name IS NOT NULL;
QUERY PLAN
--------------------------------------------------------------------------
Index Scan using name_notnull on blah (cost=0.13..8.14 rows=3 width=32)
Index Cond: ((name IS NOT NULL) = true)
(2 rows)
test=> CREATE INDEX name_null ON blah((name IS NULL)) WHERE (name IS NULL);
CREATE INDEX
craig=> EXPLAIN SELECT * FROM blah WHERE name IS NULL;
QUERY PLAN
-----------------------------------------------------------------------
Index Scan using name_null on blah (cost=0.12..8.14 rows=1 width=32)
Index Cond: ((name IS NULL) = true)
(2 rows)
This is pretty gruesome though. For most sensible uses I'd just use a plain b-tree index. The index size improvement isn't too exciting, at least for small-ish inputs, like the dummy I created with a bunch of md5 values:
虽然这很可怕。对于大多数明智的用途,我只使用普通的 b 树索引。索引大小的改进并不太令人兴奋,至少对于小型输入,比如我用一堆 md5 值创建的虚拟:
test=> SELECT pg_size_pretty(pg_relation_size('blah'));
pg_size_pretty
----------------
9416 kB
(1 row)
test=> SELECT pg_size_pretty(pg_relation_size('blah_name'));
pg_size_pretty
----------------
7984 kB
(1 row)
test=> SELECT pg_size_pretty(pg_relation_size('name_notnull'));
pg_size_pretty
----------------
2208 kB
(1 row)
test=> SELECT pg_size_pretty(pg_relation_size('name_null'));
pg_size_pretty
----------------
2208 kB
(1 row)
回答by fxtentacle
You can use an expression like (title IS NULL) as the indexed column. So this works as expected:
您可以使用像 (title IS NULL) 这样的表达式作为索引列。所以这按预期工作:
CREATE INDEX index_articles_on_title_null ON articles ( (title IS NULL) );
SELECT * FROM articles WHERE (title IS NULL)='t';
This has the big advantage over using a predicate that in this case the value stored in the index is only a yes/no boolean and not the full column value. So especially if your NULL-checked column tends to contain large values (like a title text field here), then this way of indexing is much more space-efficient than using a predicated index.
这比使用谓词有很大的优势,在这种情况下,存储在索引中的值只是一个是/否布尔值,而不是完整的列值。因此,特别是如果您的 NULL 检查列往往包含大值(如此处的标题文本字段),那么这种索引方式比使用谓词索引更节省空间。