postgresql - Postgres not using index when index scan is much better option
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/34537096/
Postgres not using index when index scan is much better option
Asked by Ryan Her
I have a simple query to join two tables that's being really slow. I found out that the query plan does a seq scan on the large table email_activities
(~10m rows) while I think using indexes doing nested loops will actually be faster.
I rewrote the query using a subquery in an attempt to force the use of index, then noticed something interesting. If you look at the two query plans below, you will see that when I limit the result set of subquery to 43k, query plan does use index on email_activities while setting the limit in subquery to even 44k will cause query plan to use seq scan on email_activities
. One is clearly more efficient than the other, but Postgres doesn't seem to care.
What could cause this? Is there a config somewhere that forces the use of a hash join if one of the sets is larger than a certain size?
explain analyze SELECT COUNT(DISTINCT "email_activities"."email_recipient_id") FROM "email_activities" where email_recipient_id in (select "email_recipients"."id" from email_recipients WHERE "email_recipients"."email_campaign_id" = 1607 limit 43000);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=118261.50..118261.50 rows=1 width=4) (actual time=224.556..224.556 rows=1 loops=1)
-> Nested Loop (cost=3699.03..118147.99 rows=227007 width=4) (actual time=32.586..209.076 rows=40789 loops=1)
-> HashAggregate (cost=3698.94..3827.94 rows=43000 width=4) (actual time=32.572..47.276 rows=43000 loops=1)
-> Limit (cost=0.09..3548.44 rows=43000 width=4) (actual time=0.017..22.547 rows=43000 loops=1)
-> Index Scan using index_email_recipients_on_email_campaign_id on email_recipients (cost=0.09..5422.47 rows=65710 width=4) (actual time=0.017..19.168 rows=43000 loops=1)
Index Cond: (email_campaign_id = 1607)
-> Index Only Scan using index_email_activities_on_email_recipient_id on email_activities (cost=0.09..2.64 rows=5 width=4) (actual time=0.003..0.003 rows=1 loops=43000)
Index Cond: (email_recipient_id = email_recipients.id)
Heap Fetches: 40789
Total runtime: 224.675 ms
And:
explain analyze SELECT COUNT(DISTINCT "email_activities"."email_recipient_id") FROM "email_activities" where email_recipient_id in (select "email_recipients"."id" from email_recipients WHERE "email_recipients"."email_campaign_id" = 1607 limit 50000);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=119306.25..119306.25 rows=1 width=4) (actual time=3050.612..3050.613 rows=1 loops=1)
-> Hash Semi Join (cost=4451.08..119174.27 rows=263962 width=4) (actual time=1831.673..3038.683 rows=47935 loops=1)
Hash Cond: (email_activities.email_recipient_id = email_recipients.id)
-> Seq Scan on email_activities (cost=0.00..107490.96 rows=9359988 width=4) (actual time=0.003..751.988 rows=9360039 loops=1)
-> Hash (cost=4276.08..4276.08 rows=50000 width=4) (actual time=34.058..34.058 rows=50000 loops=1)
Buckets: 8192 Batches: 1 Memory Usage: 1758kB
-> Limit (cost=0.09..4126.08 rows=50000 width=4) (actual time=0.016..27.302 rows=50000 loops=1)
-> Index Scan using index_email_recipients_on_email_campaign_id on email_recipients (cost=0.09..5422.47 rows=65710 width=4) (actual time=0.016..22.244 rows=50000 loops=1)
Index Cond: (email_campaign_id = 1607)
Total runtime: 3050.660 ms
- Version: PostgreSQL 9.3.10 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3, 64-bit
- email_activities: ~10m rows
- email_recipients: ~11m rows
Answered by Erwin Brandstetter
Index scan -> bitmap index scan -> sequential scan
For few rows it pays to run an index scan. With more rows to return (a higher percentage of the table, and depending on data distribution, value frequencies and row width) it becomes more likely that several rows sit on the same data page. Then it pays to switch to a bitmap index scan. Once a large percentage of data pages has to be visited anyway, it is cheaper to run a sequential scan, filter surplus rows and skip the overhead for indexes altogether.
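This progression can be observed directly with EXPLAIN by widening the filtered range. A sketch, assuming a hypothetical table tbl with an index on col (the exact tipping points depend on your statistics and cost settings):

```sql
EXPLAIN SELECT * FROM tbl WHERE col < 10;      -- few rows: Index Scan
EXPLAIN SELECT * FROM tbl WHERE col < 1000;    -- more rows: Bitmap Heap Scan
EXPLAIN SELECT * FROM tbl WHERE col < 100000;  -- large fraction: Seq Scan
```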
Postgres switches to a sequential scan here because it expects to find rows=263962, which is already 3 % of the whole table. (Only rows=47935 are actually found; see below.)
More in this related answer:
Beware of forcing query plans
You cannot force a certain planner method directly in Postgres, but you can make other methods seem extremely expensive for debugging purposes. See Planner Method Configuration in the manual.
SET enable_seqscan = off (as suggested in another answer) does that to sequential scans. But that is intended for debugging purposes in your session only. Do not use this as a general setting in production unless you know exactly what you are doing. It can force ridiculous query plans. Quoting the manual:
These configuration parameters provide a crude method of influencing the query plans chosen by the query optimizer. If the default plan chosen by the optimizer for a particular query is not optimal, a temporary solution is to use one of these configuration parameters to force the optimizer to choose a different plan. Better ways to improve the quality of the plans chosen by the optimizer include adjusting the planner cost constants (see Section 18.7.2), running ANALYZE manually, increasing the value of the default_statistics_target configuration parameter, and increasing the amount of statistics collected for specific columns using ALTER TABLE SET STATISTICS.
That's already most of the advice you need.
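For debugging in a single session, the enable_seqscan toggle mentioned above can be set and reverted explicitly. A sketch using the question's own query:

```sql
SET enable_seqscan = off;  -- session-local; for debugging only
EXPLAIN ANALYZE
SELECT COUNT(DISTINCT a.email_recipient_id)
FROM   email_activities a
WHERE  a.email_recipient_id IN (
   SELECT id
   FROM   email_recipients
   WHERE  email_campaign_id = 1607
   LIMIT  50000);
RESET enable_seqscan;      -- restore the default
```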
In this particular case, Postgres expects 5-6 times more hits on email_activities.email_recipient_id
than are actually found:
estimated rows=227007 vs. actual rows=40789
estimated rows=263962 vs. actual rows=47935
If you run this query often it will pay to have ANALYZE
look at a bigger sample for more accurate statistics on the particular column. Your table is big (~ 10M rows), so make that:
ALTER TABLE email_activities ALTER COLUMN email_recipient_id
SET STATISTICS 3000; -- max 10000, default 100
Then ANALYZE email_activities;
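To verify what the planner now sees, the collected column statistics can be inspected in the standard pg_stats system view (a read-only check, using the question's table and column names):

```sql
SELECT attname, n_distinct, correlation
FROM   pg_stats
WHERE  tablename = 'email_activities'
AND    attname   = 'email_recipient_id';
```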
Measure of last resort
In very rare cases you might resort to forcing an index with SET LOCAL enable_seqscan = off in a separate transaction or in a function with its own environment. Like:
CREATE OR REPLACE FUNCTION f_count_dist_recipients(_email_campaign_id int, _limit int)
  RETURNS bigint AS
$func$
SELECT COUNT(DISTINCT a.email_recipient_id)
FROM   email_activities a
WHERE  a.email_recipient_id IN (
   SELECT id
   FROM   email_recipients
   WHERE  email_campaign_id = _email_campaign_id
   LIMIT  _limit)  -- or consider the query below
$func$ LANGUAGE sql VOLATILE COST 100000 SET enable_seqscan = off;
The setting only applies to the local scope of the function.
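A call would then look like this, with the campaign id and limit from the question:

```sql
SELECT f_count_dist_recipients(1607, 43000);
```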
Warning: This is just a proof of concept. Even this much less radical manual intervention might bite you in the long run. Cardinalities, value frequencies, your schema, global Postgres settings, everything changes over time. You are going to upgrade to a new Postgres version at some point. The query plan you force now may become a very bad idea later.
And typically this is just a workaround for a problem with your setup. Better find and fix it.
Alternative query
Essential information is missing in the question, but this equivalent query is probably faster and more likely to use an index on (email_recipient_id) - increasingly so for a bigger LIMIT.
SELECT COUNT(*) AS ct
FROM  (
   SELECT id
   FROM   email_recipients
   WHERE  email_campaign_id = 1607
   LIMIT  43000
   ) r
WHERE  EXISTS (
   SELECT 1
   FROM   email_activities
   WHERE  email_recipient_id = r.id);
Answered by Ctx
A sequential scan can be more efficient, even when an index exists. In this case, Postgres seems to estimate rather wrongly. An ANALYZE <TABLE> on all related tables can help in such cases. If it doesn't, you can set the variable enable_seqscan to OFF to force Postgres to use an index whenever technically possible, at the expense that sometimes an index scan will be used when a sequential scan would perform better.