Notice: this page is a translated copy of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/309786/

Time: 2020-09-01 00:19:39  Source: igfitidea

How do I force Postgres to use a particular index?

sql postgresql indexing

Asked by mike

How do I force Postgres to use an index when it would otherwise insist on doing a sequential scan?

Answered by Patryk Kordylewski

Assuming you're asking about the common "index hinting" feature found in many databases, PostgreSQL doesn't provide such a feature. This was a conscious decision made by the PostgreSQL team. A good overview of why and what you can do instead can be found here. The reasons are basically that it's a performance hack that tends to cause more problems later down the line as your data changes, whereas PostgreSQL's optimizer can re-evaluate the plan based on the statistics. In other words, what might be a good query plan today probably won't be a good query plan for all time, and index hints force a particular query plan for all time.

As a very blunt hammer, useful for testing, you can use the enable_seqscan and enable_indexscan parameters.

These are not suitable for ongoing production use. If you have issues with query plan choice, you should see the documentation for tracking down query performance issues. Don't just set enable_ params and walk away.
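
For testing only, one way to keep such a switch from leaking into other work is to scope it to a single transaction. The following is a minimal sketch, assuming a hypothetical table orders with an index on customer_id; both names are illustrative and do not come from the original answer.

```sql
-- Testing only: scope the planner switch to one transaction.
-- "orders" and "customer_id" are hypothetical names used for illustration.
BEGIN;
SET LOCAL enable_seqscan = off;   -- discourages (does not forbid) sequential scans
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
ROLLBACK;                         -- SET LOCAL is undone when the transaction ends
```

If the plan still shows a sequential scan with the switch off, the index is probably not usable for that condition at all, which points to a problem with the query or index rather than with the cost estimates.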

Unless you have a very good reason for using the index, Postgres may be making the correct choice. Why?

  • For small tables, it's faster to do sequential scans.
  • Postgres doesn't use indexes when datatypes don't match properly; you may need to include appropriate casts.
  • Your planner settings might be causing problems.
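
The datatype point in the list above can be illustrated with a small sketch. It assumes a hypothetical table items whose id column is an integer with a btree index (for example, a primary key); the table and column names are assumptions, not part of the original answer.

```sql
-- Hypothetical table: items(id integer PRIMARY KEY, ...).
-- The numeric literal makes Postgres compare id as numeric,
-- so the integer btree index on id is typically not usable here:
EXPLAIN SELECT * FROM items WHERE id = 42.0;

-- Matching the column's type lets the index be considered:
EXPLAIN SELECT * FROM items WHERE id = 42;
EXPLAIN SELECT * FROM items WHERE id = CAST('42' AS integer);
```

Whether the index is actually chosen in the last two queries still depends on table size and statistics; the cast only makes it eligible.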

See also this old newsgroup post.

Answered by Niraj Bhawnani

Probably the only valid reason for using

set enable_seqscan=false

is when you're writing queries and want to quickly see what the query plan would actually be were there large amounts of data in the table(s). Or of course if you need to quickly confirm that your query is not using an index simply because the dataset is too small.

Answered by Ziggy Crueltyfree Zeitgeister

Sometimes PostgreSQL fails to make the best choice of indexes for a particular condition. As an example, suppose there is a transactions table with several million rows, of which there are several hundred for any given day, and the table has four indexes: transaction_id, client_id, date, and description. You want to run the following query:

SELECT client_id, SUM(amount)
FROM transactions
WHERE date >= 'yesterday'::timestamp AND date < 'today'::timestamp AND
      description = 'Refund'
GROUP BY client_id

PostgreSQL may choose to use the index transactions_description_idx instead of transactions_date_idx, which may lead to the query taking several minutes instead of less than one second. If this is the case, you can force using the index on date by fudging the condition like this:

SELECT client_id, SUM(amount)
FROM transactions
WHERE date >= 'yesterday'::timestamp AND date < 'today'::timestamp AND
      description||'' = 'Refund'
GROUP BY client_id

Answered by emkey08

Short answer

This problem typically happens when the estimated cost of an index scan is too high and doesn't correctly reflect reality. You may need to lower the random_page_cost configuration parameter to fix this. From the Postgres documentation:

Reducing this value [...] will cause the system to prefer index scans; raising it will make index scans look relatively more expensive.

You can check whether a lower value will actually make Postgres use the index (but use this for testing only):

EXPLAIN <query>;              -- Uses sequential scan
SET random_page_cost = 1;
EXPLAIN <query>;              -- May use index scan now

You can restore the default value again with SET random_page_cost = DEFAULT;

Background

Index scans require non-sequential disk page fetches. Postgres uses random_page_cost to estimate the cost of such non-sequential fetches in relation to sequential fetches. The default value is 4.0, thus assuming an average cost factor of 4 compared to sequential fetches (taking caching effects into account).
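
To see the ratio the planner is currently working with, the two page-cost settings can be read from the pg_settings view. This is a small sketch; on a stock installation the values are expected to be 1 for seq_page_cost and 4 for random_page_cost, but a server may have been configured differently.

```sql
-- Show the two page-cost settings the planner compares.
SELECT name, setting
FROM pg_settings
WHERE name IN ('seq_page_cost', 'random_page_cost');
```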

The problem however is that this default value is unsuitable in the following important real-life scenarios:

1) Solid-state drives

As the documentation admits:

Storage that has a low random read cost relative to sequential, e.g. solid-state drives, might be better modeled with a lower value for random_page_cost.

According to the last point of this slide from a talk at PostgresConf 2018, random_page_cost should be set to something between 1.0 and 2.0 for solid-state drives.
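
Once testing suggests a lower value helps, the setting can be applied more permanently. The sketch below shows two options under stated assumptions: fast_ssd is a hypothetical tablespace name, and 1.1 is only an illustrative value within the 1.0 to 2.0 range mentioned above, not a recommendation from the original answer.

```sql
-- Assumption: "fast_ssd" is a hypothetical tablespace on solid-state storage.
ALTER TABLESPACE fast_ssd SET (random_page_cost = 1.1);

-- Or change it server-wide and reload the configuration:
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();
```

The per-tablespace form is useful when only some of the data lives on fast storage; the server-wide form affects every query.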

2) Cached data

If the required index data is already cached in RAM, an index scan will always be significantly faster than a sequential scan. The documentation says:

Correspondingly, if your data is likely to be completely in cache, [...] decreasing random_page_cost can be appropriate.

The problem is that you of course can't easily know whether the relevant data is already cached. However, if a specific index is frequently queried, and if the system has sufficient RAM, then data is likely to be cached, and random_page_cost should be set to a lower value. You'll have to experiment with different values and see what works for you.

You might also want to use the pg_prewarm extension for explicit data caching.
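
A minimal sketch of explicit caching with pg_prewarm follows. It assumes the contrib extension is available on the server and reuses the index name transactions_date_idx from the earlier answer purely as an illustration.

```sql
-- pg_prewarm ships as a contrib extension and must be installed first.
CREATE EXTENSION IF NOT EXISTS pg_prewarm;

-- Load an index into the buffer cache; returns the number of blocks loaded.
-- "transactions_date_idx" is reused from the earlier example as an illustration.
SELECT pg_prewarm('transactions_date_idx');
```

Prewarmed blocks can still be evicted later under memory pressure, so this does not guarantee the data stays cached.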

Answered by emkey08

The question in itself is very much invalid. Forcing (for example, by setting enable_seqscan=off) is a very bad idea. It might be useful to check whether it will be faster, but production code should never use such tricks.

Instead, run EXPLAIN ANALYZE on your query, read the output, and find out why PostgreSQL chooses a bad (in your opinion) plan.
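
As a sketch of what that looks like in practice, the statement below reuses the transactions table from the earlier answer as an illustration. The BUFFERS option is an addition here, not something the original answer mentions.

```sql
-- EXPLAIN ANALYZE actually executes the query, so be careful with writes.
-- BUFFERS adds cache hit/read counts to each plan node.
EXPLAIN (ANALYZE, BUFFERS)
SELECT client_id, SUM(amount)
FROM transactions
WHERE date >= 'yesterday'::timestamp AND date < 'today'::timestamp
GROUP BY client_id;
```

Comparing the estimated row counts with the actual row counts in the output is usually the quickest way to spot where the planner's assumptions go wrong.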

There are tools on the web that help with reading EXPLAIN ANALYZE output. One of them is explain.depesz.com, written by me.

Another option is to join the #postgresql channel on the freenode IRC network and talk to the people there to help you out, as optimizing a query is not a matter of "ask a question, get an answer, be happy". It's more like a conversation, with many things to check and many things to be learned.

Answered by Antony Gibbs

There is a trick to push Postgres to prefer a seqscan: add an OFFSET 0 to the subquery.

This is handy for optimizing requests joining big/huge tables when all you need is the first/last n elements.

Let's say you are looking for the first/last 20 elements across multiple tables with 100k (or more) entries each. There is no point building/joining the whole query over all the data when what you're looking for is in the first 100 or 1000 entries. In this scenario, for example, a sequential scan turns out to be over 10x faster.
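
The OFFSET 0 trick can be sketched as follows. The tables events and accounts and their columns are hypothetical names chosen for illustration; whether this form is actually faster depends on the data, so the two variants should be compared with EXPLAIN ANALYZE.

```sql
-- Hypothetical tables: events(id, account_id, created_at), accounts(id, name).
-- OFFSET 0 changes no results, but it keeps the planner from flattening
-- the subquery into the outer query, so the inner part is planned on its own.
SELECT e.id, e.created_at, a.name
FROM (
    SELECT id, account_id, created_at
    FROM events
    WHERE created_at >= now() - interval '1 day'
    OFFSET 0
) AS e
JOIN accounts AS a ON a.id = e.account_id
LIMIT 20;
```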

See How can I prevent Postgres from inlining a subquery?
