PostgreSQL query not using index in production
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original link: http://stackoverflow.com/questions/9475778/
Asked by vonconrad
I'm noticing something strange:
The exact same query in development and production is not using the same query path. In particular, the development version uses indexes that are omitted in production (in favor of a seqscan).
The only real difference is that the dataset in production is significantly larger - the index size is 1034 MB in production, vs. 29 MB in development. Would PostgreSQL abstain from using indexes if they (or the table) are too big?
EDIT: EXPLAIN ANALYZE for both queries:
Development:
Limit (cost=41638.15..41638.20 rows=20 width=154) (actual time=159.576..159.581 rows=20 loops=1)
-> Sort (cost=41638.15..41675.10 rows=14779 width=154) (actual time=159.575..159.577 rows=20 loops=1)
Sort Key: (sum(scenario_ad_group_performances.clicks))
Sort Method: top-N heapsort Memory: 35kB
-> GroupAggregate (cost=0.00..41244.89 rows=14779 width=154) (actual time=0.040..151.535 rows=14197 loops=1)
-> Nested Loop Left Join (cost=0.00..31843.75 rows=93800 width=154) (actual time=0.022..82.509 rows=50059 loops=1)
-> Merge Left Join (cost=0.00..4203.46 rows=14779 width=118) (actual time=0.017..27.103 rows=14197 loops=1)
Merge Cond: (scenario_ad_groups.id = scenario_ad_group_vendor_instances.ad_group_id)
-> Index Scan using scenario_ad_groups_pkey on scenario_ad_groups (cost=0.00..2227.06 rows=14779 width=114) (actual time=0.009..12.085 rows=14197 loops=1)
Filter: (scenario_id = 22)
-> Index Scan using index_scenario_ad_group_vendor_instances_on_ad_group_id on scenario_ad_group_vendor_instances (cost=0.00..1737.02 rows=27447 width=8) (actual time=0.007..7.021 rows=16528 loops=1)
Filter: (ad_vendor_id = ANY ('{1,2,3}'::integer[]))
-> Index Scan using index_ad_group_performances_on_vendor_instance_id_and_date on scenario_ad_group_performances (cost=0.00..1.73 rows=11 width=44) (actual time=0.002..0.003 rows=3 loops=14197)
Index Cond: ((vendor_instance_id = scenario_ad_group_vendor_instances.id) AND (date >= '2012-02-01'::date) AND (date <= '2012-02-28'::date))
Total runtime: 159.710 ms
Production:
Limit (cost=822401.35..822401.40 rows=20 width=179) (actual time=21279.547..21279.591 rows=20 loops=1)
-> Sort (cost=822401.35..822488.42 rows=34828 width=179) (actual time=21279.543..21279.560 rows=20 loops=1)
Sort Key: (sum(scenario_ad_group_performances.clicks))
Sort Method: top-N heapsort Memory: 33kB
-> GroupAggregate (cost=775502.60..821474.59 rows=34828 width=179) (actual time=19126.783..21226.772 rows=34495 loops=1)
-> Sort (cost=775502.60..776739.48 rows=494751 width=179) (actual time=19125.902..19884.164 rows=675253 loops=1)
Sort Key: scenario_ad_groups.id
Sort Method: external merge Disk: 94200kB
-> Hash Right Join (cost=25743.86..596796.70 rows=494751 width=179) (actual time=1155.491..16720.460 rows=675253 loops=1)
Hash Cond: (scenario_ad_group_performances.vendor_instance_id = scenario_ad_group_vendor_instances.id)
-> Seq Scan on scenario_ad_group_performances (cost=0.00..476354.29 rows=4158678 width=44) (actual time=0.043..8949.640 rows=4307019 loops=1)
Filter: ((date >= '2012-02-01'::date) AND (date <= '2012-02-28'::date))
-> Hash (cost=24047.72..24047.72 rows=51371 width=143) (actual time=1123.896..1123.896 rows=34495 loops=1)
Buckets: 1024 Batches: 16 Memory Usage: 392kB
-> Hash Right Join (cost=6625.90..24047.72 rows=51371 width=143) (actual time=92.257..1070.786 rows=34495 loops=1)
Hash Cond: (scenario_ad_group_vendor_instances.ad_group_id = scenario_ad_groups.id)
-> Seq Scan on scenario_ad_group_vendor_instances (cost=0.00..11336.31 rows=317174 width=8) (actual time=0.020..451.496 rows=431770 loops=1)
Filter: (ad_vendor_id = ANY ('{1,2,3}'::integer[]))
-> Hash (cost=5475.55..5475.55 rows=34828 width=139) (actual time=88.311..88.311 rows=34495 loops=1)
Buckets: 1024 Batches: 8 Memory Usage: 726kB
-> Bitmap Heap Scan on scenario_ad_groups (cost=798.20..5475.55 rows=34828 width=139) (actual time=4.451..44.065 rows=34495 loops=1)
Recheck Cond: (scenario_id = 276)
-> Bitmap Index Scan on index_scenario_ad_groups_on_scenario_id (cost=0.00..789.49 rows=34828 width=0) (actual time=4.232..4.232 rows=37006 loops=1)
Index Cond: (scenario_id = 276)
Total runtime: 21306.697 ms
Answered by J Cooper
Disclaimer
I have used PostgreSQL very little. I'm answering based on my knowledge of SQL Server index usage and execution plans. I ask the PostgreSQL gods for mercy if I get something wrong.
Query Optimizers are Dynamic
You said your query plan has changed from your development to production environments. This is to be expected. Query optimizers are designed to generate the optimum execution plan based on the current data conditions. Under different conditions the optimizer may decide it is more efficient to use a table scan vs an index scan.
When would it be more efficient to use a table scan vs an index scan?
SELECT A, B
FROM someTable
WHERE A = 'SOME VALUE'
Let's say you have a non-clustered index on column A. In this case you are filtering on column A, which could potentially take advantage of the index. This would be efficient if the index is selective enough - basically, how many distinct values make up the index? The database keeps statistics on this selectivity and uses them when calculating the costs of execution plans.
If you have a million rows in a table, but only 10 possible values for A, then your query would likely return about 100K rows. Because the index is non-clustered, and you are returning column B, which is not included in the index, a lookup will need to be performed for each row returned. These lookups are random-access reads, which are much more expensive than the sequential reads used by a table scan. At a certain point it becomes more efficient for the database to just perform a table scan rather than an index scan.
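For illustration, this is roughly how you could inspect the selectivity statistics PostgreSQL keeps (a sketch; someTable/A are the hypothetical names from the example above):

-- pg_stats exposes the planner's per-column statistics.
-- A negative n_distinct means "this fraction of the row count is distinct".
SELECT attname, n_distinct, null_frac, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'sometable'
  AND attname = 'a';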
This is just one scenario; there are many others. It's hard to know without knowing more about what your data looks like, what your indexes look like, and how you are trying to access the data.
To answer the original question:
Would PostgreSQL abstain from using indexes if they (or the table) are too big? No. It is more likely that, given the way you are accessing the data, it is less efficient for PostgreSQL to use the index than to use a table scan.
The PostgreSQL FAQ touches on this very subject (see: Why are my queries slow? Why don't they use my indexes?): https://wiki.postgresql.org/wiki/FAQ#Why_are_my_queries_slow.3F_Why_don.27t_they_use_my_indexes.3F
Answered by aib
Postgres' query optimizer comes up with multiple scenarios (e.g. index vs seq-scan) and evaluates them using statistical information about your tables and the relative costs of disk/memory/index/table access set in configuration.
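For instance, those relative costs are ordinary configuration parameters; a quick way to inspect the ones the planner weighs (standard PostgreSQL settings):

-- Cost constants the planner uses when comparing candidate plans.
SHOW seq_page_cost;         -- cost of a sequential page read (default 1.0)
SHOW random_page_cost;      -- cost of a random page read (default 4.0)
SHOW cpu_tuple_cost;        -- cost of processing one row (default 0.01)
SHOW effective_cache_size;  -- how much data PG assumes is already cached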
Did you use the EXPLAIN command to see why index use was omitted? Did you use EXPLAIN ANALYZE to find out if the decision was in error? Can we see the outputs, please?
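In case it helps, the invocation is simply the keyword in front of the query; here applied to a hypothetical slice of the question's schema:

-- EXPLAIN prints the chosen plan with cost estimates;
-- EXPLAIN ANALYZE also executes the query and adds actual row counts and timings.
EXPLAIN ANALYZE
SELECT * FROM scenario_ad_groups WHERE scenario_id = 276;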
edit:
As hard as it is to analyze two different queries on two different systems, I think I see a couple of things.
The production environment has a cost/actual-time ratio of around 20-100 cost units per millisecond. I'm not even a DBA, but this seems consistent. The development environment shows 261 for the main query. Does this seem right? Would you expect the raw speed (memory/disk/CPU) of the production environment to be 2-10x slower than dev?
Since the production environment has a much more complex query plan, it looks like it's doing its job. Undoubtedly, the dev environment's plan and many more have been considered, and deemed too costly. And the 20-100 variance isn't that much in my experience (but again, not a DBA) and shows that there isn't anything way off the mark. Still, you may want to run a VACUUM on the DB just in case.
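Something along these lines should do it (a sketch; ANALYZE also refreshes the statistics the planner bases its estimates on):

-- VACUUM reclaims dead rows; ANALYZE updates the planner's statistics.
VACUUM ANALYZE scenario_ad_groups;
VACUUM ANALYZE scenario_ad_group_vendor_instances;
VACUUM ANALYZE scenario_ad_group_performances;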
I'm not experienced and patient enough to decode the full query, but could there be a denormalization/NOSQL-ization point for optimization?
The biggest bottleneck seems to be the disk merge at 90 MB. If the production environment has enough memory, you may want to increase the relevant setting (working memory?) so the sort happens in memory. It seems to be the work_mem parameter here, though you'll want to read through the rest of the documentation.
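As a sketch, assuming the production server can spare the memory, you could raise it for one session and re-run the query (the 94200kB figure comes from the "external merge Disk" line in the plan above):

-- The sort spilled ~92 MB to disk; a work_mem above that lets it run in memory.
-- work_mem applies per sort/hash operation and per connection, so size it with care.
SET work_mem = '128MB';
-- ...re-run EXPLAIN ANALYZE on the original query here...
RESET work_mem;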
I'd also suggest having a look at the index usage statistics. Many options with partial and functional indices exist.
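A sketch for checking that, using the standard statistics view (an index that never shows up in idx_scan is either unused or being rejected by the planner):

-- Per-index usage counters since the last statistics reset.
SELECT schemaname, relname, indexrelname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;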
Answered by Ramil Amerzyanov
Try

SET enable_seqscan TO 'off';

before EXPLAIN ANALYZE.
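Spelled out as a full diagnostic session (a sketch; this setting is for comparing plans, not something to leave off in production):

-- Make sequential scans prohibitively expensive so the planner prefers indexes,
-- then compare the resulting EXPLAIN ANALYZE output against the seq-scan plan.
SET enable_seqscan TO 'off';
-- EXPLAIN ANALYZE <the original query>;
RESET enable_seqscan;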
Answered by A.H.
It seems to me that your dev data is much "simpler" than the production data. As an example:
Development:
-> Index Scan using index_scenario_ad_group_vendor_instances_on_ad_group_id on scenario_ad_group_vendor_instances
(cost=0.00..1737.02 rows=27447 width=8)
(actual time=0.007..7.021 rows=16528 loops=1)
Filter: (ad_vendor_id = ANY ('{1,2,3}'::integer[]))
Production:
-> Seq Scan on scenario_ad_group_vendor_instances
(cost=0.00..11336.31 rows=317174 width=8)
(actual time=0.020..451.496 rows=431770 loops=1)
Filter: (ad_vendor_id = ANY ('{1,2,3}'::integer[]))
This means that in dev, 27447 matching rows were estimated upfront and 16528 rows were actually found. That's in the same ballpark and OK.
In production, 317174 matching rows were estimated upfront and 431770 rows were found. Also OK.
But comparing dev to prod means that the numbers differ by a factor of 10. As other answers already indicate, doing 10 times more random seeks (due to index access) might indeed be worse than a plain table scan.
Hence the interesting question is: how many rows does this table contain in dev and in prod? Is number_returned_rows / number_total_rows comparable between dev and prod?
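For example, a rough way to compute that ratio for the filter shown above (a sketch against the question's table; run it on both dev and prod and compare):

-- Fraction of rows matched by the filter; the closer it is to 1,
-- the less attractive an index scan becomes compared to a seq scan.
SELECT sum(CASE WHEN ad_vendor_id IN (1, 2, 3) THEN 1 ELSE 0 END)::numeric
       / count(*) AS match_fraction
FROM scenario_ad_group_vendor_instances;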
Edit: Don't forget: I have picked one index access as an example. A quick glance shows that the other index accesses have the same symptoms.