使用嵌套循环提高 SQL 查询的性能 - PostgreSQL
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/36444580/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Improve performance on SQL query with Nested Loop - PostgreSQL
提问by Dabeliuteef
I am using PostgreSQLand I have a weird problem with my SQL query. Depending on wich date paramter I'm using. My request doesn't do the same operation.
我正在使用PostgreSQL,但我的 SQL 查询有一个奇怪的问题。取决于我使用的日期参数。我的请求没有执行相同的操作。
This is my working query :
这是我的工作查询:
SELECT DISTINCT app.id_application
FROM stat sj
LEFT OUTER JOIN groupe gp ON gp.id_groupe = sj.id_groupe
LEFT OUTER JOIN application app ON app.id_application = gp.id_application
WHERE date_stat >= '2016/3/01'
AND date_stat <= '2016/3/31'
AND ( date_stat = date_gen-1 or (date_gen = '2016/04/01' AND date_stat = '2016/3/31'))
AND app.id_application IS NOT NULL
This query takes around 2 secondes (which is OKAY for me because I have a lots of rows). When I run EXPLAIN ANALYSE for this query I have this:
这个查询需要大约 2 秒(这对我来说没问题,因为我有很多行)。当我为这个查询运行 EXPLAIN ANALYZE 时,我有这个:
HashAggregate (cost=375486.95..375493.62 rows=667 width=4) (actual time=2320.541..2320.656 rows=442 loops=1)
-> Hash Join (cost=254.02..375478.99 rows=3186 width=4) (actual time=6.144..2271.984 rows=263274 loops=1)
Hash Cond: (gp.id_application = app.id_application)
-> Hash Join (cost=234.01..375415.17 rows=3186 width=4) (actual time=5.926..2200.671 rows=263274 loops=1)
Hash Cond: (sj.id_groupe = gp.id_groupe)
-> Seq Scan on stat sj (cost=0.00..375109.47 rows=3186 width=8) (actual time=3.196..2068.357 rows=263274 loops=1)
Filter: ((date_stat >= '2016-03-01'::date) AND (date_stat <= '2016-03-31'::date) AND ((date_stat = (date_gen - 1)) OR ((date_gen = '2016-04-01'::date) AND (date_stat = '2016-03-31'::date))))
Rows Removed by Filter: 7199514
-> Hash (cost=133.45..133.45 rows=8045 width=12) (actual time=2.677..2.677 rows=8019 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 345kB
-> Seq Scan on groupe gp (cost=0.00..133.45 rows=8045 width=12) (actual time=0.007..1.284 rows=8019 loops=1)
-> Hash (cost=11.67..11.67 rows=667 width=4) (actual time=0.206..0.206 rows=692 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 25kB
-> Seq Scan on application app (cost=0.00..11.67 rows=667 width=4) (actual time=0.007..0.101 rows=692 loops=1)
Filter: (id_application IS NOT NULL)
Total runtime: 2320.855 ms
Now, When I'm trying the same query for the current month (we are the 6th of April, so I'm trying to get all the application_id of April) with the same query
现在,当我为当月尝试相同的查询时(我们是 4 月 6 日,所以我试图获取所有四月的 application_id)使用相同的查询
SELECT DISTINCT app.id_application
FROM stat sj
LEFT OUTER JOIN groupe gp ON gp.id_groupe = sj.id_groupe
LEFT OUTER JOIN application app ON app.id_application = gp.id_application
WHERE date_stat >= '2016/04/01'
AND date_stat <= '2016/04/30'
AND ( date_stat = date_gen-1 or ( date_gen = '2016/05/01' AND date_job = '2016/04/30'))
AND app.id_application IS NOT NULL
This query takes now 120 seconds. So I also ran EXPLAIN ANALYZE on this query and now it doesn't have the same operations:
此查询现在需要 120 秒。所以我也在这个查询上运行了 EXPLAIN ANALYZE ,现在它没有相同的操作:
HashAggregate (cost=375363.50..375363.51 rows=1 width=4) (actual time=186716.468..186716.532 rows=490 loops=1)
-> Nested Loop (cost=0.00..375363.49 rows=1 width=4) (actual time=1.945..186619.404 rows=118990 loops=1)
Join Filter: (gp.id_application = app.id_application)
Rows Removed by Join Filter: 82222090
-> Nested Loop (cost=0.00..375343.49 rows=1 width=4) (actual time=1.821..171458.237 rows=118990 loops=1)
Join Filter: (sj.id_groupe = gp.id_groupe)
Rows Removed by Join Filter: 954061820
-> Seq Scan on stat sj (cost=0.00..375109.47 rows=1 width=8) (actual time=0.235..1964.423 rows=118990 loops=1)
Filter: ((date_stat >= '2016-04-01'::date) AND (date_stat <= '2016-04-30'::date) AND ((date_stat = (date_gen - 1)) OR ((date_gen = '2016-05-01'::date) AND (date_stat = '2016-04-30'::date))))
Rows Removed by Filter: 7343798
-> Seq Scan on groupe gp (cost=0.00..133.45 rows=8045 width=12) (actual time=0.002..0.736 rows=8019 loops=118990)
-> Seq Scan on application app (cost=0.00..11.67 rows=667 width=4) (actual time=0.003..0.073 rows=692 loops=118990)
Filter: (id_application IS NOT NULL)
Total runtime: 186716.635 ms
So I decided to search where the problem came from by reducing the number of conditions from my query until the performances is acceptable again.
所以我决定通过减少查询中的条件数量来搜索问题的来源,直到性能再次可以接受。
So with only this parameter
所以只有这个参数
WHERE date_stat >= '2016/04/01'
It takes only 1.9secondes (like the first working query) and it's also working with 2 parameters :
它只需要 1.9 秒(就像第一个工作查询一样)并且它也使用 2 个参数:
WHERE date_stat >= '2016/04/01'
AND app.id_application IS NOT NULL
BUT when I try to add one of those line I have the Nested loop in the Explain
但是当我尝试添加其中一行时,我在解释中有嵌套循环
AND date_stat <= '2016/04/30'
AND ( date_stat = date_gen-1 or ( date_gen = '2016/05/01' AND date_stat = '2016/04/30'))
Does someone have any idea where it could come from?
有人知道它可能来自哪里吗?
回答by Gabriel's Messanger
Ok, it looks like there's problem with optimizer estimations. He thiks that for april there will be only 1 row
so he choose NESTED LOOP
which is very inefficient for big number of rows (118,990
in that case).
好的,优化器估计似乎存在问题。他认为 4 月只有1 row
这样,他才会选择NESTED LOOP
对于大量行(118,990
在这种情况下)效率非常低的方法。
- Perform
VACUUM ANALYZE
for every table. This will clean up dead tuples and refresh statistics. - consider adding index based on
dates
likeCREATE INDEX date_stat_idx ON <table with date_stat> USING btree (date_stat);
VACUUM ANALYZE
为每个表执行。这将清理死元组并刷新统计信息。- 考虑添加基于
dates
喜欢的索引CREATE INDEX date_stat_idx ON <table with date_stat> USING btree (date_stat);
Rerun the query,
重新运行查询,