Optimizing multiple joins

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, link to the original, and attribute it to the original authors (not me): StackOverflow

Original URL: http://stackoverflow.com/questions/2086894/
Asked by salathe
I'm trying to figure out a way to speed up a particularly cumbersome query which aggregates some data by date across a couple of tables. The full (ugly) query is below along with an EXPLAIN ANALYZE to show just how horrible it is.
If anyone could take a peek and see if they can spot any major issues (which is likely; I'm not a Postgres guy), that would be superb.
So here goes. The query is:
SELECT
to_char(p.period, 'DD/MM/YY') as period,
coalesce(o.value, 0) AS outbound,
coalesce(i.value, 0) AS inbound
FROM (
SELECT
date '2009-10-01' + s.day
AS period
FROM generate_series(0, date '2009-10-31' - date '2009-10-01') AS s(day)
) AS p
LEFT OUTER JOIN(
SELECT
SUM(b.body_size) AS value,
b.body_time::date AS period
FROM body AS b
LEFT JOIN
envelope e ON e.message_id = b.message_id
WHERE
e.envelope_command = 1
AND b.body_time BETWEEN '2009-10-01'
AND (date '2009-10-31' + INTERVAL '1 DAY')
GROUP BY period
ORDER BY period
) AS o ON p.period = o.period
LEFT OUTER JOIN(
SELECT
SUM(b.body_size) AS value,
b.body_time::date AS period
FROM body AS b
LEFT JOIN
envelope e ON e.message_id = b.message_id
WHERE
e.envelope_command = 2
AND b.body_time BETWEEN '2009-10-01'
AND (date '2009-10-31' + INTERVAL '1 DAY')
GROUP BY period
ORDER BY period
) AS i ON p.period = i.period
The EXPLAIN ANALYZE can be found here: on explain.depesz.com
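For anyone reproducing this: such a plan comes from prefixing the statement with EXPLAIN ANALYZE. A minimal sketch, run here against only the date-series part of the query above rather than the full query:

-- EXPLAIN ANALYZE executes the statement and reports actual row counts
-- and timings for each plan node (sketch: date-series subquery only).
EXPLAIN ANALYZE
SELECT date '2009-10-01' + s.day AS period
FROM generate_series(0, date '2009-10-31' - date '2009-10-01') AS s(day);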
Any comments or questions are appreciated.
Cheers
Accepted answer by salathe
Building on Craig Young's suggestions, here is the amended query, which runs in ~1.8 seconds for the data set I'm working on. That is a slight improvement on the original ~2.0s and a huge improvement on Craig's version, which took ~22s.
SELECT
p.period,
/* The pivot technique... */
SUM(CASE envelope_command WHEN 1 THEN body_size ELSE 0 END) AS Outbound,
SUM(CASE envelope_command WHEN 2 THEN body_size ELSE 0 END) AS Inbound
FROM
(
/* Get days range */
SELECT date '2009-10-01' + day AS period
FROM generate_series(0, date '2009-10-31' - date '2009-10-01') AS day
) p
/* Join message information */
LEFT OUTER JOIN
(
SELECT b.body_size, b.body_time::date, e.envelope_command
FROM body AS b
INNER JOIN envelope e ON e.message_id = b.message_id
WHERE
e.envelope_command IN (2, 1)
AND b.body_time::date BETWEEN (date '2009-10-01') AND (date '2009-10-31')
) d ON d.body_time = p.period
GROUP BY p.period
ORDER BY p.period
Answered by Disillusioned
There are always 2 things to consider when optimising queries:
- What indexes can be used (you may need to create indexes; a sketch follows this list)
- How the query is written (you may need to change the query so that the query optimiser is able to find appropriate indexes, and does not re-read data redundantly)
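For example, a minimal sketch of indexes that could serve the queries in this question (assuming none exist yet; the index names are hypothetical, and the columns should be checked against the real schema):

-- Hypothetical index to support range filters on body.body_time.
CREATE INDEX body_body_time_idx ON body (body_time);
-- Hypothetical index to support the join on message_id plus the
-- envelope_command filter.
CREATE INDEX envelope_message_command_idx ON envelope (message_id, envelope_command);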
A few observations:
- You are performing date manipulations before you join your dates. As a general rule this will prevent a query optimiser from using an index even if it exists. You should try to write your expressions in such a way that indexed columns exist unaltered on one side of the expression (see the sketch after this list).
- Your subqueries are filtering to the same date range as generate_series. This is a duplication, and it limits the optimiser's ability to choose the most efficient optimisation. I suspect that may have been written in to improve performance because the optimiser was unable to use an index on the date column (body_time)? NOTE: We would actually very much like to use an index on Body.body_time.
- ORDER BY within the subqueries is at best redundant. At worst it could force the query optimiser to sort the result set before joining, and that is not necessarily good for the query plan. Rather, only apply ordering right at the end for final display.
- Use of LEFT JOIN in your subqueries is inappropriate. Assuming you're using ANSI conventions for NULL behaviour (and you should be), any outer joins to envelope would return envelope_command = NULL, and those rows would consequently be excluded by the condition envelope_command = ?.
- Subqueries o and i are almost identical save for the envelope_command value. This forces the optimiser to scan the same underlying tables twice. You can use a pivot table technique to join to the data once and split the values into two columns.
Try the following which uses the pivot technique:
SELECT p.period,
/*The pivot technique in action...*/
SUM(
CASE WHEN envelope_command = 1 THEN body_size
ELSE 0
END) AS Outbound,
SUM(
CASE WHEN envelope_command = 2 THEN body_size
ELSE 0
END) AS Inbound
FROM (
SELECT date '2009-10-01' + s.day AS period
FROM generate_series(0, date '2009-10-31' - date '2009-10-01') AS s(day)
) AS p
/*The left JOIN is justified to ensure ALL generated dates are returned
Also: it joins to a subquery, else the JOIN to envelope _could_ exclude some generated dates*/
LEFT OUTER JOIN (
SELECT b.body_size,
b.body_time,
e.envelope_command
FROM body AS b
INNER JOIN envelope e
ON e.message_id = b.message_id
WHERE envelope_command IN (1, 2)
) d
/*The expressions below allow the optimiser to use an index on body_time if
the statistics indicate it would be beneficial*/
ON d.body_time >= p.period
AND d.body_time < p.period + INTERVAL '1 DAY'
GROUP BY p.Period
ORDER BY p.Period
EDIT: Added filter suggested by Tom H.
Answered by Tom H
I uninstalled my PostgreSQL server a couple of days ago, so you'll likely have to play around with this, but hopefully it's a good start for you.
The keys are:
- You shouldn't need the subqueries - just do the direct joins and aggregate
- You should be able to use INNER JOINs, which are typically more performant than OUTER JOINs
If nothing else, I think that the query below is a bit clearer.
I used a calendar table in my query, but you can replace that with generate_series as you were using it; a minimal sketch of the substitution follows.
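A minimal sketch of that substitution, assuming the same October 2009 range as the rest of this question:

-- generate_series equivalent of a calendar table row source; wrap this
-- SELECT as a derived table aliased CAL in place of "Calendar CAL" in
-- the query below.
SELECT date '2009-10-01' + s.day AS calendar_date
FROM generate_series(0, date '2009-10-31' - date '2009-10-01') AS s(day);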
Also, depending on indexing, it might be better to compare body_time with >= and < rather than pulling out the date part and comparing. I don't know enough about PostgreSQL to know how it works behind the scenes, so I would try both approaches to see which the server can optimize better. In pseudo-code you would be doing: body_time >= date (time=midnight) AND body_time < date + 1 (time=midnight).
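A minimal sketch of the range approach applied to the calendar join (outbound side only, for brevity; it assumes the generate_series derived table above in place of the Calendar table):

-- Half-open range join: body_time is left unaltered, so an index on it
-- remains usable, and each row falls into exactly one calendar day.
SELECT CAL.calendar_date AS period,
       SUM(OB.body_size) AS outbound
FROM (
    SELECT date '2009-10-01' + s.day AS calendar_date
    FROM generate_series(0, date '2009-10-31' - date '2009-10-01') AS s(day)
) CAL
INNER JOIN Body OB
    ON OB.body_time >= CAL.calendar_date
   AND OB.body_time < CAL.calendar_date + INTERVAL '1 day'
INNER JOIN Envelope OE
    ON OE.message_id = OB.message_id
   AND OE.envelope_command = 1
GROUP BY CAL.calendar_date
ORDER BY CAL.calendar_date;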
SELECT
CAL.calendar_date AS period,
SUM(OB.body_size) AS outbound,
SUM(IB.body_size) AS inbound
FROM
Calendar CAL
INNER JOIN Body OB ON
OB.body_time::date = CAL.calendar_date
INNER JOIN Envelope OE ON
OE.message_id = OB.message_id AND
OE.envelope_command = 1
INNER JOIN Body IB ON
IB.body_time::date = CAL.calendar_date
INNER JOIN Envelope IE ON
IE.message_id = IB.message_id AND
IE.envelope_command = 2
GROUP BY
CAL.calendar_date