SQL: Optimizing multiple joins

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA terms and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/2086894/

Date: 2020-09-01 05:07:16  Source: igfitidea

Optimizing multiple joins

Tags: sql, postgresql, optimization, join

Asked by salathe

I'm trying to figure out a way to speed up a particularly cumbersome query which aggregates some data by date across a couple of tables. The full (ugly) query is below along with an EXPLAIN ANALYZE to show just how horrible it is.

If anyone could take a peek and see if they can spot any major issues (which is likely, I'm not a Postgres guy) that would be superb.

So here goes. The query is:

SELECT 
 to_char(p.period, 'DD/MM/YY') as period,
 coalesce(o.value, 0) AS outbound,
 coalesce(i.value, 0) AS inbound
FROM (
 SELECT
  date '2009-10-01' + s.day 
  AS period 
  FROM generate_series(0, date '2009-10-31' - date '2009-10-01') AS s(day)
) AS p 
LEFT OUTER JOIN(
 SELECT
  SUM(b.body_size) AS value, 
  b.body_time::date AS period 
 FROM body AS b 
 LEFT JOIN 
  envelope e ON e.message_id = b.message_id 
 WHERE 
  e.envelope_command = 1 
  AND b.body_time BETWEEN '2009-10-01' 
  AND (date '2009-10-31' + INTERVAL '1 DAY') 
 GROUP BY period 
 ORDER BY period
) AS o ON p.period = o.period
LEFT OUTER JOIN( 
 SELECT 
  SUM(b.body_size) AS value, 
  b.body_time::date AS period 
 FROM body AS b 
 LEFT JOIN 
  envelope e ON e.message_id = b.message_id 
 WHERE 
  e.envelope_command = 2 
  AND b.body_time BETWEEN '2009-10-01' 
  AND (date '2009-10-31' + INTERVAL '1 DAY') 
 GROUP BY period 
 ORDER BY period
) AS i ON p.period = i.period 

The EXPLAIN ANALYZE output can be found on explain.depesz.com

Any comments or questions are appreciated.

Cheers

Accepted answer by salathe

Building on Craig Young's suggestions, here is the amended query, which runs in ~1.8 seconds for the data set I'm working on. That is a slight improvement on the original ~2.0s, and a huge improvement on Craig's version, which took ~22s.

SELECT
    p.period,
    /* The pivot technique... */
    SUM(CASE envelope_command WHEN 1 THEN body_size ELSE 0 END) AS Outbound,
    SUM(CASE envelope_command WHEN 2 THEN body_size ELSE 0 END) AS Inbound
FROM
(
    /* Get days range */
    SELECT date '2009-10-01' + day AS period
    FROM generate_series(0, date '2009-10-31' - date '2009-10-01') AS day
) p
    /* Join message information */
    LEFT OUTER JOIN
    (
        SELECT b.body_size, b.body_time::date, e.envelope_command
        FROM body AS b 
            INNER JOIN envelope e ON e.message_id = b.message_id 
        WHERE
            e.envelope_command IN (2, 1)
            AND b.body_time::date BETWEEN (date '2009-10-01') AND (date '2009-10-31')
    ) d ON d.body_time = p.period
GROUP BY p.period
ORDER BY p.period

Answered by Disillusioned

There are always 2 things to consider when optimising queries:

  • What indexes can be used (you may need to create indexes)
  • How the query is written (you may need to change the query to allow the query optimiser to find appropriate indexes, and to avoid re-reading data redundantly)
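For the first point, the question doesn't show the table definitions, so as a sketch only: indexes that might help this particular query (names are hypothetical) could look like this.

```sql
-- Hypothetical indexes; adjust to the real table definitions.
-- Supports the range filter on body_time:
CREATE INDEX idx_body_body_time ON body (body_time);

-- Supports the join to envelope plus the envelope_command filter:
CREATE INDEX idx_envelope_message_command
    ON envelope (message_id, envelope_command);
```

Whether the planner actually uses them depends on the table statistics; comparing EXPLAIN ANALYZE output before and after would confirm.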

A few observations:

  • You are performing date manipulations before you join your dates. As a general rule this prevents the query optimiser from using an index even if one exists. You should write your expressions so that indexed columns appear unaltered on one side of the comparison.

  • Your subqueries are filtering to the same date range as generate_series. This is a duplication, and it limits the optimiser's ability to choose the most efficient plan. I suspect that may have been written in to improve performance because the optimiser was unable to use an index on the date column (body_time)?

  • NOTE: We would actually very much like to use an index on Body.body_time.

  • ORDER BY within the subqueries is at best redundant. At worst it forces the query optimiser to sort the result set before joining, which is not necessarily good for the query plan. Apply ordering only once, right at the end, for final display.

  • Use of LEFT JOIN in your subqueries is inappropriate. Assuming you're using ANSI conventions for NULL behaviour (and you should be), any outer joins to envelope would return envelope_command = NULL, and those rows would consequently be excluded by the condition envelope_command = ?.

  • Subqueries o and i are almost identical save for the envelope_command value. This forces the optimiser to scan the same underlying tables twice. You can use a pivot-table technique to join to the data once, and split the values into two columns.

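As a small illustration of the first observation (a sketch, not taken from the answers): a predicate that manipulates the column defeats an index on body_time, while an equivalent range predicate on the bare column can use it.

```sql
-- Likely cannot use an index on body_time (the column is altered):
WHERE b.body_time::date = date '2009-10-15'

-- Index-friendly equivalent (column untouched, range on one side):
WHERE b.body_time >= date '2009-10-15'
  AND b.body_time <  date '2009-10-15' + INTERVAL '1 DAY'
```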
Try the following which uses the pivot technique:

SELECT  p.period,
        /*The pivot technique in action...*/
        SUM(
        CASE WHEN envelope_command = 1 THEN body_size
        ELSE 0
        END) AS Outbound,
        SUM(
        CASE WHEN envelope_command = 2 THEN body_size
        ELSE 0
        END) AS Inbound
FROM    (
        SELECT  date '2009-10-01' + s.day AS period
        FROM    generate_series(0, date '2009-10-31' - date '2009-10-01') AS s(day)
        ) AS p 
        /*The left JOIN is justified to ensure ALL generated dates are returned
          Also: it joins to a subquery, else the JOIN to envelope _could_ exclude some generated dates*/
        LEFT OUTER JOIN (
        SELECT  b.body_size,
                b.body_time,
                e.envelope_command
        FROM    body AS b 
                INNER JOIN envelope e 
                  ON e.message_id = b.message_id 
        WHERE   envelope_command IN (1, 2)
        ) d
          /*The expressions below allow the optimser to use an index on body_time if 
            the statistics indicate it would be beneficial*/
          ON d.body_time >= p.period
         AND d.body_time < p.period + INTERVAL '1 DAY'
GROUP BY p.Period
ORDER BY p.Period

EDIT: Added filter suggested by Tom H.

Answered by Tom H

I uninstalled my PostgreSQL server a couple of days ago, so you'll likely have to play around with this, but hopefully it's a good start for you.

The keys are:

  1. You shouldn't need the subqueries - just do the direct joins and aggregate
  2. You should be able to use INNER JOINs, which are typically more performant than OUTER JOINs

If nothing else, I think that the query below is a bit clearer.

I used a calendar table in my query, but you can replace that with the generate_series as you were using it.

Also, depending on indexing, it might be better to compare the body_date with >= and < rather than pulling out the date part and comparing. I don't know enough about PostgreSQL to know how it works behind the scenes, so I would try both approaches to see which the server can optimize better. In pseudo-code you would be doing: body_date >= date (time=midnight) AND body_date < date + 1 (time=midnight).

SELECT
    CAL.calendar_date AS period,
    SUM(OB.body_size) AS outbound,
    SUM(IB.body_size) AS inbound
FROM
    Calendar CAL
INNER JOIN Body OB ON
    OB.body_time::date = CAL.calendar_date
INNER JOIN Envelope OE ON
    OE.message_id = OB.message_id AND
    OE.envelope_command = 1
INNER JOIN Body IB ON
    IB.body_time::date = CAL.calendar_date
INNER JOIN Envelope IE ON
    IE.message_id = IB.message_id AND
    IE.envelope_command = 2
GROUP BY
    CAL.calendar_date
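
The >=/< variant Tom describes above would replace the ::date casts in the joins. A sketch only (same assumed tables, untested), shown for the outbound join:

```sql
-- Range predicate instead of casting body_time to date,
-- so an index on body_time remains usable:
INNER JOIN Body OB ON
    OB.body_time >= CAL.calendar_date AND
    OB.body_time <  CAL.calendar_date + INTERVAL '1 DAY'
```

The same rewrite would apply to the inbound join on IB.body_time.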