SQL: Optimizing multiple joins

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA terms and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/2086894/

Date: 2020-09-01 05:07:16  Source: igfitidea

Optimizing multiple joins

Tags: sql, postgresql, optimization, join

Asked by salathe

I'm trying to figure out a way to speed up a particularly cumbersome query which aggregates some data by date across a couple of tables. The full (ugly) query is below along with an EXPLAIN ANALYZE to show just how horrible it is.

If anyone could take a peek and see if they can spot any major issues (which is likely, I'm not a Postgres guy) that would be superb.

So here goes. The query is:

SELECT 
 to_char(p.period, 'DD/MM/YY') as period,
 coalesce(o.value, 0) AS outbound,
 coalesce(i.value, 0) AS inbound
FROM (
 SELECT
  date '2009-10-01' + s.day 
  AS period 
  FROM generate_series(0, date '2009-10-31' - date '2009-10-01') AS s(day)
) AS p 
LEFT OUTER JOIN(
 SELECT
  SUM(b.body_size) AS value, 
  b.body_time::date AS period 
 FROM body AS b 
 LEFT JOIN 
  envelope e ON e.message_id = b.message_id 
 WHERE 
  e.envelope_command = 1 
  AND b.body_time BETWEEN '2009-10-01' 
  AND (date '2009-10-31' + INTERVAL '1 DAY') 
 GROUP BY period 
 ORDER BY period
) AS o ON p.period = o.period
LEFT OUTER JOIN( 
 SELECT 
  SUM(b.body_size) AS value, 
  b.body_time::date AS period 
 FROM body AS b 
 LEFT JOIN 
  envelope e ON e.message_id = b.message_id 
 WHERE 
  e.envelope_command = 2 
  AND b.body_time BETWEEN '2009-10-01' 
  AND (date '2009-10-31' + INTERVAL '1 DAY') 
 GROUP BY period 
 ORDER BY period
) AS i ON p.period = i.period 

The EXPLAIN ANALYZE output can be found on explain.depesz.com

Any comments or questions are appreciated.

Cheers

Accepted answer by salathe

Building on Craig Young's suggestions, here is the amended query, which runs in ~1.8 seconds for the data set I'm working on. That is a slight improvement on the original ~2.0s, and a huge improvement on Craig's version, which took ~22s.

SELECT
    p.period,
    /* The pivot technique... */
    SUM(CASE envelope_command WHEN 1 THEN body_size ELSE 0 END) AS Outbound,
    SUM(CASE envelope_command WHEN 2 THEN body_size ELSE 0 END) AS Inbound
FROM
(
    /* Get days range */
    SELECT date '2009-10-01' + day AS period
    FROM generate_series(0, date '2009-10-31' - date '2009-10-01') AS day
) p
    /* Join message information */
    LEFT OUTER JOIN
    (
        SELECT b.body_size, b.body_time::date, e.envelope_command
        FROM body AS b 
            INNER JOIN envelope e ON e.message_id = b.message_id 
        WHERE
            e.envelope_command IN (2, 1)
            AND b.body_time::date BETWEEN (date '2009-10-01') AND (date '2009-10-31')
    ) d ON d.body_time = p.period
GROUP BY p.period
ORDER BY p.period

Answered by Disillusioned

There are always 2 things to consider when optimising queries:

  • What indexes can be used (you may need to create indexes)
  • How the query is written (you may need to change the query to allow the query optimiser to find appropriate indexes, and to avoid re-reading data redundantly)
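For the first point, the question doesn't show the table definitions, so as a sketch only: indexes that might help this particular query (names are hypothetical) could look like this.

```sql
-- Hypothetical indexes; adjust to the real table definitions.
-- Supports the range filter on body_time:
CREATE INDEX idx_body_body_time ON body (body_time);

-- Supports the join to envelope plus the envelope_command filter:
CREATE INDEX idx_envelope_message_command
    ON envelope (message_id, envelope_command);
```

Whether the planner actually uses them depends on the table statistics; comparing EXPLAIN ANALYZE output before and after would confirm.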

A few observations:

  • You are performing date manipulations before you join your dates. As a general rule this prevents the query optimiser from using an index even if one exists. You should write your expressions so that indexed columns appear unaltered on one side of the comparison.

  • Your subqueries are filtering to the same date range as generate_series. This is a duplication, and it limits the optimiser's ability to choose the most efficient plan. I suspect that may have been written in to improve performance because the optimiser was unable to use an index on the date column (body_time)?

  • NOTE: We would actually very much like to use an index on Body.body_time.

  • ORDER BY within the subqueries is at best redundant. At worst it forces the query optimiser to sort the result set before joining, which is not necessarily good for the query plan. Apply ordering only once, right at the end, for final display.

  • Use of LEFT JOIN in your subqueries is inappropriate. Assuming you're using ANSI conventions for NULL behaviour (and you should be), any outer joins to envelope would return envelope_command = NULL, and those rows would consequently be excluded by the condition envelope_command = ?.

  • Subqueries o and i are almost identical save for the envelope_command value. This forces the optimiser to scan the same underlying tables twice. You can use a pivot-table technique to join to the data once, and split the values into two columns.

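As a small illustration of the first observation (a sketch, not taken from the answers): a predicate that manipulates the column defeats an index on body_time, while an equivalent range predicate on the bare column can use it.

```sql
-- Likely cannot use an index on body_time (the column is altered):
WHERE b.body_time::date = date '2009-10-15'

-- Index-friendly equivalent (column untouched, range on one side):
WHERE b.body_time >= date '2009-10-15'
  AND b.body_time <  date '2009-10-15' + INTERVAL '1 DAY'
```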
Try the following which uses the pivot technique:

SELECT  p.period,
        /*The pivot technique in action...*/
        SUM(
        CASE WHEN envelope_command = 1 THEN body_size
        ELSE 0
        END) AS Outbound,
        SUM(
        CASE WHEN envelope_command = 2 THEN body_size
        ELSE 0
        END) AS Inbound
FROM    (
        SELECT  date '2009-10-01' + s.day AS period
        FROM    generate_series(0, date '2009-10-31' - date '2009-10-01') AS s(day)
        ) AS p 
        /*The left JOIN is justified to ensure ALL generated dates are returned
          Also: it joins to a subquery, else the JOIN to envelope _could_ exclude some generated dates*/
        LEFT OUTER JOIN (
        SELECT  b.body_size,
                b.body_time,
                e.envelope_command
        FROM    body AS b 
                INNER JOIN envelope e 
                  ON e.message_id = b.message_id 
        WHERE   envelope_command IN (1, 2)
        ) d
          /*The expressions below allow the optimser to use an index on body_time if 
            the statistics indicate it would be beneficial*/
          ON d.body_time >= p.period
         AND d.body_time < p.period + INTERVAL '1 DAY'
GROUP BY p.Period
ORDER BY p.Period

EDIT: Added filter suggested by Tom H.

Answered by Tom H

I uninstalled my PostgreSQL server a couple of days ago, so you'll likely have to play around with this, but hopefully it's a good start for you.

The keys are:

  1. You shouldn't need the subqueries - just do the direct joins and aggregate
  2. You should be able to use INNER JOINs, which are typically more performant than OUTER JOINs

If nothing else, I think that the query below is a bit clearer.

I used a calendar table in my query, but you can replace that with the generate_series as you were using it.

Also, depending on indexing, it might be better to compare the body_date with >= and < rather than pulling out the date part and comparing. I don't know enough about PostgreSQL to know how it works behind the scenes, so I would try both approaches to see which the server can optimize better. In pseudo-code you would be doing: body_date >= date (time=midnight) AND body_date < date + 1 (time=midnight).

SELECT
    CAL.calendar_date AS period,
    SUM(OB.body_size) AS outbound,
    SUM(IB.body_size) AS inbound
FROM
    Calendar CAL
INNER JOIN Body OB ON
    OB.body_time::date = CAL.calendar_date
INNER JOIN Envelope OE ON
    OE.message_id = OB.message_id AND
    OE.envelope_command = 1
INNER JOIN Body IB ON
    IB.body_time::date = CAL.calendar_date
INNER JOIN Envelope IE ON
    IE.message_id = IB.message_id AND
    IE.envelope_command = 2
GROUP BY
    CAL.calendar_date
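
The >=/< variant Tom describes above would replace the ::date casts in the joins. A sketch only (same assumed tables, untested), shown for the outbound join:

```sql
-- Range predicate instead of casting body_time to date,
-- so an index on body_time remains usable:
INNER JOIN Body OB ON
    OB.body_time >= CAL.calendar_date AND
    OB.body_time <  CAL.calendar_date + INTERVAL '1 DAY'
```

The same rewrite would apply to the inbound join on IB.body_time.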