SQL 优化 GROUP BY 查询以检索每个用户的最新行

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/25536422/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-01 02:32:24  来源:igfitidea点击:

Optimize GROUP BY query to retrieve latest row per user

sqlpostgresqlindexinggreatest-n-per-grouppostgresql-performance

提问by xpapad

I have the following log table for user messages (simplified form) in Postgres 9.2:

我在 Postgres 9.2 中有以下用户消息日志表(简化形式):

CREATE TABLE log (
    log_date DATE,
    user_id  INTEGER,
    payload  INTEGER
);

It contains up to one record per user and per day. There will be approximately 500K records per day for 300 days. payload is ever increasing for each user (if that matters).

每个用户每天最多包含一条记录。在 300 天内每天将有大约 50 万条记录。每个用户的有效负载不断增加(如果这很重要)。

I want to efficiently retrieve the latest record for each user before a specific date. My query is:

我想在特定日期之前有效地检索每个用户的最新记录。我的查询是:

SELECT user_id, max(log_date), max(payload) 
FROM log 
WHERE log_date <= :mydate 
GROUP BY user_id

which is extremely slow. I have also tried:

这是非常缓慢的。我也试过:

SELECT DISTINCT ON(user_id), log_date, payload
FROM log
WHERE log_date <= :mydate
ORDER BY user_id, log_date DESC;

which has the same plan and is equally slow.

它具有相同的计划并且同样缓慢。

So far I have a single index on log(log_date), but doesn't help much.

到目前为止,我在 上只有一个索引log(log_date),但没有多大帮助。

And I have a userstable with all users included. I also want to retrieve the result for some some users (those with payload > :value).

我有一张users包含所有用户的表。我还想为某些用户(带有 的用户payload > :value)检索结果。

Is there any other index I should use to speed this up, or any other way to achieve what I want?

我应该使用任何其他索引来加快速度,或者任何其他方式来实现我想要的?

回答by Erwin Brandstetter

For best read performance you need a multicolumn index:

为了获得最佳读取性能,您需要一个多列索引

CREATE INDEX log_combo_idx
ON log (user_id, log_date DESC NULLS LAST);

To make index only scanspossible, add the otherwise not needed column payloadin a covering indexwith the INCLUDEclause (Postgres 11 or later):

要使仅索引扫描成为可能,请使用以下子句(Postgres 11 或更高版本)payload覆盖索引中添加原本不需要的列INCLUDE

CREATE INDEX log_combo_covering_idx
ON log (user_id, log_date DESC NULLS LAST) INCLUDE (payload);

See:

看:

Fallback for older versions:

旧版本的回退:

CREATE INDEX log_combo_covering_idx
ON log (user_id, log_date DESC NULLS LAST, payload);

Why DESC NULLS LAST?

为什么DESC NULLS LAST

For fewrows per user_idor small tables DISTINCT ONis typically fastest and simplest:

对于每个或小表的行通常是最快和最简单的:user_idDISTINCT ON

For manyrows per user_idan index skip scan(or loose index scan)is (much) more efficient. That's not implemented up to Postgres 12 - work is ongoing for Postgres 13. But there are ways to emulate it efficiently.

对于许多每行user_id索引跳跃扫描(或松散索引扫描是(多)更有效。这在 Postgres 12 之前还没有实现 - Postgres 13 的工作正在进行中。但是有一些方法可以有效地模拟它。

Common Table Expressionsrequire Postgres 8.4+.
LATERALrequires Postgres 9.3+.
The following solutions go beyond what's covered in the Postgres Wiki.

通用表表达式需要 Postgres 8.4+
LATERAL需要 Postgres 9.3+
以下解决方案超出了Postgres Wiki 中涵盖的内容。

1. No separate table with unique users

1. 没有唯一用户的单独表

With a separate userstable, solutions in 2.below are typically simpler and faster. Skip ahead.

使用单独的users表格,下面2.中的解决方案通常更简单、更快。跳过。

1a. Recursive CTE with LATERALjoin

1a. 带LATERAL连接的递归 CTE

WITH RECURSIVE cte AS (
   (                                -- parentheses required
   SELECT user_id, log_date, payload
   FROM   log
   WHERE  log_date <= :mydate
   ORDER  BY user_id, log_date DESC NULLS LAST
   LIMIT  1
   )
   UNION ALL
   SELECT l.*
   FROM   cte c
   CROSS  JOIN LATERAL (
      SELECT l.user_id, l.log_date, l.payload
      FROM   log l
      WHERE  l.user_id > c.user_id  -- lateral reference
      AND    log_date <= :mydate    -- repeat condition
      ORDER  BY l.user_id, l.log_date DESC NULLS LAST
      LIMIT  1
      ) l
   )
TABLE  cte
ORDER  BY user_id;

This is simple to retrieve arbitrary columns and probably best in current Postgres. More explanation in chapter 2a.below.

这很容易检索任意列,并且在当前 Postgres 中可能是最好的。在第2a章中有更多解释以下。

1b. Recursive CTE with correlated subquery

1b. 具有相关子查询的递归 CTE

WITH RECURSIVE cte AS (
   (                                           -- parentheses required
   SELECT l AS my_row                          -- whole row
   FROM   log l
   WHERE  log_date <= :mydate
   ORDER  BY user_id, log_date DESC NULLS LAST
   LIMIT  1
   )
   UNION ALL
   SELECT (SELECT l                            -- whole row
           FROM   log l
           WHERE  l.user_id > (c.my_row).user_id
           AND    l.log_date <= :mydate        -- repeat condition
           ORDER  BY l.user_id, l.log_date DESC NULLS LAST
           LIMIT  1)
   FROM   cte c
   WHERE  (c.my_row).user_id IS NOT NULL       -- note parentheses
   )
SELECT (my_row).*                              -- decompose row
FROM   cte
WHERE  (my_row).user_id IS NOT NULL
ORDER  BY (my_row).user_id;

Convenient to retrieve a single columnor the whole row. The example uses the whole row type of the table. Other variants are possible.

方便检索单列整行。该示例使用表的整个行类型。其他变体也是可能的。

To assert a row was found in the previous iteration, test a single NOT NULL column (like the primary key).

要断言在前一次迭代中找到一行,请测试单个 NOT NULL 列(如主键)。

More explanation for this query in chapter 2b. below.

在第 2b 章中对该查询的更多解释。以下。

Related:

有关的:

2. With separate userstable

2. 有单独的users桌子

Table layout hardly matters as long as exactly one row per relevant user_idis guaranteed. Example:

只要user_id保证每个相关只有一行,表格布局就几乎无关紧要。例子:

CREATE TABLE users (
   user_id  serial PRIMARY KEY
 , username text NOT NULL
);

Ideally, the table is physically sorted in sync with the logtable. See:

理想情况下,表格的物理排序与log表格同步。看:

Or it's small enough (low cardinality) that it hardly matters. Else, sorting rows in the query can help to further optimize performance. See Gang Liang's addition.If the physical sort order of the userstable happens to match the index on log, this may be irrelevant.

或者它足够小(低基数)以至于它几乎不重要。否则,对查询中的行进行排序有助于进一步优化性能。见钢梁的补充。如果表的物理排序顺序users恰好与 上的索引匹配log,则这可能无关紧要。

2a. LATERALjoin

2a. LATERAL加入

SELECT u.user_id, l.log_date, l.payload
FROM   users u
CROSS  JOIN LATERAL (
   SELECT l.log_date, l.payload
   FROM   log l
   WHERE  l.user_id = u.user_id         -- lateral reference
   AND    l.log_date <= :mydate
   ORDER  BY l.log_date DESC NULLS LAST
   LIMIT  1
   ) l;

JOIN LATERALallows to reference preceding FROMitems on the same query level. See:

JOIN LATERAL允许FROM在同一查询级别上引用前面的项目。看:

Results in one index (-only) look-up per user.

导致每个用户只查找一个索引(-only)。

Returns no row for users missing in the userstable. Typically, a foreign keyconstraint enforcing referential integrity would rule that out.

对于users表中缺少的用户,不返回任何行。通常,强制参照完整性的外键约束会排除这种情况。

Also, no row for users without matching entry in log- conforming to the original question. To keep those users in the result use LEFT JOIN LATERAL ... ON trueinstead of CROSS JOIN LATERAL:

此外,没有匹配条目的用户没有行log- 符合原始问题。要将这些用户保留在结果中,请使用LEFT JOIN LATERAL ... ON true而不是CROSS JOIN LATERAL

Use LIMIT ninstead of LIMIT 1to retrieve more than one rows(but not all) per user.

使用LIMIT n而不是为每个用户LIMIT 1检索多行(但不是全部)。

Effectively, all of these do the same:

实际上,所有这些都做同样的事情:

JOIN LATERAL ... ON true
CROSS JOIN LATERAL ...
, LATERAL ...

The last one has lower priority, though. Explicit JOINbinds before comma. That subtle difference can matters with more join tables. See:

不过,最后一个优先级较低。JOIN逗号前的显式绑定。对于更多的连接表,这种细微的差异可能很重要。看:

2b. Correlated subquery

2b. 相关子查询

Good choice to retrieve a single columnfrom a single row. Code example:

单行检索单列的好选择。代码示例:

The same is possible for multiple columns, but you need more smarts:

这同样是可能的多列,但你需要更多的智慧:

CREATE TEMP TABLE combo (log_date date, payload int);

SELECT user_id, (combo1).*              -- note parentheses
FROM (
   SELECT u.user_id
        , (SELECT (l.log_date, l.payload)::combo
           FROM   log l
           WHERE  l.user_id = u.user_id
           AND    l.log_date <= :mydate
           ORDER  BY l.log_date DESC NULLS LAST
           LIMIT  1) AS combo1
   FROM   users u
   ) sub;
  • Like LEFT JOIN LATERALabove, this variant includes allusers, even without entries in log. You get NULLfor combo1, which you can easily filter with a WHEREclause in the outer query if need be.
    Nitpick: in the outer query you can't distinguish whether the subquery didn't find a row or all column values happen to be NULL - same result. You need a NOT NULLcolumn in the subquery to avoid this ambiguity.

  • A correlated subquery can only return a single value. You can wrap multiple columns into a composite type. But to decompose it later, Postgres demands a well-known composite type. Anonymous records can only be decomposed providing a column definition list.
    Use a registered type like the row type of an existing table. Or register a composite type explicitly (and permanently) with CREATE TYPE. Or create a temporary table (dropped automatically at end of session) to register its row type temporarily. Cast syntax: (log_date, payload)::combo

  • Finally, we do not want to decompose combo1on the same query level. Due to a weakness in the query planner this would evaluate the subquery once for each column (still true in Postgres 12). Instead, make it a subquery and decompose in the outer query.

  • LEFT JOIN LATERAL上面一样,这个变体包括所有用户,即使没有log. 您得到NULLfor combo1WHERE如果需要,您可以使用外部查询中的子句轻松过滤。
    挑剔:在外部查询中,您无法区分子查询是否未找到行或所有列值碰巧为 NULL - 结果相同。您需要NOT NULL在子查询中有一列以避免这种歧义。

  • 相关子查询只能返回一个。您可以将多个列包装成一个复合类型。但是为了稍后分解它,Postgres 需要一个众所周知的复合类型。匿名记录只能分解提供列定义列表。
    使用注册类型,如现有表的行类型。或者使用CREATE TYPE. 或者创建一个临时表(在会话结束时自动删除)来临时注册其行类型。投射语法:(log_date, payload)::combo

  • 最后,我们不想combo1在同一查询级别上进行分解。由于查询规划器的弱点,这将对每列评估一次子查询(在 Postgres 12 中仍然如此)。相反,将其设为子查询并在外部查询中分解。

Related:

有关的:

Demonstrating all 4 queries with 100k log entries and 1k users:
db<>fiddle here- pg 11
Old sqlfiddle- pg 9.6

使用 100k 日志条目和 1k 用户演示所有 4 个查询:
db<>fiddle here- pg 11
Old sqlfiddle- pg 9.6

回答by Gang Liang

This is not a standalone answer but rather a comment to @Erwin's answer. For 2a, the lateral join example, the query can be improved by sorting the userstable to exploit the locality of the index on log.

这不是一个独立的答案,而是对@Erwin 的回答的评论。对于 2a,横向连接示例,可以通过对users表进行排序以利用 上索引的局部性来改进查询log

SELECT u.user_id, l.log_date, l.payload
  FROM (SELECT user_id FROM users ORDER BY user_id) u,
       LATERAL (SELECT log_date, payload
                  FROM log
                 WHERE user_id = u.user_id -- lateral reference
                   AND log_date <= :mydate
              ORDER BY log_date DESC NULLS LAST
                 LIMIT 1) l;

The rationale is that index lookup is expensive if user_idvalues are random. By sorting out user_idfirst, the subsequent lateral join would be like a simple scan on the index of log. Even though both query plans look alike, the running time would differ much especially for large tables.

基本原理是如果user_id值是随机的,则索引查找会很昂贵。通过先排序user_id,随后的横向连接就像对 的索引进行简单扫描log。即使两个查询计划看起来相似,运行时间也会有很大不同,尤其是对于大表。

The cost of the sorting is minimal especially if there is an index on the user_idfield.

排序的成本是最小的,尤其是在user_id字段上有索引的情况下。

回答by Gordon Linoff

Perhaps a different index on the table would help. Try this one: log(user_id, log_date). I am not positive that Postgres will make optimal use with distinct on.

也许桌子上的不同索引会有所帮助。试试这个: log(user_id, log_date)。我并不肯定 Postgres 将最佳地使用distinct on.

So, I would stick with that index and try this version:

所以,我会坚持使用该索引并尝试这个版本:

select *
from log l
where not exists (select 1
                  from log l2
                  where l2.user_id = l.user_id and
                        l2.log_date <= :mydate and
                        l2.log_date > l.log_date
                 );

This should replace the sorting/grouping with index look ups. It might be faster.

这应该用索引查找替换排序/分组。它可能会更快。