MySQL 获取每组最高/最小 <whatever> 的记录

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/8748986/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-31 22:15:43  来源:igfitidea点击:

Get records with highest/smallest <whatever> per group

mysqlsubquerygreatest-n-per-grouprank

提问by TMS

How to do that?

怎么做?

Former title of this question was "using rank (@Rank := @Rank + 1) in complex query with subqueries - will it work?" because I was looking for solution using ranks, but now I see that the solution posted by Bill is much much better.

这个问题的前标题是“在带有子查询的复杂查询中使用 rank (@Rank := @Rank + 1) - 它会起作用吗?”因为我正在寻找使用 rank 的解决方案,但现在我看到 Bill 发布的解决方案是好多了。

Original question:

原问题:

I'm trying to compose a query that would take last record from each group given some defined order:

我正在尝试编写一个查询,该查询将从给定一些定义顺序的每个组中获取最后一条记录:

SET @Rank=0;

select s.*
from (select GroupId, max(Rank) AS MaxRank
      from (select GroupId, @Rank := @Rank + 1 AS Rank 
            from Table
            order by OrderField
            ) as t
      group by GroupId) as t 
  join (
      select *, @Rank := @Rank + 1 AS Rank
      from Table
      order by OrderField
      ) as s 
  on t.GroupId = s.GroupId and t.MaxRank = s.Rank
order by OrderField

Expression @Rank := @Rank + 1is normally used for rank, but for me it looks suspicious when used in 2 subqueries, but initialized only once. Will it work this way?

表达式@Rank := @Rank + 1通常用于排名,但对我来说,在 2 个子查询中使用时看起来很可疑,但只初始化一次。会这样吗?

And second, will it work with one subquery that is evaluated multiple times? Like subquery in where (or having) clause (another way how to write the above):

其次,它是否适用于一个被多次评估的子查询?就像 where (or have) 子句中的子查询一样(另一种写法):

SET @Rank=0;

select Table.*, @Rank := @Rank + 1 AS Rank
from Table
having Rank = (select max(Rank) AS MaxRank
              from (select GroupId, @Rank := @Rank + 1 AS Rank 
                    from Table as t0
                    order by OrderField
                    ) as t
              where t.GroupId = table.GroupId
             )
order by OrderField

Thanks in advance!

提前致谢!

回答by Bill Karwin

So you want to get the row with the highest OrderFieldper group? I'd do it this way:

所以你想获得OrderField每组最高的行?我会这样做:

SELECT t1.*
FROM `Table` AS t1
LEFT OUTER JOIN `Table` AS t2
  ON t1.GroupId = t2.GroupId AND t1.OrderField < t2.OrderField
WHERE t2.GroupId IS NULL
ORDER BY t1.OrderField; // not needed! (note by Tomas)

(EDIT by Tomas:If there are more records with the same OrderField within the same group and you need exactly one of them, you may want to extend the condition:

Tomas 编辑:如果同一组中有更多具有相同 OrderField 的记录并且您恰好需要其中之一,您可能需要扩展条件:

SELECT t1.*
FROM `Table` AS t1
LEFT OUTER JOIN `Table` AS t2
  ON t1.GroupId = t2.GroupId 
        AND (t1.OrderField < t2.OrderField 
         OR (t1.OrderField = t2.OrderField AND t1.Id < t2.Id))
WHERE t2.GroupId IS NULL

end of edit.)

编辑结束。)

In other words, return the row t1for which no other row t2exists with the same GroupIdand a greater OrderField. When t2.*is NULL, it means the left outer join found no such match, and therefore t1has the greatest value of OrderFieldin the group.

换句话说,返回t1t2存在具有相同GroupId和更大的其他行的行OrderField。当t2.*为 NULL 时,表示左外连接没有找到这样的匹配,因此在组中t1具有最大值OrderField

No ranks, no subqueries. This should run fast and optimize access to t2 with "Using index" if you have a compound index on (GroupId, OrderField).

没有等级,没有子查询。如果您在(GroupId, OrderField).



Regarding performance, see my answer to Retrieving the last record in each group. I tried a subquery method and the join method using the Stack Overflow data dump. The difference is remarkable: the join method ran 278 times faster in my test.

关于性能,请参阅我对Retrieving the last record in each group 的回答。我尝试了使用 Stack Overflow 数据转储的子查询方法和连接方法。区别很明显:在我的测试中,join 方法的运行速度快了 278 倍。

It's important that you have the right index to get the best results!

拥有正确的索引以获得最佳结果非常重要!

Regarding your method using the @Rank variable, it won't work as you've written it, because the values of @Rank won't reset to zero after the query has processed the first table. I'll show you an example.

关于您使用@Rank 变量的方法,它不会像您编写的那样工作,因为在查询处理第一个表后@Rank 的值不会重置为零。我给你看一个例子。

I inserted some dummy data, with an extra field that is null except on the row we know is the greatest per group:

我插入了一些虚拟数据,除了我们知道每组最大的行之外,还有一个额外的字段为空:

select * from `Table`;

+---------+------------+------+
| GroupId | OrderField | foo  |
+---------+------------+------+
|      10 |         10 | NULL |
|      10 |         20 | NULL |
|      10 |         30 | foo  |
|      20 |         40 | NULL |
|      20 |         50 | NULL |
|      20 |         60 | foo  |
+---------+------------+------+

We can show that the rank increases to three for the first group and six for the second group, and the inner query returns these correctly:

我们可以证明第一组的排名增加到 3,第二组的排名增加到 6,并且内部查询正确返回这些:

select GroupId, max(Rank) AS MaxRank
from (
  select GroupId, @Rank := @Rank + 1 AS Rank
  from `Table`
  order by OrderField) as t
group by GroupId

+---------+---------+
| GroupId | MaxRank |
+---------+---------+
|      10 |       3 |
|      20 |       6 |
+---------+---------+

Now run the query with no join condition, to force a Cartesian product of all rows, and we also fetch all columns:

现在运行没有连接条件的查询,强制所有行的笛卡尔积,我们还获取所有列:

select s.*, t.*
from (select GroupId, max(Rank) AS MaxRank
      from (select GroupId, @Rank := @Rank + 1 AS Rank 
            from `Table`
            order by OrderField
            ) as t
      group by GroupId) as t 
  join (
      select *, @Rank := @Rank + 1 AS Rank
      from `Table`
      order by OrderField
      ) as s 
  -- on t.GroupId = s.GroupId and t.MaxRank = s.Rank
order by OrderField;

+---------+---------+---------+------------+------+------+
| GroupId | MaxRank | GroupId | OrderField | foo  | Rank |
+---------+---------+---------+------------+------+------+
|      10 |       3 |      10 |         10 | NULL |    7 |
|      20 |       6 |      10 |         10 | NULL |    7 |
|      10 |       3 |      10 |         20 | NULL |    8 |
|      20 |       6 |      10 |         20 | NULL |    8 |
|      20 |       6 |      10 |         30 | foo  |    9 |
|      10 |       3 |      10 |         30 | foo  |    9 |
|      10 |       3 |      20 |         40 | NULL |   10 |
|      20 |       6 |      20 |         40 | NULL |   10 |
|      10 |       3 |      20 |         50 | NULL |   11 |
|      20 |       6 |      20 |         50 | NULL |   11 |
|      20 |       6 |      20 |         60 | foo  |   12 |
|      10 |       3 |      20 |         60 | foo  |   12 |
+---------+---------+---------+------------+------+------+

We can see from the above that the max rank per group is correct, but then the @Rank continues to increase as it processes the second derived table, to 7 and on higher. So the ranks from the second derived table will never overlap with the ranks from the first derived table at all.

从上面我们可以看到,每组的最大排名是正确的,但是@Rank 在处理第二个派生表时继续增加,达到 7 及更高。因此,第二个派生表中的排名永远不会与第一个派生表中的排名重叠。

You'd have to add another derived table to force @Rank to reset to zero in between processing the two tables (and hope the optimizer doesn't change the order in which it evaluates tables, or else use STRAIGHT_JOIN to prevent that):

您必须添加另一个派生表以强制 @Rank 在处理两个表之间重置为零(并希望优化器不会更改它评估表的顺序,否则使用 STRAIGHT_JOIN 来防止这种情况):

select s.*
from (select GroupId, max(Rank) AS MaxRank
      from (select GroupId, @Rank := @Rank + 1 AS Rank 
            from `Table`
            order by OrderField
            ) as t
      group by GroupId) as t 
  join (select @Rank := 0) r -- RESET @Rank TO ZERO HERE
  join (
      select *, @Rank := @Rank + 1 AS Rank
      from `Table`
      order by OrderField
      ) as s 
  on t.GroupId = s.GroupId and t.MaxRank = s.Rank
order by OrderField;

+---------+------------+------+------+
| GroupId | OrderField | foo  | Rank |
+---------+------------+------+------+
|      10 |         30 | foo  |    3 |
|      20 |         60 | foo  |    6 |
+---------+------------+------+------+

But the optimization of this query is terrible. It can't use any indexes, it creates two temporary tables, sorts them the hard way, and even uses a join buffer because it can't use an index when joining temp tables either. This is example output from EXPLAIN:

但是这个查询的优化很糟糕。它不能使用任何索引,它创建了两个临时表,对它们进行了艰难的排序,甚至使用了连接缓冲区,因为它在连接临时表时也不能使用索引。这是来自的示例输出EXPLAIN

+----+-------------+------------+--------+---------------+------+---------+------+------+---------------------------------+
| id | select_type | table      | type   | possible_keys | key  | key_len | ref  | rows | Extra                           |
+----+-------------+------------+--------+---------------+------+---------+------+------+---------------------------------+
|  1 | PRIMARY     | <derived4> | system | NULL          | NULL | NULL    | NULL |    1 | Using temporary; Using filesort |
|  1 | PRIMARY     | <derived2> | ALL    | NULL          | NULL | NULL    | NULL |    2 |                                 |
|  1 | PRIMARY     | <derived5> | ALL    | NULL          | NULL | NULL    | NULL |    6 | Using where; Using join buffer  |
|  5 | DERIVED     | Table      | ALL    | NULL          | NULL | NULL    | NULL |    6 | Using filesort                  |
|  4 | DERIVED     | NULL       | NULL   | NULL          | NULL | NULL    | NULL | NULL | No tables used                  |
|  2 | DERIVED     | <derived3> | ALL    | NULL          | NULL | NULL    | NULL |    6 | Using temporary; Using filesort |
|  3 | DERIVED     | Table      | ALL    | NULL          | NULL | NULL    | NULL |    6 | Using filesort                  |
+----+-------------+------------+--------+---------------+------+---------+------+------+---------------------------------+

Whereas my solution using the left outer join optimizes much better. It uses no temp table and even reports "Using index"which means it can resolve the join using only the index, without touching the data.

而我使用左外连接的解决方案优化得更好。它不使用临时表,甚至不使用报告"Using index",这意味着它可以仅使用索引解析连接,而无需接触数据。

+----+-------------+-------+------+---------------+---------+---------+-----------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key     | key_len | ref             | rows | Extra                    |
+----+-------------+-------+------+---------------+---------+---------+-----------------+------+--------------------------+
|  1 | SIMPLE      | t1    | ALL  | NULL          | NULL    | NULL    | NULL            |    6 | Using filesort           |
|  1 | SIMPLE      | t2    | ref  | GroupId       | GroupId | 5       | test.t1.GroupId |    1 | Using where; Using index |
+----+-------------+-------+------+---------------+---------+---------+-----------------+------+--------------------------+

You'll probably read people making claims on their blogs that "joins make SQL slow," but that's nonsense. Poor optimization makes SQL slow.

您可能会读到有人在他们的博客上声称“联接使 SQL 变慢”,但这是无稽之谈。糟糕的优化使 SQL 变慢。