Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/7745609/

Time: 2020-08-31 21:19:32  Source: igfitidea

SQL select only rows with max value on a column

mysql sql aggregate-functions greatest-n-per-group groupwise-maximum

Asked by Majid Fouladpour

I have this table for documents (simplified version here):

+------+-------+--------------------------------------+
| id   | rev   | content                              |
+------+-------+--------------------------------------+
| 1    | 1     | ...                                  |
| 2    | 1     | ...                                  |
| 1    | 2     | ...                                  |
| 1    | 3     | ...                                  |
+------+-------+--------------------------------------+

How do I select one row per id and only the greatest rev?
With the above data, the result should contain two rows: [1, 3, ...] and [2, 1, ..]. I'm using MySQL.

Currently I use checks in the while loop to detect and over-write old revs from the resultset. But is this the only method to achieve the result? Isn't there a SQL solution?

Update
As the answers suggest, there is a SQL solution, and here is a sqlfiddle demo.

Update 2
I noticed after adding the above sqlfiddle, the rate at which the question is upvoted has surpassed the upvote rate of the answers. That has not been the intention! The fiddle is based on the answers, especially the accepted answer.

Answer by Adriano Carneiro

At first glance...

All you need is a GROUP BY clause with the MAX aggregate function:

SELECT id, MAX(rev)
FROM YourTable
GROUP BY id

It's never that simple, is it?

I just noticed you need the content column as well.

This is a very common question in SQL: find the whole data for the row with some max value in a column per some group identifier. I heard that a lot during my career. Actually, it was one of the questions I answered in my current job's technical interview.

It is, actually, so common that the StackOverflow community has created a single tag just to deal with questions like that: greatest-n-per-group.

Basically, you have two approaches to solve that problem:

Joining with simple group-identifier, max-value-in-group Sub-query

In this approach, you first find the group-identifier, max-value-in-group (already solved above) in a sub-query. Then you join your table to the sub-query with equality on both group-identifier and max-value-in-group:

SELECT a.id, a.rev, a.content
FROM YourTable a
INNER JOIN (
    SELECT id, MAX(rev) rev
    FROM YourTable
    GROUP BY id
) b ON a.id = b.id AND a.rev = b.rev
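
To see the join-with-subquery approach run end to end, here is a minimal sketch. It assumes Python's built-in sqlite3 module instead of MySQL so that it is self-contained, and the content values 'a' to 'd' are hypothetical stand-ins for the elided "..." in the question.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (id INTEGER, rev INTEGER, content TEXT)")
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?, ?)",
    [(1, 1, "a"), (2, 1, "b"), (1, 2, "c"), (1, 3, "d")],  # hypothetical contents
)

# The sub-query finds (id, max rev); the join pulls the full row for each pair.
rows = conn.execute("""
    SELECT a.id, a.rev, a.content
    FROM YourTable a
    INNER JOIN (
        SELECT id, MAX(rev) AS rev
        FROM YourTable
        GROUP BY id
    ) b ON a.id = b.id AND a.rev = b.rev
    ORDER BY a.id
""").fetchall()

print(rows)  # [(1, 3, 'd'), (2, 1, 'b')]
```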

Left Joining with self, tweaking join conditions and filters

In this approach, you left join the table with itself. Equality goes on the group-identifier. Then, 2 smart moves:

  1. The second join condition requires the left side value to be less than the right side value.
  2. When you do step 1, the row(s) that actually have the max value will have NULL in the right side (it's a LEFT JOIN, remember?). Then, we filter the joined result, showing only the rows where the right side is NULL.

So you end up with:

SELECT a.*
FROM YourTable a
LEFT OUTER JOIN YourTable b
    ON a.id = b.id AND a.rev < b.rev
WHERE b.id IS NULL;
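
The two moves are easier to follow when the joined rows are printed before and after the NULL filter. The sketch below is only an illustration: it assumes Python's sqlite3 in place of MySQL and uses hypothetical content values 'a' to 'd'.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL (assumption)
conn.execute("CREATE TABLE YourTable (id INTEGER, rev INTEGER, content TEXT)")
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?, ?)",
    [(1, 1, "a"), (2, 1, "b"), (1, 2, "c"), (1, 3, "d")],  # hypothetical contents
)

# Step 1: every row is paired with the rows of the same id that have a greater rev.
# Rows that already hold the max rev find no partner, so b.rev comes back as NULL.
joined = conn.execute("""
    SELECT a.id, a.rev, b.rev
    FROM YourTable a
    LEFT OUTER JOIN YourTable b
        ON a.id = b.id AND a.rev < b.rev
    ORDER BY a.id, a.rev, b.rev
""").fetchall()
print(joined)
# [(1, 1, 2), (1, 1, 3), (1, 2, 3), (1, 3, None), (2, 1, None)]

# Step 2: keep only the rows whose right side is NULL.
rows = conn.execute("""
    SELECT a.*
    FROM YourTable a
    LEFT OUTER JOIN YourTable b
        ON a.id = b.id AND a.rev < b.rev
    WHERE b.id IS NULL
    ORDER BY a.id
""").fetchall()
print(rows)  # [(1, 3, 'd'), (2, 1, 'b')]
```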

Conclusion

Both approaches bring the exact same result.

If you have two rows with max-value-in-group for group-identifier, both rows will be in the result in both approaches.

Both approaches are SQL ANSI compatible, thus, will work with your favorite RDBMS, regardless of its "flavor".

Both approaches are also performance friendly, however your mileage may vary (RDBMS, DB Structure, Indexes, etc.). So when you pick one approach over the other, benchmark. And make sure you pick the one which makes the most sense to you.

Answer by Kevin Burton

My preference is to use as little code as possible...

You can do it using IN; try this:

SELECT * 
FROM t1 WHERE (id,rev) IN 
( SELECT id, MAX(rev)
  FROM t1
  GROUP BY id
)

To my mind it is less complicated... easier to read and maintain.
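
As a quick check of the row-value IN form, here is a small sketch. It assumes Python's sqlite3 (SQLite 3.15 or later, which accepts row values) rather than MySQL, and the content values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL (assumption)
conn.execute("CREATE TABLE t1 (id INTEGER, rev INTEGER, content TEXT)")
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?, ?)",
    [(1, 1, "a"), (2, 1, "b"), (1, 2, "c"), (1, 3, "d")],  # hypothetical contents
)

# A row is kept only when its (id, rev) pair matches an (id, MAX(rev)) pair.
rows = conn.execute("""
    SELECT *
    FROM t1 WHERE (id, rev) IN
    ( SELECT id, MAX(rev)
      FROM t1
      GROUP BY id
    )
    ORDER BY id
""").fetchall()

print(rows)  # [(1, 3, 'd'), (2, 1, 'b')]
```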

Answer by topchef

I am flabbergasted that no answer offered a SQL window function solution:

SELECT a.id, a.rev, a.content
  FROM (SELECT id, rev, content,
               -- the alias is rn because RANK is a reserved word in MySQL 8.0
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY rev DESC) AS rn
          FROM YourTable) a
 WHERE a.rn = 1

Added in the ANSI/ISO SQL:2003 standard and later extended in ANSI/ISO SQL:2008, window (or windowing) functions are now available from all major vendors. There are more types of rank functions available to deal with ties: RANK, DENSE_RANK, PERCENT_RANK.
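
The difference between ROW_NUMBER and RANK on ties can be sketched as follows. This is an illustration only: it assumes Python's sqlite3 (SQLite 3.25 or later for window functions) rather than MySQL, and the data, with a deliberate tie on id 1, is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL (assumption)
conn.execute("CREATE TABLE YourTable (id INTEGER, rev INTEGER, content TEXT)")
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?, ?)",
    # hypothetical data: id 1 has two rows tied at rev 3
    [(1, 1, "a"), (1, 3, "c"), (1, 3, "d"), (2, 1, "b")],
)

QUERY = """
    SELECT a.id, a.rev, a.content
    FROM (SELECT id, rev, content,
                 {func}() OVER (PARTITION BY id ORDER BY rev DESC) AS pos
          FROM YourTable) a
    WHERE a.pos = 1
    ORDER BY a.id, a.content
"""

# ROW_NUMBER numbers tied rows 1, 2, ... so exactly one row per id survives
# (which of the tied rows wins is not specified).
row_number_rows = conn.execute(QUERY.format(func="ROW_NUMBER")).fetchall()

# RANK gives tied rows the same rank, so both tied rows survive.
rank_rows = conn.execute(QUERY.format(func="RANK")).fetchall()

print(len(row_number_rows))  # 2
print(rank_rows)             # [(1, 3, 'c'), (1, 3, 'd'), (2, 1, 'b')]
```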

Answer by Vajk Hermecz

Yet another solution is to use a correlated subquery:

select yt.id, yt.rev, yt.content
    from YourTable yt
    where rev = 
        (select max(rev) from YourTable st where yt.id=st.id)

Having an index on (id,rev) renders the subquery almost as a simple lookup...
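
A minimal sketch of the index plus correlated subquery follows, assuming Python's sqlite3 rather than MySQL. The index name idx_id_rev and the content values are hypothetical, and the printed query plan is SQLite's, so it says nothing about how MySQL would plan the same query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL (assumption)
conn.execute("CREATE TABLE YourTable (id INTEGER, rev INTEGER, content TEXT)")
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?, ?)",
    [(1, 1, "a"), (2, 1, "b"), (1, 2, "c"), (1, 3, "d")],  # hypothetical contents
)

# Composite index on (id, rev); the name is made up for this example.
conn.execute("CREATE INDEX idx_id_rev ON YourTable (id, rev)")

query = """
    SELECT yt.id, yt.rev, yt.content
    FROM YourTable yt
    WHERE yt.rev = (SELECT MAX(st.rev) FROM YourTable st WHERE st.id = yt.id)
      AND yt.id IN (1, 2)
    ORDER BY yt.id
"""
rows = conn.execute(query).fetchall()
print(rows)  # [(1, 3, 'd'), (2, 1, 'b')]

# Show how SQLite plans the query; the exact text varies by SQLite version.
for step in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(step)
```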

Following are comparisons to the solutions in @AdrianCarneiro's answer (subquery, leftjoin), based on MySQL measurements with an InnoDB table of ~1 million records, with group sizes of 1-3.

While for full table scans the subquery/leftjoin/correlated timings relate to each other as 6/8/9, when it comes to direct lookups or batches (id in (1,2,3)), subquery is much slower than the others (due to rerunning the subquery). However, I couldn't differentiate between the leftjoin and correlated solutions in speed.

One final note, as leftjoin creates n*(n+1)/2 joins in groups, its performance can be heavily affected by the size of groups...

Answer by David Foster

I can't vouch for the performance, but here's a trick inspired by the limitations of Microsoft Excel. It has some good features:

GOOD STUFF

  • It should force return of only one "max record" even if there is a tie (sometimes useful)
  • It doesn't require a join

APPROACH

It is a little bit ugly and requires that you know something about the range of valid values of the rev column. Let us assume that we know the rev column is a number between 0.00 and 999 including decimals but that there will only ever be two digits to the right of the decimal point (e.g. 34.17 would be a valid value).

The gist of the thing is that you create a single synthetic column by string concatenating/packing the primary comparison field along with the data you want. In this way, you can force SQL's MAX() aggregate function to return all of the data (because it has been packed into a single column). Then you have to unpack the data.

Here's how it looks with the above example, written in SQL:

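SELECT id,
       CAST(SUBSTRING(MAX(packed_col) FROM 2 FOR 6) AS DECIMAL(5,2)) AS max_rev,
       -- content starts at character 12: 8 characters of number plus the 3-character '---'
       SUBSTRING(MAX(packed_col) FROM 12) AS content_for_max_rev
FROM  (SELECT id,
              -- CONCAT is used because MySQL treats || as OR by default
              CONCAT(CAST(1000 + rev + .001 AS CHAR), '---', CAST(content AS CHAR)) AS packed_col
       FROM yourtable
      ) AS packed  -- MySQL requires an alias on a derived table
GROUP BY id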

The packing begins by forcing the rev column to be a number of known character length regardless of the value of rev, so that, for example:

  • 3.2 becomes 1003.201
  • 57 becomes 1057.001
  • 923.88 becomes 1923.881

If you do it right, string comparison of two numbers should yield the same "max" as numeric comparison of the two numbers and it's easy to convert back to the original number using the substring function (which is available in one form or another pretty much everywhere).
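
The pack, MAX, unpack cycle can be tried out with the sketch below. It swaps in SQLite (through Python's sqlite3) for MySQL, so it uses SQLite's printf and substr(text, start, length) where the SQL above uses CAST and SUBSTRING ... FROM ... FOR; the content values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL (assumption)
conn.execute("CREATE TABLE yourtable (id INTEGER, rev REAL, content TEXT)")
conn.executemany(
    "INSERT INTO yourtable VALUES (?, ?, ?)",
    [(1, 1, "a"), (2, 1, "b"), (1, 2, "c"), (1, 3, "d")],  # hypothetical contents
)

rows = conn.execute("""
    SELECT id,
           -- characters 2..7 hold the number without the leading '1'
           CAST(substr(MAX(packed_col), 2, 6) AS REAL) AS max_rev,
           -- 8 characters of number + 3 of '---', so content starts at 12
           substr(MAX(packed_col), 12) AS content_for_max_rev
    FROM (SELECT id,
                 -- printf('%.3f', ...) fixes the width: rev 3 becomes '1003.001'
                 printf('%.3f', 1000 + rev + 0.001) || '---' || content AS packed_col
          FROM yourtable) AS packed
    GROUP BY id
    ORDER BY id
""").fetchall()

print(rows)  # [(1, 3.0, 'd'), (2, 1.0, 'b')]
```

Because every packed string starts with a number of the same width, the string MAX picks the same row that a numeric MAX on rev would.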

Answer by HoldOffHunger

Cleanest Solution

DB Fiddle

Newer versions of MySQL come with ONLY_FULL_GROUP_BY enabled by default, and many of the solutions here will fail in testing with this condition.

Even so, we can simply select DISTINCT someuniquefield, MAX( whateverotherfieldtoselect ), ( *somethirdfield ), etc., and have no worries understanding the result or how the query works:

SELECT DISTINCT t1.id, MAX(t1.rev), MAX(t2.content)
FROM Table1 AS t1
JOIN Table1 AS t2 ON t2.id = t1.id AND t2.rev = (
    SELECT MAX(rev) FROM Table1 t3 WHERE t3.id = t1.id
)
GROUP BY t1.id;
  • SELECT DISTINCT t1.id, MAX(t1.rev), MAX(t2.content): Return DISTINCT some field, MAX() of some other field. The last MAX() is redundant, because I know it's just one row, but it's required by the query.
  • FROM Table1 AS t1: Table searched on.
  • JOIN Table1 AS t2 ON t2.id = t1.id AND t2.rev = (...): Join the table to itself, because we need to get the content of the row holding MAX(t1.rev).
  • GROUP BY t1.id: Force the top-sorted row of each id to be the returned result.

Note that since "content" was "..." in OP's question, there's no way to test that this works. So, I changed that to "..a", "..b", so, we can actually now see that the results are correct:

id  max(Table1.rev) max(Table2.content)
1   3   ..d
2   1   ..b

Why is it clean? DISTINCT(), MAX(), etc., all make wonderful use of MySQL indices. This will be faster. Or, it will be much faster if you have indexing and you compare it to a query that looks at all rows.

Original Solution

With ONLY_FULL_GROUP_BY disabled, we can still use GROUP BY, but then we are only using it on the Salary, and not the id:

SELECT *
FROM
    (SELECT *
    FROM Employee
    ORDER BY Salary DESC)
AS employeesub
GROUP BY employeesub.Salary;
  • SELECT *: Return all fields.
  • FROM Employee: Table searched on.
  • (SELECT *...) subquery: Return all people, sorted by Salary.
  • GROUP BY employeesub.Salary: Force the top-sorted, Salary row of each employee to be the returned result.

Unique-Row Solution

Note the Definition of a Relational Database: "Each row in a table has its own unique key." This would mean that, in the question's example, id would have to be unique, and in that case, we can just do:

SELECT *
FROM Employee
WHERE Employee.id = 12345
ORDER BY Employee.Salary DESC
LIMIT 1

Hopefully this is a solution that solves the problem and helps everyone better understand what's happening in the DB.

Answer by Marc B

Something like this?

SELECT yourtable.id, rev, content
FROM yourtable
INNER JOIN (
    SELECT id, max(rev) as maxrev FROM yourtable
    GROUP BY id
) AS child ON (yourtable.id = child.id) AND (yourtable.rev = maxrev)

Answer by Bulat

I like to use a NOT EXISTS-based solution for this problem:

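SELECT
  id,
  rev
  -- you can select other columns here
FROM YourTable t
WHERE NOT EXISTS (
   -- the inner alias must differ from the outer one, otherwise each row is compared with itself
   SELECT * FROM YourTable t2 WHERE t2.id = t.id AND t2.rev > t.rev
)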

This will select all records with max value within the group and allows you to select other columns.
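
Here is a small runnable sketch of the NOT EXISTS form with an extra column selected. It assumes Python's sqlite3 rather than MySQL, the inner alias t2 is chosen for this example, and the content values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL (assumption)
conn.execute("CREATE TABLE YourTable (id INTEGER, rev INTEGER, content TEXT)")
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?, ?)",
    [(1, 1, "a"), (2, 1, "b"), (1, 2, "c"), (1, 3, "d")],  # hypothetical contents
)

# Keep a row only if no other row with the same id has a greater rev.
rows = conn.execute("""
    SELECT t.id, t.rev, t.content
    FROM YourTable t
    WHERE NOT EXISTS (
        SELECT 1 FROM YourTable t2
        WHERE t2.id = t.id AND t2.rev > t.rev
    )
    ORDER BY t.id
""").fetchall()

print(rows)  # [(1, 3, 'd'), (2, 1, 'b')]
```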

Answer by schlebe

Another way to do the job is to use the MAX() analytic function with an OVER (PARTITION BY ...) clause:

SELECT t.*
  FROM
    (
    SELECT id
          ,rev
          ,content
          ,MAX(rev) OVER (PARTITION BY id) as max_rev
      FROM YourTable
    ) t
  WHERE t.rev = t.max_rev 

The other ROW_NUMBER() OVER PARTITION solution already documented in this post is:

SELECT t.*
  FROM
    (
    SELECT id
          ,rev
          ,content
          -- the alias is rn because RANK is a reserved word in MySQL 8.0
          ,ROW_NUMBER() OVER (PARTITION BY id ORDER BY rev DESC) rn
      FROM YourTable
    ) t
  WHERE t.rn = 1

These 2 SELECTs work well on Oracle 10g.

The MAX() solution certainly runs faster than the ROW_NUMBER() solution because MAX() complexity is O(n) while ROW_NUMBER() complexity is at minimum O(n.log(n)), where n represents the number of records in the table!

Answer by guru008

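SELECT *
FROM Employee
-- match on the (Employe_id, Salary) pair so a row only matches its own group's maximum
WHERE (Employee.Employe_id, Employee.Salary) IN (SELECT Employe_id, MAX(Salary) FROM Employee GROUP BY Employe_id)
ORDER BY Employee.Salary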