Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it under the same license, but you must cite the original URL and attribute it to the original authors (not me): Stack Overflow
Original source: http://stackoverflow.com/questions/7745609/
SQL select only rows with max value on a column
Asked by Majid Fouladpour
I have this table for documents (simplified version here):
+------+-------+--------------------------------------+
| id   | rev   | content                              |
+------+-------+--------------------------------------+
| 1    | 1     | ...                                  |
| 2    | 1     | ...                                  |
| 1    | 2     | ...                                  |
| 1    | 3     | ...                                  |
+------+-------+--------------------------------------+
How do I select one row per id and only the greatest rev?
With the above data, the result should contain two rows: [1, 3, ...] and [2, 1, ..]. I'm using MySQL.
Currently I use checks in the while loop to detect and overwrite old revs from the result set. But is this the only method to achieve the result? Isn't there a SQL solution?
Update
As the answers suggest, there is a SQL solution, and here is a sqlfiddle demo.
Update 2
I noticed after adding the above sqlfiddle, the rate at which the question is upvoted has surpassed the upvote rate of the answers. That has not been the intention! The fiddle is based on the answers, especially the accepted answer.
Answered by Adriano Carneiro
At first glance...
All you need is a GROUP BY clause with the MAX aggregate function:
SELECT id, MAX(rev)
FROM YourTable
GROUP BY id
It's never that simple, is it?
I just noticed you need the content column as well.
This is a very common question in SQL: find the whole data for the row with some max value in a column per some group identifier. I heard that a lot during my career. Actually, it was one of the questions I answered in my current job's technical interview.
It is, actually, so common that the Stack Overflow community has created a single tag just to deal with questions like that: greatest-n-per-group.
Basically, you have two approaches to solve that problem:
Joining with simple group-identifier, max-value-in-group Sub-query
In this approach, you first find the group-identifier, max-value-in-group pair (already solved above) in a sub-query. Then you join your table to the sub-query with equality on both group-identifier and max-value-in-group:
SELECT a.id, a.rev, a.content
FROM YourTable a
INNER JOIN (
    SELECT id, MAX(rev) rev
    FROM YourTable
    GROUP BY id
) b ON a.id = b.id AND a.rev = b.rev
Left Joining with self, tweaking join conditions and filters
In this approach, you left join the table with itself. Equality goes in the group-identifier. Then, 2 smart moves:
- The second join condition is having left side value less than right value
- When you do step 1, the row(s) that actually have the max value will have NULL in the right side (it's a LEFT JOIN, remember?). Then, we filter the joined result, showing only the rows where the right side is NULL.
So you end up with:
SELECT a.*
FROM YourTable a
LEFT OUTER JOIN YourTable b
ON a.id = b.id AND a.rev < b.rev
WHERE b.id IS NULL;
Conclusion
Both approaches bring the exact same result.
If you have two rows with max-value-in-group for group-identifier, both rows will be in the result in both approaches.
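To check the two claims above (same result, ties kept by both), here is a small sketch that runs both queries against an in-memory SQLite database from Python. SQLite stands in for MySQL here, and the sample contents ("a" to "e") plus the extra tied row for id 2 are assumptions made up for the test, not data from the question.

```python
import sqlite3

# Hypothetical sample data: the question's rows plus a tie on rev for id = 2.
rows = [(1, 1, "a"), (2, 1, "b"), (1, 2, "c"), (1, 3, "d"), (2, 1, "e")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (id INTEGER, rev INTEGER, content TEXT)")
conn.executemany("INSERT INTO YourTable VALUES (?, ?, ?)", rows)

# Approach 1: join against the (id, MAX(rev)) sub-query.
join_subquery = """
    SELECT a.id, a.rev, a.content
    FROM YourTable a
    INNER JOIN (
        SELECT id, MAX(rev) AS rev
        FROM YourTable
        GROUP BY id
    ) b ON a.id = b.id AND a.rev = b.rev
"""

# Approach 2: left self-join, keeping rows that have no higher rev in their group.
left_join = """
    SELECT a.id, a.rev, a.content
    FROM YourTable a
    LEFT OUTER JOIN YourTable b
        ON a.id = b.id AND a.rev < b.rev
    WHERE b.id IS NULL
"""

# Sort in Python so the comparison does not depend on row order.
result_subquery = sorted(conn.execute(join_subquery).fetchall())
result_left_join = sorted(conn.execute(left_join).fetchall())

print(result_subquery)
print(result_left_join)
```

Both lists should come out identical, with the two tied rows for id 2 present in each.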
Both approaches are ANSI SQL compatible and thus will work with your favorite RDBMS, regardless of its "flavor".
Both approaches are also performance friendly, however your mileage may vary (RDBMS, DB structure, indexes, etc.). So when you pick one approach over the other, benchmark. And make sure you pick the one which makes the most sense to you.
Answered by Kevin Burton
My preference is to use as little code as possible...
You can do it using IN. Try this:
SELECT *
FROM t1 WHERE (id,rev) IN
( SELECT id, MAX(rev)
FROM t1
GROUP BY id
)
To my mind it is less complicated... easier to read and maintain.
Answered by topchef
I am flabbergasted that no answer offered a SQL window function solution:
SELECT a.id, a.rev, a.content
FROM (SELECT id, rev, content,
             -- alias is rn rather than rank, because RANK is a reserved word in MySQL 8.0
             ROW_NUMBER() OVER (PARTITION BY id ORDER BY rev DESC) rn
      FROM YourTable) a
WHERE a.rn = 1
Added in the ANSI/ISO Standard SQL:2003 and later extended with ANSI/ISO Standard SQL:2008, window (or windowing) functions are available with all major vendors now. There are more types of rank functions available to deal with a tie issue: RANK, DENSE_RANK, PERCENT_RANK.
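To illustrate the tie issue, here is a minimal sketch comparing ROW_NUMBER with RANK, run against in-memory SQLite from Python (this assumes the bundled SQLite is 3.25 or newer, which is when it gained window functions). The helper top_rows, the alias rn, and the sample rows with a tie on id 2 are hypothetical, made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (id INTEGER, rev INTEGER, content TEXT)")
# Hypothetical data: id = 2 has two rows sharing the highest rev.
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?, ?)",
    [(1, 3, "d"), (1, 2, "c"), (2, 1, "b"), (2, 1, "e")],
)


def top_rows(rank_function):
    """Return the rows numbered 1 per id by the given window function."""
    query = f"""
        SELECT id, rev, content
        FROM (
            SELECT id, rev, content,
                   {rank_function}() OVER (PARTITION BY id ORDER BY rev DESC) AS rn
            FROM YourTable
        ) AS ranked
        WHERE rn = 1
    """
    return sorted(conn.execute(query).fetchall())


row_number_rows = top_rows("ROW_NUMBER")  # exactly one row per id
rank_rows = top_rows("RANK")              # tied rows all get rank 1

print(row_number_rows)
print(rank_rows)
```

ROW_NUMBER keeps one row per id (which of the tied rows survives is not defined), while RANK keeps every row that shares the top rev.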
Answered by Vajk Hermecz
Yet another solution is to use a correlated subquery:
select yt.id, yt.rev, yt.content
from YourTable yt
where yt.rev =
    (select max(st.rev) from YourTable st where st.id = yt.id)
Having an index on (id,rev) renders the subquery almost as a simple lookup...
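As a rough sketch of that setup, the snippet below creates the composite index and runs the correlated subquery against in-memory SQLite from Python. The index name idx_id_rev and the sample contents are assumptions, and SQLite's planner is not MySQL's, so the printed plan only hints at how the index can be used; benchmark on your own server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (id INTEGER, rev INTEGER, content TEXT)")
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?, ?)",
    [(1, 1, "a"), (2, 1, "b"), (1, 2, "c"), (1, 3, "d")],
)

# Composite index on (id, rev); the name is hypothetical.
conn.execute("CREATE INDEX idx_id_rev ON YourTable (id, rev)")

query = """
    SELECT yt.id, yt.rev, yt.content
    FROM YourTable yt
    WHERE yt.rev = (SELECT MAX(st.rev) FROM YourTable st WHERE st.id = yt.id)
"""

latest = sorted(conn.execute(query).fetchall())
print(latest)

# Show how SQLite plans the query (output format varies by SQLite version).
for step in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(step)
```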
Following are comparisons to the solutions in @AdrianCarneiro's answer (subquery, leftjoin), based on MySQL measurements with an InnoDB table of ~1 million records, group size being 1-3.
While for full table scans subquery/leftjoin/correlated timings relate to each other as 6/8/9, when it comes to direct lookups or batch (id in (1,2,3)), subquery is much slower than the others (due to rerunning the subquery). However, I couldn't differentiate between leftjoin and correlated solutions in speed.
One final note, as leftjoin creates n*(n+1)/2 joins in groups, its performance can be heavily affected by the size of groups...
Answered by David Foster
I can't vouch for the performance, but here's a trick inspired by the limitations of Microsoft Excel. It has some good features.
GOOD STUFF
- It should force return of only one "max record" even if there is a tie (sometimes useful)
- It doesn't require a join
APPROACH
It is a little bit ugly and requires that you know something about the range of valid values of the rev column. Let us assume that we know the rev column is a number between 0.00 and 999 including decimals but that there will only ever be two digits to the right of the decimal point (e.g. 34.17 would be a valid value).
The gist of the thing is that you create a single synthetic column by string concatenating/packing the primary comparison field along with the data you want. In this way, you can force SQL's MAX() aggregate function to return all of the data (because it has been packed into a single column). Then you have to unpack the data.
Here's how it looks with the above example, written in SQL:
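SELECT id,
       CAST(SUBSTRING(MAX(packed_col) FROM 2 FOR 6) AS float) AS max_rev,
       -- the packed number is 8 characters and '---' is 3, so content starts at position 12
       SUBSTRING(MAX(packed_col) FROM 12) AS content_for_max_rev
FROM (SELECT id,
             -- || is standard SQL concatenation; MySQL needs CONCAT() unless PIPES_AS_CONCAT is enabled
             CAST(1000 + rev + .001 AS CHAR) || '---' || CAST(content AS CHAR) AS packed_col
      FROM yourtable
     ) AS packed  -- MySQL requires an alias on a derived table
GROUP BY id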
The packing begins by forcing the rev column to be a number of known character length regardless of the value of rev, so that for example
- 3.2 becomes 1003.201
- 57 becomes 1057.001
- 923.88 becomes 1923.881
If you do it right, string comparison of two numbers should yield the same "max" as numeric comparison of the two numbers and it's easy to convert back to the original number using the substring function (which is available in one form or another pretty much everywhere).
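The packing and unpacking can be tried outside the database. Below is a minimal Python sketch of the same idea; the helpers pack and unpack and the sample contents "x", "y", "z" are hypothetical, and it assumes rev stays within 0 to 999 with at most two decimals, as stated above.

```python
def pack(rev, content):
    # 1000 + rev + 0.001 yields a fixed-width, 8-character number
    # (e.g. 3.2 -> "1003.201"), followed by the "---" separator.
    return f"{1000 + rev + 0.001:.3f}" + "---" + content


def unpack(packed):
    rev = float(packed[1:7])  # characters 2..7, e.g. "003.20"
    content = packed[11:]     # skip the 8-character number and "---"
    return rev, content


rows = [(3.2, "x"), (57, "y"), (923.88, "z")]
packed_rows = [pack(rev, content) for rev, content in rows]
print(packed_rows)

# Plain string comparison picks the same row as numeric comparison on rev.
best = max(packed_rows)
print(unpack(best))
```

Because every packed number has the same width, the string maximum and the numeric maximum agree, which is the whole point of the padding.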
Answered by HoldOffHunger
Cleanest Solution
Newer versions of MySQL come with ONLY_FULL_GROUP_BY enabled by default, and many of the solutions here will fail in testing with this condition.
Even so, we can simply select DISTINCT someuniquefield, MAX(whateverotherfieldtoselect), (somethirdfield), etc., and have no worries understanding the result or how the query works:
SELECT DISTINCT t1.id, MAX(t1.rev), MAX(t2.content)
FROM Table1 AS t1
JOIN Table1 AS t2 ON t2.id = t1.id AND t2.rev = (
SELECT MAX(rev) FROM Table1 t3 WHERE t3.id = t1.id
)
GROUP BY t1.id;
- SELECT DISTINCT t1.id, MAX(t1.rev), MAX(t2.content): Return DISTINCT somefield, MAX() some otherfield. The last MAX() is redundant, because I know it's just one row, but it's required by the query.
- FROM Table1 AS t1: Table searched on.
- JOIN Table1 AS t2 ON t2.id = t1.id AND t2.rev = (...): Join the table to itself, because we need to get the content for max(t1.rev).
- GROUP BY t1.id: Force the top-sorted rev row of each id to be the returned result.
Note that since "content" was "..." in OP's question, there's no way to test that this works. So, I changed that to "..a", "..b", so, we can actually now see that the results are correct:
id    max(Table1.rev)    max(Table2.content)
1     3                  ..d
2     1                  ..b
Why is it clean? DISTINCT(), MAX(), etc., all make wonderful use of MySQL indices. This will be faster. Or, it will be much faster, if you have indexing, and you compare it to a query that looks at all rows.
Original Solution
With ONLY_FULL_GROUP_BY disabled, we can still use GROUP BY, but then we are only using it on the Salary, and not the id:
SELECT *
FROM
(SELECT *
FROM Employee
ORDER BY Salary DESC)
AS employeesub
GROUP BY employeesub.Salary;
- SELECT *: Return all fields.
- FROM Employee: Table searched on.
- (SELECT *...) subquery: Return all people, sorted by Salary.
- GROUP BY employeesub.Salary: Force the top-sorted, Salary row of each employee to be the returned result.
Unique-Row Solution
Note the Definition of a Relational Database: "Each row in a table has its own unique key." This would mean that, in the question's example, id would have to be unique, and in that case, we can just do:
SELECT *
FROM Employee
WHERE Employee.id = 12345
ORDER BY Employee.Salary DESC
LIMIT 1
Hopefully this is a solution that solves the problem and helps everyone better understand what's happening in the DB.
Answered by Marc B
Something like this?
SELECT yourtable.id, rev, content
FROM yourtable
INNER JOIN (
    SELECT id, max(rev) as maxrev
    FROM yourtable
    GROUP BY id
) AS child ON (yourtable.id = child.id) AND (yourtable.rev = maxrev)
Answered by Bulat
I like to use a NOT EXISTS-based solution for this problem:
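SELECT
    t.id,
    t.rev
    -- you can select other columns here
FROM YourTable t
WHERE NOT EXISTS (
    -- the inner alias must differ from the outer one, otherwise the comparison is never true
    SELECT * FROM YourTable t2 WHERE t2.id = t.id AND t2.rev > t.rev
)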
This will select all records with max value within the group and allows you to select other columns.
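A quick way to try the NOT EXISTS form is an in-memory SQLite database driven from Python, as sketched below. The inner alias newer and the sample contents are assumptions for the illustration; the key detail is that the inner and outer table aliases differ, so the comparison really is between two different rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (id INTEGER, rev INTEGER, content TEXT)")
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?, ?)",
    [(1, 1, "a"), (2, 1, "b"), (1, 2, "c"), (1, 3, "d")],
)

# Keep a row only if no other row with the same id has a higher rev.
query = """
    SELECT t.id, t.rev, t.content
    FROM YourTable t
    WHERE NOT EXISTS (
        SELECT 1
        FROM YourTable newer
        WHERE newer.id = t.id AND newer.rev > t.rev
    )
"""

latest_rows = sorted(conn.execute(query).fetchall())
print(latest_rows)
```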
Answered by schlebe
Another way to do the job is to use the MAX() analytic function with an OVER (PARTITION BY ...) clause:
SELECT t.*
FROM
    (
    SELECT id
          ,rev
          ,content
          ,MAX(rev) OVER (PARTITION BY id) as max_rev
    FROM YourTable
    ) t
WHERE t.rev = t.max_rev
The other ROW_NUMBER() OVER PARTITION solution already documented in this post is:
SELECT t.*
FROM
    (
    SELECT id
          ,rev
          ,content
          -- alias is rn rather than rank, because RANK is a reserved word in MySQL 8.0
          ,ROW_NUMBER() OVER (PARTITION BY id ORDER BY rev DESC) rn
    FROM YourTable
    ) t
WHERE t.rn = 1
These two SELECT statements work well on Oracle 10g.
The MAX() solution should run faster than the ROW_NUMBER() solution, because MAX() complexity is O(n) while ROW_NUMBER() complexity is at minimum O(n log n), where n represents the number of records in the table.
Answered by guru008
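SELECT *
FROM Employee
-- match on the (Employe_id, Salary) pair, so a salary that is only another employee's maximum is not picked up
WHERE (Employee.Employe_id, Employee.Salary) IN (SELECT Employe_id, MAX(Salary) FROM Employee GROUP BY Employe_id)
ORDER BY Employee.Salary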