Get top 1 row of each group
Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/6841605/
Asked by dpp
I have a table from which I want to get the latest entry for each group. Here's the table:
DocumentStatusLogs table
|ID| DocumentID | Status | DateCreated |
| 2| 1 | S1 | 7/29/2011 |
| 3| 1 | S2 | 7/30/2011 |
| 6| 1 | S1 | 8/02/2011 |
| 1| 2 | S1 | 7/28/2011 |
| 4| 2 | S2 | 7/30/2011 |
| 5| 2 | S3 | 8/01/2011 |
| 6| 3 | S1 | 8/02/2011 |
The table will be grouped by DocumentID and sorted by DateCreated in descending order. For each DocumentID, I want to get the latest status.
My preferred output:
| DocumentID | Status | DateCreated |
| 1 | S1 | 8/02/2011 |
| 2 | S3 | 8/01/2011 |
| 3 | S1 | 8/02/2011 |
Is there any aggregate function to get only the top row from each group? See the pseudo-code GetOnlyTheTop below:
SELECT DocumentID, GetOnlyTheTop(Status), GetOnlyTheTop(DateCreated)
FROM DocumentStatusLogs
GROUP BY DocumentID
ORDER BY DateCreated DESC
- If no such function exists, is there any way I can achieve the output I want?
- Or, in the first place, could this be caused by an unnormalized database? I'm thinking, since what I'm looking for is just one row, should that status also be located in the parent table?
Please see the parent table for more information:
Current Documents table
| DocumentID | Title | Content | DateCreated |
| 1 | TitleA | ... | ... |
| 2 | TitleB | ... | ... |
| 3 | TitleC | ... | ... |
Should the parent table be like this so that I can easily access its status?
| DocumentID | Title | Content | DateCreated | CurrentStatus |
| 1 | TitleA | ... | ... | s1 |
| 2 | TitleB | ... | ... | s3 |
| 3 | TitleC | ... | ... | s1 |
UPDATE: I just learned how to use "apply", which makes it easier to address such problems.
Answer by gbn
;WITH cte AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rn = 1
If you expect 2 entries per day, then this will arbitrarily pick one. To get both entries for a day, use DENSE_RANK instead.
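The difference between ROW_NUMBER and DENSE_RANK on tied dates can be sketched as follows. This is only an illustration, assuming SQLite (3.25 or later, for window functions) driven from Python's sqlite3 module as a stand-in for SQL Server, ISO-formatted date strings, and a hypothetical tie on 2011-08-02 that is not in the question's data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentStatusLogs "
    "(ID INTEGER, DocumentID INTEGER, Status TEXT, DateCreated TEXT)"
)
conn.executemany(
    "INSERT INTO DocumentStatusLogs VALUES (?, ?, ?, ?)",
    [
        (2, 1, "S1", "2011-07-29"),
        (3, 1, "S2", "2011-08-02"),  # hypothetical tie with ID 6
        (6, 1, "S1", "2011-08-02"),
        (1, 2, "S1", "2011-07-28"),
        (5, 2, "S3", "2011-08-01"),
    ],
)


def latest(rank_fn):
    """Return the rows ranked first per DocumentID by the given window function."""
    sql = f"""
        WITH cte AS (
            SELECT *,
                   {rank_fn}() OVER (PARTITION BY DocumentID
                                     ORDER BY DateCreated DESC) AS rn
            FROM DocumentStatusLogs
        )
        SELECT DocumentID, Status, DateCreated
        FROM cte
        WHERE rn = 1
        ORDER BY DocumentID, ID
    """
    return conn.execute(sql).fetchall()


row_number_rows = latest("ROW_NUMBER")  # one row per document, tie broken arbitrarily
dense_rank_rows = latest("DENSE_RANK")  # both tied rows are kept for document 1
print(row_number_rows)
print(dense_rank_rows)
```

With ROW_NUMBER each document yields exactly one row; with DENSE_RANK document 1 yields both rows dated 2011-08-02.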
As for normalised or not, it depends on whether you want to:
- maintain status in 2 places
- preserve status history
- ...
As it stands, you preserve status history. If you want the latest status in the parent table too (which is denormalisation), you'd need a trigger to maintain "status" in the parent, or you could drop this status history table.
Answer by dpp
I just learned how to use cross apply. Here's how to use it in this scenario:
select d.DocumentID, ds.Status, ds.DateCreated
from Documents as d
cross apply
(select top 1 Status, DateCreated
from DocumentStatusLogs
where DocumentID = d.DocumentId
order by DateCreated desc) as ds
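SQLite has no CROSS APPLY, so the following is only a rough, portable approximation of the same idea: for each row of Documents, look up the single newest log row. It assumes Python's sqlite3 module, ISO-formatted dates, unique log IDs (the last sample row is given ID 7 here), and a hypothetical document 4 with no log rows. Like CROSS APPLY, the inner join drops documents that have no matching log.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Documents (DocumentID INTEGER PRIMARY KEY, Title TEXT);
    CREATE TABLE DocumentStatusLogs (
        ID INTEGER PRIMARY KEY, DocumentID INTEGER, Status TEXT, DateCreated TEXT
    );
    -- Document 4 is hypothetical and has no log rows.
    INSERT INTO Documents VALUES
        (1, 'TitleA'), (2, 'TitleB'), (3, 'TitleC'), (4, 'TitleD');
    -- The last row uses ID 7 (an assumption) so that IDs are unique.
    INSERT INTO DocumentStatusLogs VALUES
        (2, 1, 'S1', '2011-07-29'),
        (3, 1, 'S2', '2011-07-30'),
        (6, 1, 'S1', '2011-08-02'),
        (1, 2, 'S1', '2011-07-28'),
        (4, 2, 'S2', '2011-07-30'),
        (5, 2, 'S3', '2011-08-01'),
        (7, 3, 'S1', '2011-08-02');
""")

# For each document, join to the one log row whose ID is returned by a
# correlated "newest first, take one" subquery.
latest_rows = conn.execute("""
    SELECT d.DocumentID, ds.Status, ds.DateCreated
    FROM Documents AS d
    JOIN DocumentStatusLogs AS ds
      ON ds.ID = (SELECT ID
                  FROM DocumentStatusLogs
                  WHERE DocumentID = d.DocumentID
                  ORDER BY DateCreated DESC
                  LIMIT 1)
    ORDER BY d.DocumentID
""").fetchall()
print(latest_rows)
```

On SQL Server, OUTER APPLY would be the variant to reach for if documents without any log rows should still appear in the result.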
Answer by John Fairbanks
I've done some timings over the various recommendations here, and the results really depend on the size of the table involved, but the most consistent solution is using CROSS APPLY. These tests were run against SQL Server 2008-R2, using a table with 6,500 records, and another (identical schema) with 137 million records. The columns being queried are part of the primary key on the table, and the table width is very small (about 30 bytes). The times are reported by SQL Server from the actual execution plan.
| Query | Time for 6,500 rows (ms) | Time for 137M rows (ms) |
| CROSS APPLY | 17.9 | 17.9 |
| SELECT WHERE col = (SELECT MAX(COL)…) | 6.6 | 854.4 |
| DENSE_RANK() OVER PARTITION | 6.6 | 907.1 |
I think the really amazing thing was how consistent the time was for the CROSS APPLY regardless of the number of rows involved.
Answer by Josh Gilfillan
I know this is an old thread, but the TOP 1 WITH TIES solution is quite nice and might be helpful to some reading through the solutions.
select top 1 with ties
DocumentID
,Status
,DateCreated
from DocumentStatusLogs
order by row_number() over (partition by DocumentID order by DateCreated desc)
More about the TOP clause can be found in the SQL Server documentation.
Answer by Daniel Cotter
If you're worried about performance, you can also do this with MAX():
SELECT *
FROM DocumentStatusLogs D
WHERE DateCreated = (SELECT MAX(DateCreated) FROM DocumentStatusLogs WHERE DocumentID = D.DocumentID) -- correlate on the group column, not the row ID
ROW_NUMBER() requires a sort of all the rows in your SELECT statement, whereas MAX does not. This should drastically speed up your query.
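A small sketch of the MAX() approach, assuming SQLite through Python's sqlite3 module as a stand-in for SQL Server, ISO-formatted dates, and unique log IDs (the last sample row is given ID 7 here). It also illustrates why the inner query has to be correlated on the grouping column DocumentID: correlating on the row's own ID compares each row only with itself, so every row passes the filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentStatusLogs "
    "(ID INTEGER PRIMARY KEY, DocumentID INTEGER, Status TEXT, DateCreated TEXT)"
)
conn.executemany(
    "INSERT INTO DocumentStatusLogs VALUES (?, ?, ?, ?)",
    [
        (2, 1, "S1", "2011-07-29"),
        (3, 1, "S2", "2011-07-30"),
        (6, 1, "S1", "2011-08-02"),
        (1, 2, "S1", "2011-07-28"),
        (4, 2, "S2", "2011-07-30"),
        (5, 2, "S3", "2011-08-01"),
        (7, 3, "S1", "2011-08-02"),  # ID 7 is an assumption to keep IDs unique
    ],
)

# Correlated on DocumentID: keeps only the newest row of each document.
by_document = conn.execute("""
    SELECT DocumentID, Status, DateCreated
    FROM DocumentStatusLogs AS D
    WHERE DateCreated = (SELECT MAX(DateCreated)
                         FROM DocumentStatusLogs
                         WHERE DocumentID = D.DocumentID)
    ORDER BY DocumentID
""").fetchall()

# Correlated on ID: each row is its own group, so nothing is filtered out.
by_row_id = conn.execute("""
    SELECT DocumentID, Status, DateCreated
    FROM DocumentStatusLogs AS D
    WHERE DateCreated = (SELECT MAX(DateCreated)
                         FROM DocumentStatusLogs
                         WHERE ID = D.ID)
""").fetchall()

print(by_document)
print(len(by_row_id))
```

Note that if two rows of one document share the maximum date, this approach returns both of them.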
Answer by Ariel
SELECT * FROM
DocumentStatusLogs JOIN (
SELECT DocumentID, MAX(DateCreated) DateCreated
FROM DocumentStatusLogs
GROUP BY DocumentID
) max_date USING (DocumentID, DateCreated)
What database server? This code doesn't work on all of them.
Regarding the second half of your question, it seems reasonable to me to include the status as a column. You can leave DocumentStatusLogs as a log, but still store the latest info in the main table.
BTW, if you already have the DateCreated column in the Documents table you can just join DocumentStatusLogs using that (as long as DateCreated is unique in DocumentStatusLogs).
Edit: MsSQL does not support USING, so change it to:
ON DocumentStatusLogs.DocumentID = max_date.DocumentID AND DocumentStatusLogs.DateCreated = max_date.DateCreated
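Put together, the join against the per-document maximum date with an explicit ON condition might look like the sketch below. It assumes SQLite through Python's sqlite3 module as a stand-in for SQL Server and ISO-formatted dates; the table aliases l and max_date are only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentStatusLogs "
    "(ID INTEGER, DocumentID INTEGER, Status TEXT, DateCreated TEXT)"
)
conn.executemany(
    "INSERT INTO DocumentStatusLogs VALUES (?, ?, ?, ?)",
    [
        (2, 1, "S1", "2011-07-29"),
        (3, 1, "S2", "2011-07-30"),
        (6, 1, "S1", "2011-08-02"),
        (1, 2, "S1", "2011-07-28"),
        (4, 2, "S2", "2011-07-30"),
        (5, 2, "S3", "2011-08-01"),
        (6, 3, "S1", "2011-08-02"),
    ],
)

# The derived table holds one (DocumentID, newest date) pair per document;
# joining back on both columns recovers the full newest row.
joined_rows = conn.execute("""
    SELECT l.DocumentID, l.Status, l.DateCreated
    FROM DocumentStatusLogs AS l
    JOIN (
        SELECT DocumentID, MAX(DateCreated) AS DateCreated
        FROM DocumentStatusLogs
        GROUP BY DocumentID
    ) AS max_date
      ON l.DocumentID = max_date.DocumentID
     AND l.DateCreated = max_date.DateCreated
    ORDER BY l.DocumentID
""").fetchall()
print(joined_rows)
```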
Answer by Randall
This is one of the most easily found questions on the topic, so I wanted to give a modern answer to it (both for my reference and to help others out). By using first_value and over you can make short work of the above query:
Select distinct DocumentID
, first_value(status) over (partition by DocumentID order by DateCreated Desc) as Status
, first_value(DateCreated) over (partition by DocumentID order by DateCreated Desc) as DateCreated
From DocumentStatusLogs
This should work in SQL Server 2012 and up, where FIRST_VALUE was introduced. first_value can be thought of as a way to accomplish Select Top 1 when using an over clause. Over allows grouping in the select list, so instead of writing nested subqueries (like many of the existing answers do), this does it in a more readable fashion. Hope this helps.
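To see what the DISTINCT is doing in this query, here is a small sketch, assuming SQLite (3.25 or later, for window functions) through Python's sqlite3 module as a stand-in for SQL Server, and ISO-formatted dates. The window functions produce the same "latest" values on every row of a document, and DISTINCT then collapses those repeated rows to one per document.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentStatusLogs "
    "(ID INTEGER, DocumentID INTEGER, Status TEXT, DateCreated TEXT)"
)
conn.executemany(
    "INSERT INTO DocumentStatusLogs VALUES (?, ?, ?, ?)",
    [
        (2, 1, "S1", "2011-07-29"),
        (3, 1, "S2", "2011-07-30"),
        (6, 1, "S1", "2011-08-02"),
        (1, 2, "S1", "2011-07-28"),
        (4, 2, "S2", "2011-07-30"),
        (5, 2, "S3", "2011-08-01"),
        (6, 3, "S1", "2011-08-02"),
    ],
)

# {distinct} is filled with either "DISTINCT" or an empty string below.
QUERY = """
    SELECT {distinct} DocumentID,
           first_value(Status) OVER (PARTITION BY DocumentID
                                     ORDER BY DateCreated DESC) AS Status,
           first_value(DateCreated) OVER (PARTITION BY DocumentID
                                          ORDER BY DateCreated DESC) AS DateCreated
    FROM DocumentStatusLogs
    ORDER BY DocumentID
"""

all_rows = conn.execute(QUERY.format(distinct="")).fetchall()
distinct_rows = conn.execute(QUERY.format(distinct="DISTINCT")).fetchall()
print(len(all_rows))   # one output row per input row
print(distinct_rows)   # one output row per document
```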
Answer by Clint
This is quite an old thread, but I thought I'd throw my two cents in just the same, as the accepted answer didn't work particularly well for me. I tried gbn's solution on a large dataset and found it to be terribly slow (>45 seconds on 5 million plus records in SQL Server 2012). Looking at the execution plan, it's obvious that the issue is that it requires a SORT operation, which slows things down significantly.
Here's an alternative that I lifted from the Entity Framework; it needs no SORT operation and does a non-clustered index search. This reduces the execution time to < 2 seconds on the aforementioned record set.
SELECT
[Limit1].[DocumentID] AS [DocumentID],
[Limit1].[Status] AS [Status],
[Limit1].[DateCreated] AS [DateCreated]
FROM (SELECT DISTINCT [Extent1].[DocumentID] AS [DocumentID] FROM [dbo].[DocumentStatusLogs] AS [Extent1]) AS [Distinct1]
OUTER APPLY (SELECT TOP (1) [Project2].[ID] AS [ID], [Project2].[DocumentID] AS [DocumentID], [Project2].[Status] AS [Status], [Project2].[DateCreated] AS [DateCreated]
FROM (SELECT
[Extent2].[ID] AS [ID],
[Extent2].[DocumentID] AS [DocumentID],
[Extent2].[Status] AS [Status],
[Extent2].[DateCreated] AS [DateCreated]
FROM [dbo].[DocumentStatusLogs] AS [Extent2]
WHERE ([Distinct1].[DocumentID] = [Extent2].[DocumentID])
) AS [Project2]
ORDER BY [Project2].[ID] DESC) AS [Limit1]
Now I'm assuming something that isn't entirely specified in the original question, but if your table design is such that your ID column is an auto-increment ID, and DateCreated is set to the current date with each insert, then even without running my query above you could get a sizable performance boost to gbn's solution (about half the execution time) just from ordering on ID instead of ordering on DateCreated, as this will provide an identical sort order and it's a faster sort.
Answer by AnuPrakash
My code to select the top 1 from each group:
select a.*
from #DocumentStatusLogs a
where datecreated in (
    select top 1 datecreated
    from #DocumentStatusLogs b
    where a.documentid = b.documentid
    order by datecreated desc
)
Answer by praveen
This solution can be used to get the TOP N most recent rows for each partition (in the example, N is 1 in the WHERE clause and the partition is doc_id):
SELECT doc_id, status, date_created FROM
(
SELECT a.*, ROW_NUMBER() OVER (PARTITION BY doc_id ORDER BY date_created DESC) AS rnk FROM doc a
) t -- derived-table alias, required by SQL Server
WHERE rnk = 1;
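Generalising from 1 to N only changes the filter to rnk <= N. A minimal sketch, assuming SQLite (3.25 or later) through Python's sqlite3 module, ISO-formatted dates, and the question's sample data loaded into a table named doc with the column names used in this answer; the helper top_n is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doc (doc_id INTEGER, status TEXT, date_created TEXT)")
conn.executemany(
    "INSERT INTO doc VALUES (?, ?, ?)",
    [
        (1, "S1", "2011-07-29"),
        (1, "S2", "2011-07-30"),
        (1, "S1", "2011-08-02"),
        (2, "S1", "2011-07-28"),
        (2, "S2", "2011-07-30"),
        (2, "S3", "2011-08-01"),
        (3, "S1", "2011-08-02"),
    ],
)


def top_n(n):
    """Return the n most recent rows of each doc_id, newest first."""
    return conn.execute(
        """
        SELECT doc_id, status, date_created
        FROM (
            SELECT a.*,
                   ROW_NUMBER() OVER (PARTITION BY doc_id
                                      ORDER BY date_created DESC) AS rnk
            FROM doc a
        ) t
        WHERE rnk <= ?
        ORDER BY doc_id, date_created DESC
        """,
        (n,),
    ).fetchall()


print(top_n(1))
print(top_n(2))
```

Documents with fewer than N rows simply return all the rows they have, as document 3 does for N = 2.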