SQL: Get the top 1 row of each group

Question

Asked by dpp

I have a table from which I want to get the latest entry for each group. Here's the table:

DocumentStatusLogs table

|ID| DocumentID | Status | DateCreated |
| 2| 1          | S1     | 7/29/2011   |
| 3| 1          | S2     | 7/30/2011   |
| 6| 1          | S1     | 8/02/2011   |
| 1| 2          | S1     | 7/28/2011   |
| 4| 2          | S2     | 7/30/2011   |
| 5| 2          | S3     | 8/01/2011   |
| 6| 3          | S1     | 8/02/2011   |

The table will be grouped by DocumentID and sorted by DateCreated in descending order. For each DocumentID, I want to get the latest status.

My preferred output:

| DocumentID | Status | DateCreated |
| 1          | S1     | 8/02/2011   |
| 2          | S3     | 8/01/2011   |
| 3          | S1     | 8/02/2011   |

Is there any aggregate function to get only the top from each group? See pseudo-code GetOnlyTheTop below:

SELECT
  DocumentID,
  GetOnlyTheTop(Status),
  GetOnlyTheTop(DateCreated)
FROM DocumentStatusLogs
GROUP BY DocumentID
ORDER BY DateCreated DESC

If no such function exists, is there any way I can achieve the output I want?
Or, in the first place, could this be caused by an unnormalized database? I'm thinking that, since what I'm looking for is just one row, should that status also be located in the parent table?

Please see the parent table for more information:

Current Documents table

| DocumentID | Title  | Content  | DateCreated |
| 1          | TitleA | ...      | ...         |
| 2          | TitleB | ...      | ...         |
| 3          | TitleC | ...      | ...         |

Should the parent table be like this so that I can easily access its status?

| DocumentID | Title  | Content  | DateCreated | CurrentStatus |
| 1          | TitleA | ...      | ...         | S1            |
| 2          | TitleB | ...      | ...         | S3            |
| 3          | TitleC | ...      | ...         | S1            |

UPDATE: I just learned how to use "apply", which makes it easier to address such problems.

Answer 1

Answered by gbn

;WITH cte AS
(
   SELECT *,
         ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
   FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rn = 1

If you expect two entries with the same DateCreated for a document (for example, two per day), ROW_NUMBER will arbitrarily pick one of them. To get both entries for a day, use DENSE_RANK instead.
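
A small, self-contained way to see how ROW_NUMBER and DENSE_RANK differ on tied dates. This is only a sketch: it runs against an in-memory SQLite database from Python rather than SQL Server (it assumes SQLite 3.25 or newer for window functions), the dates are rewritten in ISO format so they sort correctly as text, and the rows with ID 7 and ID 8 are made up so that document 3 has two entries on the same day.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentStatusLogs ("
    "ID INTEGER PRIMARY KEY, DocumentID INTEGER, Status TEXT, DateCreated TEXT)"
)
conn.executemany(
    "INSERT INTO DocumentStatusLogs VALUES (?, ?, ?, ?)",
    [
        (2, 1, "S1", "2011-07-29"),
        (3, 1, "S2", "2011-07-30"),
        (6, 1, "S1", "2011-08-02"),
        (1, 2, "S1", "2011-07-28"),
        (4, 2, "S2", "2011-07-30"),
        (5, 2, "S3", "2011-08-01"),
        (7, 3, "S1", "2011-08-02"),  # hypothetical ID (the question lists 6 twice)
        (8, 3, "S2", "2011-08-02"),  # hypothetical extra row creating a tie
    ],
)


def latest_per_document(rank_function):
    """Return the rows ranked first per DocumentID by the given window function."""
    sql = f"""
        WITH cte AS (
            SELECT DocumentID, Status, DateCreated,
                   {rank_function}() OVER (
                       PARTITION BY DocumentID ORDER BY DateCreated DESC
                   ) AS rn
            FROM DocumentStatusLogs
        )
        SELECT DocumentID, Status, DateCreated
        FROM cte
        WHERE rn = 1
        ORDER BY DocumentID, Status
    """
    return conn.execute(sql).fetchall()


row_number_rows = latest_per_document("ROW_NUMBER")
dense_rank_rows = latest_per_document("DENSE_RANK")

print(len(row_number_rows))  # one row per document; the tie is broken arbitrarily
print(len(dense_rank_rows))  # both tied rows for document 3 are kept
```

If you need a deterministic single row instead of an arbitrary one, adding a tie-breaker such as ID DESC to the ORDER BY inside OVER is the usual approach.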

As for normalised or not, it depends if you want to:

maintain status in 2 places
preserve status history
...

As it stands, you preserve status history. If you want the latest status in the parent table too (which is denormalisation), you'd need a trigger to maintain "status" in the parent, or you could drop this status history table.
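
The trigger option could look roughly like the following. It is a sketch only, written for SQLite via Python so it can be run as-is; SQL Server trigger syntax is different (it works with the inserted pseudo-table rather than NEW). The trigger name trg_sync_current_status, the trimmed-down Documents columns and the late-arriving S9 row are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Documents (
        DocumentID    INTEGER PRIMARY KEY,
        Title         TEXT,
        CurrentStatus TEXT              -- denormalised copy of the latest status
    );
    CREATE TABLE DocumentStatusLogs (
        ID          INTEGER PRIMARY KEY AUTOINCREMENT,
        DocumentID  INTEGER REFERENCES Documents(DocumentID),
        Status      TEXT,
        DateCreated TEXT
    );
    -- Hypothetical trigger: after every log insert, recompute the latest
    -- status for that document instead of blindly copying NEW.Status,
    -- so a back-dated log row cannot overwrite a newer status.
    CREATE TRIGGER trg_sync_current_status
    AFTER INSERT ON DocumentStatusLogs
    BEGIN
        UPDATE Documents
        SET CurrentStatus = (
            SELECT Status
            FROM DocumentStatusLogs
            WHERE DocumentID = NEW.DocumentID
            ORDER BY DateCreated DESC, ID DESC
            LIMIT 1
        )
        WHERE DocumentID = NEW.DocumentID;
    END;
""")

conn.execute("INSERT INTO Documents (DocumentID, Title) VALUES (1, 'TitleA')")
conn.executemany(
    "INSERT INTO DocumentStatusLogs (DocumentID, Status, DateCreated) VALUES (?, ?, ?)",
    [
        (1, "S1", "2011-07-29"),
        (1, "S2", "2011-07-30"),
        (1, "S1", "2011-08-02"),
        (1, "S9", "2011-07-01"),  # back-dated row, must not become current
    ],
)

current_status = conn.execute(
    "SELECT CurrentStatus FROM Documents WHERE DocumentID = 1"
).fetchone()[0]
print(current_status)
```

The history table stays the source of truth here; the parent column is just a cached value that the trigger keeps in step.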

Answer 2

Answered by dpp

I just learned how to use cross apply. Here's how to use it in this scenario:

select d.DocumentID, ds.Status, ds.DateCreated
from Documents as d
cross apply
    (select top 1 Status, DateCreated
     from DocumentStatusLogs
     where DocumentID = d.DocumentID
     order by DateCreated desc) as ds

Answer 3

Answered by John Fairbanks

I've done some timings over the various recommendations here, and the results really depend on the size of the table involved, but the most consistent solution is using CROSS APPLY. These tests were run against SQL Server 2008 R2, using a table with 6,500 records and another (identical schema) with 137 million records. The columns being queried are part of the primary key on the table, and the table width is very small (about 30 bytes). The times are reported by SQL Server from the actual execution plan.

Query                                  Time for 6500 (ms)    Time for 137M(ms)

CROSS APPLY                                    17.9                17.9
SELECT WHERE col = (SELECT MAX(COL)…)           6.6               854.4
DENSE_RANK() OVER PARTITION                     6.6               907.1

I think the really amazing thing was how consistent the time was for the CROSS APPLY regardless of the number of rows involved.

Answer 4

Answered by Josh Gilfillan

I know this is an old thread, but the TOP 1 WITH TIES solution is quite nice and might be helpful to some reading through the solutions.

select top 1 with ties
   DocumentID
  ,Status
  ,DateCreated
from DocumentStatusLogs
order by row_number() over (partition by DocumentID order by DateCreated desc)

More about the TOP clause can be found here.

Answer 5

Answered by Daniel Cotter

If you're worried about performance, you can also do this with MAX():

SELECT *
FROM DocumentStatusLogs D
WHERE DateCreated = (SELECT MAX(DateCreated) FROM DocumentStatusLogs WHERE DocumentID = D.DocumentID)

ROW_NUMBER() requires a sort of all the rows in your SELECT statement, whereas MAX() does not, so this should drastically speed up your query.

Answer 6

Answered by Ariel

SELECT * FROM
DocumentStatusLogs JOIN (
  SELECT DocumentID, MAX(DateCreated) DateCreated
  FROM DocumentStatusLogs
  GROUP BY DocumentID
  ) max_date USING (DocumentID, DateCreated)

What database server? This code doesn't work on all of them.

Regarding the second half of your question, it seems reasonable to me to include the status as a column. You can leave DocumentStatusLogs as a log, but still store the latest info in the main table.

BTW, if you already have the DateCreated column in the Documents table, you can just join DocumentStatusLogs using that (as long as DateCreated is unique in DocumentStatusLogs).

Edit: MsSQL does not support USING, so change it to:

ON DocumentStatusLogs.DocumentID = max_date.DocumentID AND DocumentStatusLogs.DateCreated = max_date.DateCreated
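
For reference, here is the whole join written with ON instead of USING, in a form that can be run end to end. It is a sketch against an in-memory SQLite database driven from Python, not SQL Server; the ISO-formatted dates and the ID 7 on the last row are assumptions (the question's table lists ID 6 twice), and the aliases l and max_date are only for readability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentStatusLogs ("
    "ID INTEGER PRIMARY KEY, DocumentID INTEGER, Status TEXT, DateCreated TEXT)"
)
conn.executemany(
    "INSERT INTO DocumentStatusLogs VALUES (?, ?, ?, ?)",
    [
        (2, 1, "S1", "2011-07-29"),
        (3, 1, "S2", "2011-07-30"),
        (6, 1, "S1", "2011-08-02"),
        (1, 2, "S1", "2011-07-28"),
        (4, 2, "S2", "2011-07-30"),
        (5, 2, "S3", "2011-08-01"),
        (7, 3, "S1", "2011-08-02"),  # hypothetical ID
    ],
)

# Join each log row to the per-document maximum date; only the rows that
# carry that maximum date survive the join.
latest_rows = conn.execute("""
    SELECT l.DocumentID, l.Status, l.DateCreated
    FROM DocumentStatusLogs AS l
    JOIN (
        SELECT DocumentID, MAX(DateCreated) AS DateCreated
        FROM DocumentStatusLogs
        GROUP BY DocumentID
    ) AS max_date
      ON l.DocumentID = max_date.DocumentID
     AND l.DateCreated = max_date.DateCreated
    ORDER BY l.DocumentID
""").fetchall()

for row in latest_rows:
    print(row)
```

Note that this form returns every row that shares the maximum date, so a document with two entries on its latest date would appear twice.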

Answer 7

Answered by Randall

This is one of the most easily found questions on the topic, so I wanted to give a modern answer to it (both for my reference and to help others out). By using first_value and over you can make short work of the above query:

Select distinct DocumentID
  , first_value(status) over (partition by DocumentID order by DateCreated Desc) as Status
  , first_value(DateCreated) over (partition by DocumentID order by DateCreated Desc) as DateCreated
From DocumentStatusLogs

This should work in SQL Server 2012 and up, which is where first_value was introduced. First_value can be thought of as a way to accomplish Select Top 1 when using an over clause. Over allows grouping in the select list, so instead of writing nested subqueries (like many of the existing answers do), this does it in a more readable fashion. Hope this helps.
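
To try the first_value approach without a SQL Server instance, a sketch like the one below may help. It assumes an in-memory SQLite database (3.25 or newer, for window functions) driven from Python, ISO-formatted dates, and a made-up ID 7 for the last row; the query itself is the same shape as the one in this answer, with an ORDER BY added so the output order is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentStatusLogs ("
    "ID INTEGER PRIMARY KEY, DocumentID INTEGER, Status TEXT, DateCreated TEXT)"
)
conn.executemany(
    "INSERT INTO DocumentStatusLogs VALUES (?, ?, ?, ?)",
    [
        (2, 1, "S1", "2011-07-29"),
        (3, 1, "S2", "2011-07-30"),
        (6, 1, "S1", "2011-08-02"),
        (1, 2, "S1", "2011-07-28"),
        (4, 2, "S2", "2011-07-30"),
        (5, 2, "S3", "2011-08-01"),
        (7, 3, "S1", "2011-08-02"),  # hypothetical ID
    ],
)

# Every row of a document gets the same first_value results, so DISTINCT
# collapses each partition down to a single output row.
latest_rows = conn.execute("""
    SELECT DISTINCT DocumentID
      , first_value(Status) OVER (
            PARTITION BY DocumentID ORDER BY DateCreated DESC) AS Status
      , first_value(DateCreated) OVER (
            PARTITION BY DocumentID ORDER BY DateCreated DESC) AS DateCreated
    FROM DocumentStatusLogs
    ORDER BY DocumentID
""").fetchall()

for row in latest_rows:
    print(row)
```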

Answer 8

Answered by Clint

This is quite an old thread, but I thought I'd throw my two cents in just the same, as the accepted answer didn't work particularly well for me. I tried gbn's solution on a large dataset and found it to be terribly slow (>45 seconds on 5 million plus records in SQL Server 2012). Looking at the execution plan, it's obvious that the issue is that it requires a SORT operation, which slows things down significantly.

Here's an alternative that I lifted from Entity Framework that needs no SORT operation and does a non-clustered index search. This reduces the execution time to < 2 seconds on the aforementioned record set.

SELECT 
[Limit1].[DocumentID] AS [DocumentID], 
[Limit1].[Status] AS [Status], 
[Limit1].[DateCreated] AS [DateCreated]
FROM   (SELECT DISTINCT [Extent1].[DocumentID] AS [DocumentID] FROM [dbo].[DocumentStatusLogs] AS [Extent1]) AS [Distinct1]
OUTER APPLY  (SELECT TOP (1) [Project2].[ID] AS [ID], [Project2].[DocumentID] AS [DocumentID], [Project2].[Status] AS [Status], [Project2].[DateCreated] AS [DateCreated]
    FROM (SELECT 
        [Extent2].[ID] AS [ID], 
        [Extent2].[DocumentID] AS [DocumentID], 
        [Extent2].[Status] AS [Status], 
        [Extent2].[DateCreated] AS [DateCreated]
        FROM [dbo].[DocumentStatusLogs] AS [Extent2]
        WHERE ([Distinct1].[DocumentID] = [Extent2].[DocumentID])
    )  AS [Project2]
    ORDER BY [Project2].[ID] DESC) AS [Limit1]

Now, I'm assuming something that isn't entirely specified in the original question: if your table design is such that your ID column is an auto-increment ID and DateCreated is set to the current date with each insert, then even without running my query above you could get a sizable performance boost over gbn's solution (about half the execution time) just from ordering on ID instead of ordering on DateCreated, as this will provide an identical sort order and it's a faster sort.

Answer 9

Answered by AnuPrakash

My code to select the top 1 from each group:

select a.*
from #DocumentStatusLogs a
where datecreated in (
    select top 1 datecreated
    from #DocumentStatusLogs b
    where a.documentid = b.documentid
    order by datecreated desc
)

Answer 10

Answered by praveen

This solution can be used to get the TOP N most recent rows for each partition (in the example, N is 1 in the WHERE clause and the partition is doc_id):

SELECT doc_id, status, date_created FROM
(
    SELECT a.*, ROW_NUMBER() OVER (PARTITION BY doc_id ORDER BY date_created DESC) AS rnk FROM doc a
) t
WHERE rnk = 1;
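
To illustrate the TOP N part, here is a sketch with N = 2. It uses the question's DocumentStatusLogs table rather than the doc table of this answer, runs on an in-memory SQLite database (3.25 or newer) from Python, and assumes ISO-formatted dates plus a made-up ID 7 for the last row; the function name top_n_per_document is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentStatusLogs ("
    "ID INTEGER PRIMARY KEY, DocumentID INTEGER, Status TEXT, DateCreated TEXT)"
)
conn.executemany(
    "INSERT INTO DocumentStatusLogs VALUES (?, ?, ?, ?)",
    [
        (2, 1, "S1", "2011-07-29"),
        (3, 1, "S2", "2011-07-30"),
        (6, 1, "S1", "2011-08-02"),
        (1, 2, "S1", "2011-07-28"),
        (4, 2, "S2", "2011-07-30"),
        (5, 2, "S3", "2011-08-01"),
        (7, 3, "S1", "2011-08-02"),  # hypothetical ID
    ],
)


def top_n_per_document(n):
    """Return the n most recent log rows for every DocumentID."""
    return conn.execute(
        """
        SELECT DocumentID, Status, DateCreated
        FROM (
            SELECT a.*,
                   ROW_NUMBER() OVER (
                       PARTITION BY DocumentID ORDER BY DateCreated DESC
                   ) AS rnk
            FROM DocumentStatusLogs a
        ) t
        WHERE rnk <= ?
        ORDER BY DocumentID, rnk
        """,
        (n,),
    ).fetchall()


top_two = top_n_per_document(2)
for row in top_two:
    print(row)
```

Document 3 has only one log row, so it contributes a single row even when N is 2.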
