SQL 合并重叠的日期间隔

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/2561130/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-01 05:53:15  来源:igfitidea点击:

Merge overlapping date intervals

sqlsql-servertsql

提问by leoinfo

Is there a better way of merging overlapping date intervals?
The solution I came up with is so simple that now I wonder if someone else has a better idea of how this could be done.

有没有更好的方法来合并重叠的日期间隔?
我想出的解决方案非常简单,现在我想知道其他人是否对如何做到这一点有更好的了解。

/***** DATA EXAMPLE *****/
DECLARE @T TABLE (d1 DATETIME, d2 DATETIME)
INSERT INTO @T (d1, d2)
        SELECT '2010-01-01','2010-03-31' UNION SELECT '2010-04-01','2010-05-31' 
  UNION SELECT '2010-06-15','2010-06-25' UNION SELECT '2010-06-26','2010-07-10' 
  UNION SELECT '2010-08-01','2010-08-05' UNION SELECT '2010-08-01','2010-08-09' 
  UNION SELECT '2010-08-02','2010-08-07' UNION SELECT '2010-08-08','2010-08-08' 
  UNION SELECT '2010-08-09','2010-08-12' UNION SELECT '2010-07-04','2010-08-16' 
  UNION SELECT '2010-11-01','2010-12-31' UNION SELECT '2010-03-01','2010-06-13' 

/***** INTERVAL ANALYSIS *****/
WHILE (1=1)  BEGIN
  UPDATE t1 SET t1.d2 = t2.d2
  FROM @T AS t1 INNER JOIN @T AS t2 ON 
            DATEADD(day, 1, t1.d2) BETWEEN t2.d1 AND t2.d2 
  IF @@ROWCOUNT = 0 BREAK
END

/***** RESULT *****/
SELECT StartDate = MIN(d1) , EndDate = d2
FROM @T
GROUP BY d2
ORDER BY StartDate, EndDate

/***** OUTPUT *****/
/*****
StartDate   EndDate
2010-01-01  2010-06-13 
2010-06-15  2010-08-16 
2010-11-01  2010-12-31 
*****/

回答by user1045402

I was looking for the same solution and came across this post on Combine overlapping datetime to return single overlapping range record.

我正在寻找相同的解决方案,并在结合重叠日期时间以返回单个重叠范围记录上遇到了这篇文章。

There is another thread on Packing Date Intervals.

包装日期间隔还有另一个主题。

I tested this with various date ranges, including the ones listed here, and it works correctly every time.

我使用各种日期范围(包括此处列出的日期范围)对此进行了测试,并且每次都能正常工作。



SELECT 
       s1.StartDate,
       --t1.EndDate 
       MIN(t1.EndDate) AS EndDate
FROM @T s1 
INNER JOIN @T t1 ON s1.StartDate <= t1.EndDate
  AND NOT EXISTS(SELECT * FROM @T t2 
                 WHERE t1.EndDate >= t2.StartDate AND t1.EndDate < t2.EndDate) 
WHERE NOT EXISTS(SELECT * FROM @T s2 
                 WHERE s1.StartDate > s2.StartDate AND s1.StartDate <= s2.EndDate) 
GROUP BY s1.StartDate 
ORDER BY s1.StartDate 


The result is:

结果是:

StartDate  | EndDate
2010-01-01 | 2010-06-13
2010-06-15 | 2010-06-25
2010-06-26 | 2010-08-16
2010-11-01 | 2010-12-31

回答by Martin Smith

You asked this back in 2010 but don't specify any particular version.

您在 2010 年问过这个问题,但没有指定任何特定版本。

An answer for people on SQL Server 2012+

对 SQL Server 2012+ 人员的回答

WITH T1
     AS (SELECT *,
                MAX(d2) OVER (ORDER BY d1) AS max_d2_so_far
         FROM   @T),
     T2
     AS (SELECT *,
                CASE
                  WHEN d1 <= DATEADD(DAY, 1, LAG(max_d2_so_far) OVER (ORDER BY d1))
                    THEN 0
                  ELSE 1
                END AS range_start
         FROM   T1),
     T3
     AS (SELECT *,
                SUM(range_start) OVER (ORDER BY d1) AS range_group
         FROM   T2)
SELECT range_group,
       MIN(d1) AS d1,
       MAX(d2) AS d2
FROM   T3
GROUP  BY range_group 

Which returns

哪个返回

+-------------+------------+------------+
| range_group |     d1     |     d2     |
+-------------+------------+------------+
|           1 | 2010-01-01 | 2010-06-13 |
|           2 | 2010-06-15 | 2010-08-16 |
|           3 | 2010-11-01 | 2010-12-31 |
+-------------+------------+------------+

DATEADD(DAY, 1is used because your desired results show you want a period ending on 2010-06-25to be collapsed into one starting 2010-06-26. For other use cases this may need adjusting.

DATEADD(DAY, 1使用是因为您想要的结果表明您希望将结束的期间2010-06-25折叠成一个开始的2010-06-26. 对于其他用例,这可能需要调整。

回答by Oleg K

Here is a solution with just three simple scans. No CTEs, no recursion, no joins, no table updates in a loop, no "group by" —?as a result, this solution should scale the best (I think). I think number of scans can be reduced to two, if min and max dates are known in advance; the logic itself just needs two scans — find gaps, applied twice.

这是一个只需三个简单扫描的解决方案。没有 CTE、没有递归、没有连接、没有循环中的表更新、没有“分组依据”——因此,这个解决方案应该是最好的(我认为)。如果提前知道最小和最大日期,我认为扫描次数可以减少到两次;逻辑本身只需要两次扫描——找到差距,应用两次。

declare @datefrom datetime, @datethru datetime

DECLARE @T TABLE (d1 DATETIME, d2 DATETIME)

INSERT INTO @T (d1, d2)

SELECT '2010-01-01','2010-03-31' 
UNION SELECT '2010-03-01','2010-06-13' 
UNION SELECT '2010-04-01','2010-05-31' 
UNION SELECT '2010-06-15','2010-06-25' 
UNION SELECT '2010-06-26','2010-07-10' 
UNION SELECT '2010-08-01','2010-08-05' 
UNION SELECT '2010-08-01','2010-08-09' 
UNION SELECT '2010-08-02','2010-08-07' 
UNION SELECT '2010-08-08','2010-08-08' 
UNION SELECT '2010-08-09','2010-08-12' 
UNION SELECT '2010-07-04','2010-08-16' 
UNION SELECT '2010-11-01','2010-12-31' 

select @datefrom = min(d1) - 1, @datethru = max(d2) + 1 from @t

SELECT 
StartDate, EndDate
FROM
(
    SELECT 
    MAX(EndDate) OVER (ORDER BY StartDate) + 1 StartDate,
    LEAD(StartDate ) OVER (ORDER BY StartDate) - 1 EndDate
    FROM
    (
        SELECT 
        StartDate, EndDate
        FROM
        (
            SELECT 
            MAX(EndDate) OVER (ORDER BY StartDate) + 1 StartDate,
            LEAD(StartDate) OVER (ORDER BY StartDate) - 1 EndDate 
            FROM 
            (
                SELECT d1 StartDate, d2 EndDate from @T 
                UNION ALL 
                SELECT @datefrom StartDate, @datefrom EndDate 
                UNION ALL 
                SELECT @datethru StartDate, @datethru EndDate
            ) T
        ) T
        WHERE StartDate <= EndDate
        UNION ALL 
        SELECT @datefrom StartDate, @datefrom EndDate 
        UNION ALL 
        SELECT @datethru StartDate, @datethru EndDate
    ) T
) T
WHERE StartDate <= EndDate

The result is:

结果是:

StartDate   EndDate
2010-01-01  2010-06-13
2010-06-15  2010-08-16
2010-11-01  2010-12-31

回答by Chao Xu

The idea is to simulate the scanning algorithm for merging intervals. My solution makes sure it works across a wide range of SQL implementations. I've tested it on MySQL, Postgres, SQL-Server 2017, SQLite and even Hive.

这个想法是模拟合并间隔的扫描算法。我的解决方案确保它适用于广泛的 SQL 实现。我已经在 MySQL、Postgres、SQL-Server 2017、SQLite 甚至 Hive 上对其进行了测试。

Assuming the table schema is the following.

假设表架构如下。

CREATE TABLE t (
  a DATETIME,
  b DATETIME
);

We also assume the interval is half-open like [a,b).

我们还假设区间是半开的,如 [a,b)。

When (a,i,j)is in the table, it shows that there are jintervals covering a, and there are iintervals covering the previous point.

(a,i,j)在表中时,表示有j个区间覆盖a,有i个区间覆盖上一点。

CREATE VIEW r AS 
SELECT a,
       Sum(d) OVER (ORDER BY a ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS i,
       Sum(d) OVER (ORDER BY a ROWS UNBOUNDED PRECEDING) AS j
FROM  (SELECT a, Sum(d) AS d
       FROM   (SELECT a,  1 AS d FROM t
               UNION ALL
               SELECT b, -1 AS d FROM t) e
       GROUP  BY a) f;

We produce all the endpoints in the union of the intervals and pair up adjacent ones. Finally, we produce the set of intervals by only picking the odd-numbered rows.

我们在区间的联​​合中生成所有端点并将相邻的端点配对。最后,我们只选择奇数行来生成区间集。

SELECT a, b
FROM (SELECT a,
             Lead(a)      OVER (ORDER BY a) AS b,
             Row_number() OVER (ORDER BY a) AS n
      FROM   r
      WHERE  j=0 OR i=0 OR i is null) e
WHERE  n%2 = 1;

I've created a sample DB-fiddleand SQL-fiddle. I also wrote a blog post on union intervals in SQL.

我创建了一个示例 DB-fiddleSQL-fiddle。我还写了一篇关于 SQL 中联合间隔博客文章

回答by pwilcox

A Geometric Approach

几何方法

Here and elsewhere I've noticed that date packing questions don't provide a geometric approach to this problem. After all, any range, date-ranges included, can be interpreted as a line. So why not convert them to a sql geometry type and utilize geometry::UnionAggregateto merge the ranges.

在这里和其他地方,我注意到日期打包问题没有提供解决这个问题的几何方法。毕竟,任何范围,包括日期范围,都可以解释为一行。那么为什么不将它们转换为 sql 几何类型并geometry::UnionAggregate用于合并范围。

Why?

为什么?

This has the advantage of handling all types of overlaps, including fully nested ranges. It also works like any other aggregate query, so it's a little more intuitive in that respect. You also get the bonus of a visual representation of your results if you care to use it. Finally, it is the approach I use for simultaneous range packing(you work with rectangles instead of lines in that case, and there are many more considerations). I just couldn't get the existing approaches to work in that scenario.

这具有处理所有类型重叠的优点,包括完全嵌套的范围。它也像任何其他聚合查询一样工作,因此在这方面更直观一些。如果您愿意使用它,您还可以获得结果的可视化表示。最后,这是我用于同时范围打包的方法(在这种情况下,您使用矩形而不是线条,还有更多注意事项)。我只是无法让现有的方法在这种情况下工作。

This has the disadvantage of requiring more recent versions of SQL Server. It also requires a numbers table and it's annoying to extract the individually produced lines from the aggregated shape. But hopefully in the future Microsoft adds a TVF that allows you to do this easily without a numbers table (or you can just build one yourself). Also, geometrical objects work with floats, so you have conversion annoyances and precision concerns to keep in mind.

这样做的缺点是需要更新版本的 SQL Server。它还需要一个数字表,从聚合形状中提取单独生成的线条很烦人。但希望将来微软会添加一个 TVF,让您无需数字表即可轻松完成此操作(或者您可以自己构建一个)。此外,几何对象与浮点数一起使用,因此您需要牢记转换烦恼和精度问题。

Performance-wise I don't know how it compares, but I've done a few things (not shown here) to make it work for me even with large datasets.

性能方面我不知道它是如何比较的,但我做了一些事情(这里没有显示)使它即使在大数据集的情况下也适合我。

Code Description

代码说明

In 'numbers':

在“数字”中:

  • I build a table representing a sequence
  • Swap it out with your favorite way to make a numbers table.
  • For a union operation, you won't ever need more rows than in your original table, so I just use it as the base to build it.
  • 我建立了一个表示序列的表
  • 用你最喜欢的方式交换它来制作数字表。
  • 对于联合操作,您永远不会需要比原始表中更多的行,所以我只是将它用作构建它的基础。

In 'mergeLines':

在“合并线”中:

  • I convert the dates to floats and use those floats to create geometrical points.
  • In this problem, we're working in 'integer space,' meaning there are no time considerations, and so an begin date in one range that is one day apart from an end date in another should be merged with that other. In order to make that merge happen, we need to convert to 'real space.', so we add 1 to the tail of all ranges (we undo this later).
  • I then connect these points via STUnion and STEnvelope.
  • Finally, I merge all these lines via UnionAggregate. The resulting 'lines' geometry object might contain multiple lines, but if they overlap, they turn into one line.
  • 我将日期转换为浮点数并使用这些浮点数来创建几何点。
  • 在这个问题中,我们在“整数空间”中工作,这意味着没有时间考虑,因此一个范围内的开始日期与另一个范围内的结束日期相隔一天应该与另一个合并。为了实现合并,我们需要转换为“真实空间”,因此我们在所有范围的尾部加 1(我们稍后撤消此操作)。
  • 然后我通过 STUnion 和 STEnvelope 连接这些点。
  • 最后,我通过 UnionAggregate 合并所有这些行。生成的“线”几何对象可能包含多条线,但如果它们重叠,它们就会变成一条线。

In the outer query:

在外部查询中:

  • I use the numbers CTE to extract the individual lines inside 'lines'.
  • I envelope the lines which here ensures that the lines are stored only as its two endpoints.
  • I read the endpoint x values and convert them back to their time representations, ensuring to put them back into 'integer space'.
  • 我使用数字 CTE 来提取“行”内的各个行。
  • 我封装了这里的行,以确保这些行仅作为它的两个端点存储。
  • 我读取端点 x 值并将它们转换回它们的时间表示,确保将它们放回“整数空间”。

The Code

编码

with 

    numbers as (

        select  row_number() over (order by (select null)) i 
        from    @t

    ),

    mergeLines as (

        select      lines = geometry::UnionAggregate(line)
        from        @t
        cross apply (select line = 
                        geometry::Point(convert(float, d1), 0, 0).STUnion(
                            geometry::Point(convert(float, d2) + 1, 0, 0)
                        ).STEnvelope()
                    ) l

    )

    select      ap.StartDate,
                ap.EndDate
    from        mergeLines ml
    join        numbers n on n.i between 1 and ml.lines.STNumGeometries()
    cross apply (select line = ml.lines.STGeometryN(i).STEnvelope()) l
    cross apply (select 
                    StartDate = convert(datetime,l.line.STPointN(1).STX),
                    EndDate = convert(datetime,l.line.STPointN(3).STX) - 1
                ) ap
    order by    ap.StartDate;

回答by Thomas

In this solution, I created a temporary Calendar table which stores a value for every day across a range. This type of table can be made static. In addition, I'm only storing 400 some odd dates starting with 2009-12-31. Obviously, if your dates span a larger range, you would need more values.

在此解决方案中,我创建了一个临时 Calendar 表,该表存储一个范围内每天的值。这种类型的表可以是静态的。此外,我只存储了 400 个从 2009-12-31 开始的奇怪日期。显然,如果您的日期跨越更大的范围,您将需要更多的值。

In addition, this solution will only work with SQL Server 2005+ in that I'm using a CTE.

此外,此解决方案仅适用于 SQL Server 2005+,因为我使用的是 CTE。

With Calendar As
    (
    Select DateAdd(d, ROW_NUMBER() OVER ( ORDER BY s1.object_id ), '1900-01-01') As [Date]
    From sys.columns as s1
        Cross Join sys.columns as s2
    )
    , StopDates As
    (
    Select C.[Date]
    From Calendar As C
        Left Join @T As T
            On C.[Date] Between T.d1 And T.d2
    Where C.[Date] >= ( Select Min(T2.d1) From @T As T2 )
        And C.[Date] <= ( Select Max(T2.d2) From @T As T2 )
        And T.d1 Is Null
    )
    , StopDatesInUse As
    (
    Select D1.[Date]
    From StopDates As D1
        Left Join StopDates As D2
            On D1.[Date] = DateAdd(d,1,D2.Date)
    Where D2.[Date] Is Null
    )
    , DataWithEariestStopDate As 
    (
    Select *
    , (Select Min(SD2.[Date])
        From StopDatesInUse As SD2
        Where T.d2 < SD2.[Date] ) As StopDate
    From @T As T
    )
Select Min(d1), Max(d2)
From DataWithEariestStopDate
Group By StopDate
Order By Min(d1)

EDITThe problem with using dates in 2009 has nothing to do with the final query. The problem is that the Calendar table is not big enough. I started the Calendar table at 2009-12-31. I have revised it start at 1900-01-01.

编辑在 2009 年使用日期的问题与最终查询无关。问题是日历表不够大。我在 2009-12-31 开始了日历表。我从 1900-01-01 开始修改它。

回答by Chris Bednarski

Try this

尝试这个

;WITH T1 AS
(
    SELECT d1, d2, ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS R
    FROM @T
), NUMS AS
(
    SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS R
    FROM T1 A
    CROSS JOIN T1 B
    CROSS JOIN T1 C
), ONERANGE AS 
(
    SELECT DISTINCT DATEADD(DAY, ROW_NUMBER() OVER(PARTITION BY T1.R ORDER BY (SELECT 0)) - 1, T1.D1) AS ELEMENT
    FROM T1
    CROSS JOIN NUMS
    WHERE NUMS.R <= DATEDIFF(DAY, d1, d2) + 1
), SEQUENCE AS
(
    SELECT ELEMENT, DATEDIFF(DAY, '19000101', ELEMENT) - ROW_NUMBER() OVER(ORDER BY ELEMENT) AS rownum
    FROM ONERANGE
)
SELECT MIN(ELEMENT) AS StartDate, MAX(ELEMENT) as EndDate
FROM SEQUENCE
GROUP BY rownum

The basic idea is to first unroll the existing data, so you get a separate row for each day. This is done in ONERANGE

基本思想是首先展开现有数据,这样您每天都会得到一个单独的行。这是在ONERANGE中完成的

Then, identify the relationship between how dates increment and the way the row numbers do. The difference remains constant within an existing range/island. As soon as you get to a new data island, the difference between them increases because the date increments by more than 1, while the row number increments by 1.

然后,确定日期递增方式与行号方式之间的关系。差异在现有范围/岛屿内保持不变。一旦到达一个新的数据岛,它们之间的差异就会增加,因为日期增加超过 1,而行号增加 1。