oracle 按日期范围折叠重复值的 SQL 查询

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/702500/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-18 18:00:29  来源:igfitidea点击:

SQL Query to Collapse Duplicate Values By Date Range

sqloracle

提问by questioner

I have a table with the following structure: ID, Month, Year, Value with values for one entry per id per month, most months have the same value.

我有一个具有以下结构的表:ID、月、年、值,每个 id 每月有一个条目的值,大多数月份具有相同的值。

I would like to create a view for that table that collapses the same values like this: ID, Start Month, End Month, Start Year, End Year, Value, with one row per ID per value.

我想为该表创建一个视图,折叠相同的值,如下所示:ID、开始月份、结束月份、开始年份、结束年份、值,每个值每个 ID 一行。

The catch is that if a value changes and then goes back to the original, it should have two rows in the table

问题是如果一个值改变然后又回到原来的状态,它应该在表中有两行

So:

所以:

  • 100 1 2008 80
  • 100 2 2008 80
  • 100 3 2008 90
  • 100 4 2008 80
  • 100 1 2008 80
  • 100 2 2008 80
  • 100 3 2008 90
  • 100 4 2008 80

should produce

应该产生

  • 100 1 2008 2 2008 80
  • 100 3 2008 3 2008 90
  • 100 4 2008 4 2008 80
  • 100 1 2008 2 2008 80
  • 100 3 2008 3 2008 90
  • 100 4 2008 4 2008 80

The following query works for everything besides this special case, when the value returns to the original.

当值返回到原始值时,以下查询适用于除此特殊情况之外的所有内容。

select distinct id, min(month) keep (dense_rank first order by month) 
over (partition   by id, value) startMonth, 
max(month) keep (dense_rank first order by month desc) over (partition
by id, value) endMonth, 
value

Database is Oracle

数据库是 Oracle

采纳答案by questioner

I got it to work as follows. It is heavy on analytic functions and is Oracle specific.

我让它按如下方式工作。它对分析功能很重要,并且是特定于 Oracle 的。

select distinct id, value,
decode(startMonth, null,
  lag(startMonth) over(partition by id, value order by startMonth, endMonth),  --if start is null, it's an end so take from the row before
startMonth) startMonth,

  decode(endMonth, null,
  lead(endMonth) over(partition by id, value order by startMonth, endMonth),  --if end is null, it's an start so take from the row after
endMonth) endMonth    

from (
select id, value, startMonth, endMonth from(
select id, value, 
decode(month+1, lead(month) over(partition by id,value order by month), null, month)     
startMonth, --get the beginning month for each interval
decode(month-1, lag(month) over(partition by id,value order by month), null, month)     
endMonth --get the end month for each interval from Tbl
) a 
where startMonth is not null or endMonth is not null --remain with start and ends only
)b

It might be possible to simplify some of the inner queries somewhat

可以稍微简化一些内部查询

The inner query checks if the month is a first/last month of the interval as follows: if the month + 1 == the next month (lag) for that grouping, then since there is a next month, this month is obviously not the end month. Otherwise, it isthe last month of the interval. The same concept is used to check for the first month.

内部查询检查该月份是否为区间的第一个月/最后一个月,如下所示:如果该分组的月份 + 1 == 下个月(滞后),那么由于有下个月,这个月显然不是月底。否则,它间隔的最后一个月。相同的概念用于检查第一个月。

The outer query first filters out all rows that are not either start or end months (where startMonth is not null or endMonth is not null). Then, each row is either a start month or an end month (or both), determined by whether start or end is not null). If the month is a start month, get the corresponding end month by getting the next (lead) endMonth for that id,value ordered by endMonth, and if it is an endMonth get the startMonth by looking for the previous startMonth (lag)

外部查询首先过滤掉不是开始或结束月份 ( where startMonth is not null or endMonth is not null) 的所有行。然后,每一行是开始月份或结束月份(或两者),由开始或结束是否为空来确定)。如果该月份是开始月份,则通过获取该 id 的下一个(前导)endMonth 来获取相应的结束月份,值按 endMonth 排序,如果是 endMonth,则通过查找上一个 startMonth(滞后)来获取 startMonth

回答by tpdi

I'm going to develop my solution incrementally, decomposing each transformation into a view. This both helps explain what's being done, and helps in debugging and testing. It's essentially applying the principle of functional decomposition to database queries.

我将逐步开发我的解决方案,将每个转换分解为一个视图。这既有助于解释正在做什么,也有助于调试和测试。它本质上是将功能分解的原理应用于数据库查询。

I'm also going to do it without using Oracle extensions, with SQL that ought to run on any modern RBDMS. So no keep, over, partition, just subqueries and group bys. (Inform me in the comments if it doesn't work on your RDBMS.)

我还将在不使用 Oracle 扩展的情况下使用应该在任何现代 RBDMS 上运行的 SQL。所以没有保留,结束,分区,只是子查询和分组。(如果它不适用于您的 RDBMS,请在评论中通知我。)

First, the table, which since I'm uncreative, I'll call month_value. Since the id is not actually a unique id, I'll call it "eid". The other columns are "m"onth, "y"ear, and "v"alue:

首先是表,因为我没有创造力,所以我将其称为月值。由于 id 实际上不是唯一的 id,我将其称为“eid”。其他列是“m”onth、“y”ear 和“v”alue:

create table month_value( 
   eid int not null, m int, y int,  v int );

After inserting the data, for two eids, I have:

插入数据后,对于两个 eid,我有:

> select * from month_value;
+-----+------+------+------+
| eid | m    | y    | v    |
+-----+------+------+------+
| 100 |    1 | 2008 |   80 |
| 100 |    2 | 2008 |   80 |
| 100 |    3 | 2008 |   90 |
| 100 |    4 | 2008 |   80 |
| 200 |    1 | 2008 |   80 |
| 200 |    2 | 2008 |   80 |
| 200 |    3 | 2008 |   90 |
| 200 |    4 | 2008 |   80 |
+-----+------+------+------+
8 rows in set (0.00 sec)

Next, we have one entity, the month, that's represented as two variables. That should really be one column (either a date or a datetime, or maybe even a foreign key to a table of dates), so we'll make it one column. We'll do that as a linear transform, such that it sorts the same as (y, m), and such that for any (y,m) tuple there is one and only value, and all values are consecutive:

接下来,我们有一个实体,即月份,它表示为两个变量。那真的应该是一列(日期或日期时间,或者甚至是日期表的外键),因此我们将其设为一列。我们将其作为一个线性变换来进行,这样它的排序与 (y, m) 相同,并且对于任何 (y, m) 元组都有一个且唯一的值,并且所有值都是连续的:

> create view cm_abs_month as 
select *, y * 12 + m as am from month_value;

That gives us:

这给了我们:

> select * from cm_abs_month;
+-----+------+------+------+-------+
| eid | m    | y    | v    | am    |
+-----+------+------+------+-------+
| 100 |    1 | 2008 |   80 | 24097 |
| 100 |    2 | 2008 |   80 | 24098 |
| 100 |    3 | 2008 |   90 | 24099 |
| 100 |    4 | 2008 |   80 | 24100 |
| 200 |    1 | 2008 |   80 | 24097 |
| 200 |    2 | 2008 |   80 | 24098 |
| 200 |    3 | 2008 |   90 | 24099 |
| 200 |    4 | 2008 |   80 | 24100 |
+-----+------+------+------+-------+
8 rows in set (0.00 sec)

Now we'll use a self-join in a correlated subquery to find, for each row, the earliest successor month in which the value changes. We'll base this view on the previous view we created:

现在,我们将在相关子查询中使用自联接来为每一行查找值更改的最早后继月份。我们将此视图基于我们创建的先前视图:

> create view cm_last_am as 
   select a.*, 
    ( select min(b.am) from cm_abs_month b 
      where b.eid = a.eid and b.am > a.am and b.v <> a.v) 
   as last_am 
   from cm_abs_month a;

> select * from cm_last_am;
+-----+------+------+------+-------+---------+
| eid | m    | y    | v    | am    | last_am |
+-----+------+------+------+-------+---------+
| 100 |    1 | 2008 |   80 | 24097 |   24099 |
| 100 |    2 | 2008 |   80 | 24098 |   24099 |
| 100 |    3 | 2008 |   90 | 24099 |   24100 |
| 100 |    4 | 2008 |   80 | 24100 |    NULL |
| 200 |    1 | 2008 |   80 | 24097 |   24099 |
| 200 |    2 | 2008 |   80 | 24098 |   24099 |
| 200 |    3 | 2008 |   90 | 24099 |   24100 |
| 200 |    4 | 2008 |   80 | 24100 |    NULL |
+-----+------+------+------+-------+---------+
8 rows in set (0.01 sec)

last_am is now the "absolute month" of the first (earliest) month (after the month of the current row) in which the value, v, changes. It's null where there is no later month, for that eid, in the table.

last_am 现在是值 v 发生变化的第一个(最早)月份(在当前行的月份之后)的“绝对月份”。如果表中没有该 eid 的下个月,则为 null。

Since last_am is the same for all months leading up to the change in v (which occurs at last_am), we can group on last_am and v (and eid, of course), and in any group, the min(am) is the absolute month of the firstconsecutive month that had that value:

由于 last_am 在导致 v 变化(发生在 last_am)之前的所有月份都是相同的,我们可以对 last_am 和 v(当然还有 eid)进行分组,并且在任何组中,min(am) 是绝对值该月第一次连续一个月即有这样的价值:

> create view cm_result_data as 
  select eid, min(am) as am , last_am, v 
  from cm_last_am group by eid, last_am, v;

> select * from cm_result_data;
+-----+-------+---------+------+
| eid | am    | last_am | v    |
+-----+-------+---------+------+
| 100 | 24100 |    NULL |   80 |
| 100 | 24097 |   24099 |   80 |
| 100 | 24099 |   24100 |   90 |
| 200 | 24100 |    NULL |   80 |
| 200 | 24097 |   24099 |   80 |
| 200 | 24099 |   24100 |   90 |
+-----+-------+---------+------+
6 rows in set (0.00 sec)

Now this is the result set we want, which is why this view is called cm_result_data. All that's lacking is something to transform absolute months back to (y,m) tuples.

现在这就是我们想要的结果集,这就是为什么这个视图被称为 cm_result_data。所缺少的只是将绝对月份转换回 (y,m) 元组。

To do that, we'll just join to the table month_value.

为此,我们只需加入表month_value。

There are only two problems: 1) we want the month beforelast_am in our output, and 2) we have nulls where there is no next month in our data; to met the OP's specification, those should be single month ranges.

只有两个问题:1)我们希望输出中 last_am之前的月份,以及 2)我们的数据中有空值,其中没有下个月;为了满足 OP 的规范,这些应该是单月范围。

EDIT: These could actually be longer ranges than one month, but in every case they mean we need to find the latest month for the eid, which is:

编辑:这些实际上可能比一个月更长的范围,但在每种情况下,它们都意味着我们需要找到 eid 的最新月份,即:

(select max(am) from cm_abs_month d where d.eid = a.eid )

Because the views decompose the problem, we could add in this "end cap" month earlier, by adding another view, but I'll just insert this into the coalesce. Which would be most efficient depends on how your RDBMS optimizes queries.

因为视图分解了问题,我们可以在一个月前通过添加另一个视图来添加这个“结束帽”,但我只是将它插入到合并中。哪种最有效取决于您的 RDBMS 如何优化查询。

To get month before, we'll join (cm_result_data.last_am - 1 = cm_abs_month.am)

要获得前一个月,我们将加入 (cm_result_data.last_am - 1 = cm_abs_month.am)

Wherever we have a null, the OP wants the "to" month to be the same as the "from" month, so we'll just use coalesce on that: coalesce( last_am, am). Since last eliminates any nulls, our joins don't need to be outer joins.

无论我们在哪里有一个空值,OP 都希望“到”月与“从”月相同,所以我们将只使用合并:coalesce( last_am, am)。由于 last 消除了任何空值,我们的连接不需要是外部连接。

> select a.eid, b.m, b.y, c.m, c.y, a.v 
   from cm_result_data a 
    join cm_abs_month b 
      on ( a.eid = b.eid and a.am = b.am)  
    join cm_abs_month c 
      on ( a.eid = c.eid and 
      coalesce( a.last_am - 1, 
              (select max(am) from cm_abs_month d where d.eid = a.eid )
      ) = c.am)
    order by 1, 3, 2, 5, 4;
+-----+------+------+------+------+------+
| eid | m    | y    | m    | y    | v    |
+-----+------+------+------+------+------+
| 100 |    1 | 2008 |    2 | 2008 |   80 |
| 100 |    3 | 2008 |    3 | 2008 |   90 |
| 100 |    4 | 2008 |    4 | 2008 |   80 |
| 200 |    1 | 2008 |    2 | 2008 |   80 |
| 200 |    3 | 2008 |    3 | 2008 |   90 |
| 200 |    4 | 2008 |    4 | 2008 |   80 |
+-----+------+------+------+------+------+

By joining back we get the output the OP wants.

通过加入,我们得到了 OP 想要的输出。

Not that we have to join back. As it happens, our absolute_month function is bi-directional, so we can just recalculate the year and offset month from it.

并不是说我们必须重新加入。碰巧的是,我们的 absolute_month 函数是双向的,因此我们可以重新计算年份并从中抵消月份。

First, lets take care of adding the "end cap" month:

首先,让我们注意添加“结束日期”月份:

> create or replace view cm_capped_result as 
select eid, am, 
  coalesce( 
   last_am - 1, 
   (select max(b.am) from cm_abs_month b where b.eid = a.eid)
  ) as last_am, v  
 from cm_result_data a;

And now we get the data, formatted per the OP:

现在我们得到了按照 OP 格式化的数据:

select eid, 
 ( (am - 1) % 12 ) + 1 as sm, 
 floor( ( am - 1 ) / 12 ) as sy, 
 ( (last_am - 1) % 12 ) + 1 as em, 
 floor( ( last_am - 1 ) / 12 ) as ey, v    
from cm_capped_result 
order by 1, 3, 2, 5, 4;

+-----+------+------+------+------+------+
| eid | sm   | sy   | em   | ey   | v    |
+-----+------+------+------+------+------+
| 100 |    1 | 2008 |    2 | 2008 |   80 |
| 100 |    3 | 2008 |    3 | 2008 |   90 |
| 100 |    4 | 2008 |    4 | 2008 |   80 |
| 200 |    1 | 2008 |    2 | 2008 |   80 |
| 200 |    3 | 2008 |    3 | 2008 |   90 |
| 200 |    4 | 2008 |    4 | 2008 |   80 |
+-----+------+------+------+------+------+

And there's the data the OP wants. All in SQL that should run on any RDBMS, and is decomposed into simple, easy to understand and easy to test views.

还有 OP 想要的数据。所有应该在任何 RDBMS 上运行的 SQL 中的所有内容都被分解为简单、易于理解和易于测试的视图。

Is is better to rejoin or to recalculate? I'll leave that (it's a trick question) to the reader.

重新加入好还是重新计算好?我会把那个(这是一个棘手的问题)留给读者。

(If your RDBMS doesn't allow group bys in views, you'll have to join first and then group, or group and then pull in the month and year with correlated subqueries. This is left as an exercise for the reader.)

(如果您的 RDBMS 不允许在视图中分组,则您必须先加入,然后再分组,或者分组,然后使用相关子查询提取月份和年份。这留给读者作为练习。)



Jonathan Leffler asks in the comments,

乔纳森莱夫勒在评论中问道,

What happens with your query if there are gaps in the data (say there's an entry for 2007-12 with value 80, and another for 2007-10, but not one for 2007-11? The question isn't clear what should happen there.

如果数据中存在差距,您的查询会发生什么情况(假设 2007-12 有一个值为 80 的条目,另一个是 2007-10 的条目,但 2007-11 没有一个条目?问题不清楚那里会发生什么.

Well, you're exactly right, the OP doesn't specify. Perhaps there's an (unmentioned) pre-condition that there are no gaps. In the absence of a requirement, we shouldn't try to code around something that might not be there. But, the fact is, gaps make the "joining back" strategy fail; the "recalculate" strategy doesn't fail under those conditions. I'd say more, but that would reveal the trick in the trick question I alluded to above.

好吧,您说得对,OP 没有指定。也许有一个(未提及的)先决条件,即没有差距。在没有需求的情况下,我们不应该尝试围绕可能不存在的东西编码。但是,事实是,差距使“回力”策略失败;在这些条件下,“重新计算”策略不会失败。我会说更多,但这会揭示我上面提到的技巧问题中的技巧。

回答by Rob van Wijk

This one uses only one table scan and works across years. It's better though to model your month and year column as only one date datatype column:

这个只使用一次表扫描并且跨年工作。最好将月份和年份列建模为只有一个日期数据类型列:

SQL> create table tbl (id,month,year,value)
  2  as
  3  select 100,12,2007,80 from dual union all
  4  select 100,1,2008,80 from dual union all
  5  select 100,2,2008,80 from dual union all
  6  select 100,3,2008,90 from dual union all
  7  select 100,4,2008,80 from dual union all
  8  select 200,12,2007,50 from dual union all
  9  select 200,1,2008,50 from dual union all
 10  select 200,2,2008,40 from dual union all
 11  select 200,3,2008,50 from dual union all
 12  select 200,4,2008,50 from dual union all
 13  select 200,5,2008,50 from dual
 14  /

Tabel is aangemaakt.

SQL> select id
  2       , mod(min(year*12+month-1),12)+1 startmonth
  3       , trunc(min(year*12+month-1)/12) startyear
  4       , mod(max(year*12+month-1),12)+1 endmonth
  5       , trunc(max(year*12+month-1)/12) endyear
  6       , value
  7    from ( select id
  8                , month
  9                , year
 10                , value
 11                , max(rn) over (partition by id order by year,month) maxrn
 12             from ( select id
 13                         , month
 14                         , year
 15                         , value
 16                         , case lag(value) over (partition by id order by year,month)
 17                           when value then null
 18                           else rownum
 19                           end rn
 20                      from tbl
 21                  ) inner
 22         )
 23   group by id
 24       , maxrn
 25       , value
 26   order by id
 27       , startyear
 28       , startmonth
 29  /

        ID STARTMONTH  STARTYEAR   ENDMONTH    ENDYEAR      VALUE
---------- ---------- ---------- ---------- ---------- ----------
       100         12       2007          2       2008         80
       100          3       2008          3       2008         90
       100          4       2008          4       2008         80
       200         12       2007          1       2008         50
       200          2       2008          2       2008         40
       200          3       2008          5       2008         50

6 rijen zijn geselecteerd.

Regards, Rob.

问候,罗伯。

回答by Rob van Wijk

I couldn't get the response from ngz to work when the input table contains multiple ids and date ranges that span years. I have a solution that does work, but with qualifications. It will only give you the correct answers if you know that you have a row for every month/year/id combination within the range. If there are "holes" it won't work. If you have holes, I know of know good way to do it other than writing some PL/SQL and using a cursor loop to create a new table in the format you want.

当输入表包含多个跨年的 ID 和日期范围时,我无法从 ngz 获得响应。我有一个有效的解决方案,但有资格。只有当您知道该范围内的每个月/年/id 组合都有一行时,它才会给您正确的答案。如果有“漏洞”,它就行不通。如果您有漏洞,我知道除了编写一些 PL/SQL 和使用游标循环以您想要的格式创建新表之外,还知道有什么好方法可以做到这一点。

By the way, this is why data modeled this way is an abomination. You should always store stuff as start/from range records, not as discrete time period records. It's trivial to transform the former into the latter with a "multiplier" table, but it's almost impossible (as you've seen) to go the other direction.

顺便说一下,这就是为什么以这种方式建模的数据令人厌恶。您应该始终将内容存储为开始/范围记录,而不是离散时间段记录。使用“乘数”表将前者转换为后者是微不足道的,但几乎不可能(如您所见)转向另一个方向。

SELECT ID
     , VALUE
     , start_date
     , end_date
  FROM (SELECT ID
             , VALUE
             , start_date
             , CASE
                  WHEN is_last = 0
                     THEN LEAD(end_date) OVER(PARTITION BY ID ORDER BY start_date)
                  ELSE end_date
               END end_date
             , is_first
          FROM (SELECT ID
                     , VALUE
                     , TO_CHAR(the_date, 'YYYY.MM') start_date
                     , TO_CHAR(NVL(LEAD(the_date - 31) OVER(PARTITION BY ID ORDER BY YEAR
                                  , MONTH), the_date), 'YYYY.MM') end_date
                     , is_first
                     , is_last
                  FROM (SELECT ID
                             , YEAR
                             , MONTH
                             , TO_DATE(TO_CHAR(YEAR) || '.' || TO_CHAR(MONTH) || '.' || '15', 'YYYY.MM.DD') the_date
                             , VALUE
                             , ABS(SIGN(VALUE -(NVL(LAG(VALUE) OVER(PARTITION BY ID ORDER BY YEAR
                                                   , MONTH), VALUE - 1)))) is_first
                             , ABS(SIGN(VALUE -(NVL(LEAD(VALUE) OVER(PARTITION BY ID ORDER BY YEAR
                                                   , MONTH), VALUE - 1)))) is_last
                          FROM test_table)
                 WHERE is_first = 1
                    OR is_last = 1))
 WHERE is_first = 1