仅选择 SQL 中列中重复值的第一行

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/8683547/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-01 13:47:02  来源:igfitidea点击:

Only select first row of repeating value in a column in SQL

sqlsqlitems-accessduplicates

提问by RYN

I have table that has a column that may have same values in a burst. Like this:

我有一个表,其中有一列可能在突发中具有相同的值。像这样:

+----+---------+
| id |   Col1  | 
+----+---------+
| 1  | 6050000 |
+----+---------+
| 2  | 6050000 |
+----+---------+
| 3  | 6050000 |
+----+---------+
| 4  | 6060000 |
+----+---------+
| 5  | 6060000 |
+----+---------+
| 6  | 6060000 |
+----+---------+
| 7  | 6060000 |
+----+---------+
| 8  | 6060000 |
+----+---------+
| 9  | 6050000 |
+----+---------+
| 10 | 6000000 |
+----+---------+
| 11 | 6000000 |
+----+---------+

Now I want to prune rows where the value of Col1is repeated and only select the first occurrence.
For the above table the result should be:

现在我想修剪Col1重复值的行,只选择第一次出现的行。
对于上表,结果应该是:

+----+---------+
| id |   Col1  | 
+----+---------+
| 1  | 6050000 |
+----+---------+
| 4  | 6060000 |
+----+---------+
| 9  | 6050000 |
+----+---------+
| 10 | 6000000 |
+----+---------+

How can I do this in SQL?
Note that only burst rows should be removed and values can be repeated in non-burst rows!id=1& id=9are repeated in sample result.

如何在 SQL 中执行此操作?
请注意,只应删除突发行,并且可以在非突发行中重复值!id=1&id=9在样本结果中重复。

EDIT:
I achieved it using this:

编辑:
我用这个实现了它:

select id,col1 from data as d1
where not exists (
    Select id from data as d2
    where d2.id=d1.id-1 and d1.col1=d2.col1 order by id limit 1)

But this only works when ids are sequential. With gaps between ids (deleted ones) the query breaks. How can I fix this?

但这仅在 id 是连续的时才有效。由于 id(已删除的)之间的差距,查询会中断。我怎样才能解决这个问题?

采纳答案by Erwin Brandstetter

You can use a EXISTSsemi-join to identify candidates:

您可以使用EXISTS半连接来确定候选人:

Select wanted rows:

选择想要的行:

SELECT * FROM tbl
WHERE NOT EXISTS (
    SELECT *
    FROM tbl t
    WHERE t.col1 = tbl.col1
    AND t.id = tbl.id - 1
    )
ORDER BY id

Get rid of unwanted rows:

摆脱不需要的行:

DELETE FROM tbl
-- SELECT * FROM tbl
WHERE EXISTS (
    SELECT *
    FROM   tbl t
    WHERE  t.col1 = tbl.col1
    AND    t.id   = tbl.id - 1
    )

This effectively deletes every row, where the preceding row has the same value in col1, thereby arriving at your set goal: only the first row of every burst survives.

这有效地删除了每一行,其中前一行在 中具有相同的值col1,从而达到您设定的目标:只有每个突发的第一行幸存下来。

I left the commented SELECTstatement because you should alwayscheck what is going to be deleted before you do the deed.

我留下了评论SELECT声明,因为您应该始终在执行操作之前检查将要删除的内容。



Solution for non-sequential IDs:

非连续 ID 的解决方案:

If your RDBMS supports the CTEand window functions(like PostgreSQL, Oracle, SQL Server, ... but notSQLite, MS Access or MySQL), there is an elegant way:

如果您的 RDBMS 支持CTE窗口函数(如 PostgreSQL、Oracle、SQL Server...但不支持SQLite、MS Access 或 MySQL),则有一种优雅的方法:

WITH x AS (
    SELECT *, row_number() OVER (ORDER BY id) AS rn
    FROM tbl
    )
SELECT id, col1
FROM   x
WHERE NOT EXISTS (
    SELECT *
    FROM   x x1
    WHERE  x1.col1 = x.col1
    AND    x1.rn   = x.rn - 1
    )
ORDER BY id;

There is also the not-so-elegant way that does the job without those niceties.
Should work for you:

还有一种不太优雅的方式可以在没有这些细节的情况下完成工作。
应该为你工作:

SELECT id, col1
FROM   tbl
WHERE (
    SELECT t.col1 = tbl.col1
    FROM   tbl AS t
    WHERE  t.id < tbl.id
    ORDER  BY id DESC
    LIMIT  1) IS NOT TRUE
ORDER BY id


Tool for test-casing non-sequential IDs

用于测试用例非序列 ID 的工具

(Tested in PostgreSQL)

(在 PostgreSQL 中测试)

CREATE TEMP TABLE tbl (id int, col1 int);
INSERT INTO tbl VALUES
 (1,6050000),(2,6050000),(6,6050000)
,(14,6060000),(15,6060000),(16,6060000)
,(17,6060000),(18,6060000),(19,6050000)
,(20,6000000),(111,6000000);

回答by The Pickle

select min(id), Col1 from tableName group by Col1 

回答by Rob Paller

If your RDBMS supports Window Aggregate functions and/or LEAD() and LAG() functions you can leverage them to accomplish what you are trying to report. The following SQL will help get you started down the right path:

如果您的 RDBMS 支持 Window Aggregate 函数和/或 LEAD() 和 LAG() 函数,您可以利用它们来完成您要报告的内容。以下 SQL 将帮助您走上正确的道路:

SELECT id
     , Col AS CurCol
     , MAX(Col)
       OVER(ORDER BY id ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS PrevCol
     , MIN(COL)
       OVER(ORDER BY id ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING) AS NextCol
FROM MyTable

From there you can put that SQL in a derived table with some CASE logic that if the NextColor PrevColis the same as CurColthen set CurCol = NULL. Then you can collapse eliminate all the id records CurCol IS NULL.

从那里,您可以将该 SQL 放入带有一些 CASE 逻辑的派生表中,如果NextColorPrevColCurColthen set相同CurCol = NULL。然后你可以折叠消除所有的 id 记录CurCol IS NULL

If you don't have the ability to use window aggregates or LEAD/LAG functions your task is a little more complex.

如果您无法使用窗口聚合或 LEAD/LAG 函数,您的任务会稍微复杂一些。

Hope this helps.

希望这可以帮助。

回答by Andriy M

Since idis always sequential, with no gaps or repetitions, as per your comment, you could use the following method:

由于id始终是连续的,没有间隙或重复,根据您的评论,您可以使用以下方法:

SELECT t1.*
FROM atable t1
  LEFT JOIN atable t2 ON t1.id = t2.id + 1 AND t1.Col1 = t2.Col1
WHERE t2.id IS NULL

The table is (outer-)joined to itself on the condition that the left side's idis one greater than the right side's and their Col1values are identical. In other words, the condition is ‘the previous row contains the same Col1value as the current row'. If there's no match on the right, then the current record should be selected.

在左边的id比右边大一并且它们的Col1值相同的条件下,该表是(外部)连接到自身的。换句话说,条件是'前一行包含与Col1当前行相同的值'。如果右侧没有匹配项,则应选择当前记录。



UPDATE

更新

To account for non-sequential ids (which, however, are assumed to be unique and defining the order of changes of Col1), you could also try the following query:

考虑到非顺序的ids(但是,假定它是唯一的并定义了 的更改顺序Col1),您还可以尝试以下查询:

SELECT t1.*
FROM atable t1
  LEFT JOIN atable t2 ON t1.id > t2.id
  LEFT JOIN atable t3 ON t1.id > t3.id AND t3.id > t2.id
WHERE t3.id IS NULL
  AND (t2.id IS NULL OR t2.Col1 <> t1.Col1)

The third self-join is there to ensure that the second one yields the row directly preceding that of t1. That is, if there's no match for t3, then either t2contains the preceding row or it's got no match either, the latter meaning that t1's current row is the top one.

第三个自联接是为了确保第二个自联接产生直接在 的行之前的行t1。也就是说,如果没有匹配t3,则要么t2包含前一行,要么也没有匹配项,后者意味着t1的当前行是最前面的行。