SQL: How to find duplicate records in PostgreSQL
Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it under the same license, but you must attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/28156795/
Asked by John
I have a PostgreSQL database table called "user_links" which currently allows duplicate values in the following fields:
year, user_id, sid, cid
The only unique constraint is currently on the first field, called "id". However, I am now looking to add a constraint to make sure that the combination of year, user_id, sid and cid is unique, but I cannot apply the constraint because duplicate values that violate it already exist.
Is there a way to find all duplicates?
Answered by Marcin Zablocki
The basic idea is to use a nested query with a count aggregate:
select * from yourTable ou
where (select count(*) from yourTable inr
where inr.sid = ou.sid) > 1
You can adjust the where clause in the inner query to narrow the search.
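As a rough sketch of that adjustment, the inner query can compare all four columns from the question instead of sid alone. The table name user_links and the column names come from the question; the aliases are carried over from the query above.

```sql
-- Rows of user_links whose (year, user_id, sid, cid) combination
-- occurs more than once. Assumes none of the four columns is NULL,
-- because "=" never matches NULL values.
SELECT *
FROM user_links ou
WHERE (SELECT count(*)
       FROM user_links inr
       WHERE inr.year = ou.year
         AND inr.user_id = ou.user_id
         AND inr.sid = ou.sid
         AND inr.cid = ou.cid) > 1;
```

This returns every member of each duplicate group, including the row you may eventually want to keep.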
Another good solution was mentioned in the comments (but not everyone reads them):
select Column1, Column2, count(*)
from yourTable
group by Column1, Column2
HAVING count(*) > 1
Or shorter:
SELECT (yourTable.*)::text, count(*)
FROM yourTable
GROUP BY yourTable.*
HAVING count(*) > 1
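For the table in the question, the GROUP BY form might look like the sketch below. It assumes the user_links table and its four columns as described by the asker; the copies alias is only an illustrative name.

```sql
-- One output row per duplicated (year, user_id, sid, cid) combination,
-- with the number of rows that share it.
SELECT year, user_id, sid, cid, count(*) AS copies
FROM user_links
GROUP BY year, user_id, sid, cid
HAVING count(*) > 1
ORDER BY copies DESC;
```

Because the question's table has a unique id column, grouping by the whole row (the shorter variant) would probably find nothing there; grouping by the four columns explicitly is the form that matches the intended constraint.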
Answered by alexkovelsky
From "Find duplicate rows with PostgreSQL", here's a smart solution:
select * from (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY id asc) AS Row
    FROM tbl
) dups
where dups.Row > 1
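Adapted to the question, a minimal sketch could partition by the four columns that should become unique. It assumes the user_links table from the question and an id column that can be used for ordering; rn is just an illustrative alias.

```sql
-- Within each (year, user_id, sid, cid) group, number the rows by id.
-- Row 1 is the first occurrence; every row with rn > 1 is a redundant copy.
SELECT id
FROM (
    SELECT id,
           row_number() OVER (
               PARTITION BY year, user_id, sid, cid
               ORDER BY id
           ) AS rn
    FROM user_links
) AS dups
WHERE rn > 1;
```

Unlike the count-based queries, this lists only the extra rows, so the result can serve directly as a list of candidates for deletion.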
Answered by pwnyexpress
You can join the table to itself on the fields that would be duplicated and then exclude matches on the id field, so that a record is never paired with itself. Select the id field from the first table alias (tn1) and then use the array_agg function on the id field of the second table alias. Finally, for the array_agg function to work properly, group the results by the tn1.id field. This produces a result set that contains the id of each record and an array of all the ids that fit the join conditions.
select tn1.id,
       array_agg(tn2.id) as duplicate_entries
from table_name tn1 join table_name tn2 on
     tn1.year = tn2.year
     and tn1.sid = tn2.sid
     and tn1.user_id = tn2.user_id
     and tn1.cid = tn2.cid
     and tn1.id <> tn2.id
group by tn1.id;
Obviously, ids that appear in the duplicate_entries array for one id will also have their own entries in the result set. You will have to use this result set to decide which id should become the source of 'truth', that is, the one record that shouldn't get deleted. Maybe you could do something like this:
with dupe_set as (
    select tn1.id,
           array_agg(tn2.id) as duplicate_entries
    from table_name tn1 join table_name tn2 on
         tn1.year = tn2.year
         and tn1.sid = tn2.sid
         and tn1.user_id = tn2.user_id
         and tn1.cid = tn2.cid
         and tn1.id <> tn2.id
    group by tn1.id
    order by tn1.id asc)
select ds.id from dupe_set ds where not exists
    (select de from unnest(ds.duplicate_entries) as de where de < ds.id)
This selects the lowest-numbered ids that have duplicates (assuming the id is an increasing integer primary key). These are the ids that you would keep.
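If only the id to keep and the ids of its copies are needed, a grouped query may be a simpler alternative to the self-join. This is a different technique from the one in this answer, sketched under the assumption that the table is the user_links table from the question and that the lowest id is the one to keep; keep_id and all_ids are illustrative aliases.

```sql
-- One row per duplicate group: the lowest id (the record to keep)
-- and an ordered array of every id in that group, keeper included.
SELECT min(id) AS keep_id,
       array_agg(id ORDER BY id) AS all_ids
FROM user_links
GROUP BY year, user_id, sid, cid
HAVING count(*) > 1;
```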
Answered by George Siggouroglou
To keep things simple, I assume that you wish to apply a unique constraint only to the column year and that the primary key is a column named id.
To find duplicate values, run:
SELECT year, COUNT(id)
FROM YOUR_TABLE
GROUP BY year
HAVING COUNT(id) > 1
ORDER BY COUNT(id);
The SQL statement above returns every year that occurs more than once in your table, together with its count. To delete all the duplicates except the latest entry (the one with the highest id), use the following SQL statement:
DELETE
FROM YOUR_TABLE A USING YOUR_TABLE B
WHERE A.year = B.year AND A.id < B.id;
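Extended to the four columns from the question, the cleanup and the new constraint could be combined roughly as follows. This is a sketch, not a tested migration: it assumes the user_links table from the question, that keeping the row with the highest id is acceptable, and that none of the four columns contains NULL. The constraint name is hypothetical.

```sql
BEGIN;

-- Delete every row for which a newer row (higher id) exists with the
-- same (year, user_id, sid, cid) combination.
DELETE FROM user_links a
USING user_links b
WHERE a.year = b.year
  AND a.user_id = b.user_id
  AND a.sid = b.sid
  AND a.cid = b.cid
  AND a.id < b.id;

-- With the duplicates gone, the constraint from the question can be added.
-- The constraint name below is made up for illustration.
ALTER TABLE user_links
    ADD CONSTRAINT user_links_year_user_id_sid_cid_key
    UNIQUE (year, user_id, sid, cid);

COMMIT;
```

Running both statements in one transaction means that if the ALTER TABLE still fails, the delete is rolled back as well. It is worth running one of the SELECT queries first to review what would be removed.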