Deleting duplicates, with a caveat
Time: 2020-03-06 14:53:57 Source: igfitidea
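I have a table with rowID, longitude, latitude, businessName, url and caption columns. It might look like this:
rowID | long | lat | businessName | url | caption
1 | 20 | -20 | Pizza Hut | yum.com | null
How do I delete all of the duplicates but keep exactly one copy per business: the copy that has a URL (first priority) or, if no copy has a URL, the copy that has a caption (second priority), and then delete the rest?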
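To make the rule concrete, here is a small sketch in plain Python rather than SQL. It assumes that duplicates are rows sharing the same long, lat and businessName, and that the lowest rowID breaks any remaining tie; neither is spelled out in the question. Only row 1 comes from the question, the other sample rows and the helper names (`duplicate_key`, `priority`, `survivors`) are made up for illustration.

```python
from itertools import groupby

# Hypothetical sample rows: (rowID, long, lat, businessName, url, caption).
# Only row 1 appears in the question; rows 2-5 are invented for illustration.
rows = [
    (1, 20, -20, "Pizza Hut", "yum.com", None),
    (2, 20, -20, "Pizza Hut", None, "Pizza"),
    (3, 20, -20, "Pizza Hut", None, None),
    (4, 5, 5, "Cafe", None, "Coffee"),
    (5, 5, 5, "Cafe", None, None),
]


def duplicate_key(row):
    # Assumption: rows are duplicates when long, lat and businessName all match.
    return (row[1], row[2], row[3])


def priority(row):
    # False sorts before True, so a row with a URL beats one without,
    # then a row with a caption beats one without, then the lowest rowID wins.
    return (row[4] is None, row[5] is None, row[0])


def survivors(all_rows):
    # Return the rowIDs that should remain after de-duplication.
    keep = []
    ordered = sorted(all_rows, key=duplicate_key)
    for _, group in groupby(ordered, key=duplicate_key):
        keep.append(min(group, key=priority))
    return sorted(r[0] for r in keep)


print(survivors(rows))  # [1, 4]
```

Every answer below is a different way of expressing this same ordering in SQL.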
Solution
Here's my looping technique. It will probably get voted down for not being mainstream, and I'm fine with that.
DECLARE @LoopVar int
DECLARE @long int, @lat int, @businessname varchar(30), @winner int

SET @LoopVar = (SELECT MIN(rowID) FROM Locations)

WHILE @LoopVar is not null
BEGIN
    -- initialize the variables.
    SELECT @long = null, @lat = null, @businessname = null, @winner = null

    -- load data from the known good row.
    SELECT @long = long, @lat = lat, @businessname = businessname
    FROM Locations
    WHERE rowID = @LoopVar

    -- find the winning row with that data
    SELECT top 1 @winner = rowID
    FROM Locations
    WHERE @long = long AND @lat = lat AND @businessname = businessname
    ORDER BY
        CASE WHEN URL is not null THEN 1 ELSE 2 END,
        CASE WHEN Caption is not null THEN 1 ELSE 2 END,
        RowId

    -- delete any losers.
    DELETE FROM Locations
    WHERE @long = long AND @lat = lat AND @businessname = businessname
      AND @winner != rowID

    -- prep the next loop value.
    SET @LoopVar = (SELECT MIN(rowID) FROM Locations WHERE @LoopVar < rowID)
END
If possible, could you homogenize the rows first and then remove the duplicates?
Step 1:
UPDATE BusinessLocations
SET BusinessLocations.url = LocationsWithUrl.url
FROM BusinessLocations
INNER JOIN (
    SELECT long, lat, businessName, url, caption
    FROM BusinessLocations
    WHERE url IS NOT NULL
) LocationsWithUrl
    ON BusinessLocations.long = LocationsWithUrl.long
   AND BusinessLocations.lat = LocationsWithUrl.lat
   AND BusinessLocations.businessName = LocationsWithUrl.businessName

UPDATE BusinessLocations
SET BusinessLocations.caption = LocationsWithCaption.caption
FROM BusinessLocations
INNER JOIN (
    SELECT long, lat, businessName, url, caption
    FROM BusinessLocations
    WHERE caption IS NOT NULL
) LocationsWithCaption
    ON BusinessLocations.long = LocationsWithCaption.long
   AND BusinessLocations.lat = LocationsWithCaption.lat
   AND BusinessLocations.businessName = LocationsWithCaption.businessName
Step 2:
Remove the duplicates.
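The answer gives no code for step 2. One possible sketch, assuming step 1 has already made every duplicate carry the same url and caption: keep the lowest rowID in each (long, lat, businessName) group and delete the rest. The DELETE statement is the part that matters. It is run here through Python's built-in sqlite3 module only so the example is self-contained, which means SQLite stands in for SQL Server, and the sample rows are invented.

```python
import sqlite3

# Step 2 only. Assumes step 1 has already copied url/caption onto every duplicate,
# so it no longer matters which row of a group survives; the lowest rowID is kept.
STEP_2 = """
DELETE FROM BusinessLocations
WHERE rowID NOT IN (
    SELECT MIN(rowID)
    FROM BusinessLocations
    GROUP BY long, lat, businessName
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE BusinessLocations ("
    "rowID INTEGER PRIMARY KEY, long INTEGER, lat INTEGER, "
    "businessName TEXT, url TEXT, caption TEXT)"
)
# Hypothetical, already homogenized rows: 1 and 2 are duplicates, 3 is unique.
conn.executemany(
    "INSERT INTO BusinessLocations VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 20, -20, "Pizza Hut", "yum.com", "Pizza"),
        (2, 20, -20, "Pizza Hut", "yum.com", "Pizza"),
        (3, 5, 5, "Cafe", None, None),
    ],
)
conn.execute(STEP_2)

remaining = [
    r[0] for r in conn.execute("SELECT rowID FROM BusinessLocations ORDER BY rowID")
]
print(remaining)  # [1, 3]
```

Note that step 1 only gives a well-defined result when the duplicates agree on their non-null url and caption values. If two copies carry different URLs, which one gets copied around is not determined.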
A set-based solution:
delete t1
from T as t1
where
    /* delete if there is a "better" row with the same long, lat and businessName */
    exists (
        select *
        from T as t2
        where t1.rowID <> t2.rowID
          and t1.long = t2.long
          and t1.lat = t2.lat
          and t1.businessName = t2.businessName
          and
              case when t1.url is null then 0 else 4 end        /* 4 points for non-null url */
            + case when t1.caption is null then 0 else 2 end    /* 2 points for non-null caption */
            + case when t1.rowID > t2.rowID then 0 else 1 end   /* 1 point for having the smaller rowID */
            <
              case when t2.url is null then 0 else 4 end
            + case when t2.caption is null then 0 else 2 end
            + case when t2.rowID > t1.rowID then 0 else 1 end   /* same tie-breaker for t2, so equal scores still lose one row */
    )
delete MyTable
from MyTable
left outer join (
    select min(rowID) as rowID, long, lat, businessName
    from MyTable
    where url is not null
    group by long, lat, businessName
) as HasUrl
    on MyTable.long = HasUrl.long
   and MyTable.lat = HasUrl.lat
   and MyTable.businessName = HasUrl.businessName
left outer join (
    select min(rowID) as rowID, long, lat, businessName
    from MyTable
    where caption is not null
    group by long, lat, businessName
) HasCaption
    on MyTable.long = HasCaption.long
   and MyTable.lat = HasCaption.lat
   and MyTable.businessName = HasCaption.businessName
left outer join (
    select min(rowID) as rowID, long, lat, businessName
    from MyTable
    where url is null and caption is null
    group by long, lat, businessName
) HasNone
    on MyTable.long = HasNone.long
   and MyTable.lat = HasNone.lat
   and MyTable.businessName = HasNone.businessName
where MyTable.rowID <> coalesce(HasUrl.rowID, HasCaption.rowID, HasNone.rowID)
Last week's "Things I learned on Stack Overflow" brings us this solution:
DELETE restaurant
WHERE rowID in (
    SELECT rowID FROM restaurant
    EXCEPT
    SELECT rowID
    FROM (
        SELECT rowID,
               Rank() over (Partition BY BusinessName, lat, long
                            ORDER BY url DESC, caption DESC) AS Rank
        FROM restaurant
    ) rs
    WHERE Rank = 1
)
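Warning: I have not tested this against a real database.
Similar to the other answer, but you want to delete based on row number rather than rank. Mix in a common table expression as well: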
;WITH GroupedRows AS (
    SELECT rowID,
           Row_Number() OVER (Partition BY BusinessName, lat, long
                              ORDER BY url DESC, caption DESC) rowNum
    FROM restaurant
)
DELETE r
FROM restaurant r
JOIN GroupedRows gr ON r.rowID = gr.rowID
WHERE gr.rowNum > 1
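Why row number rather than rank matters: when two duplicates tie on url and caption, RANK() gives both of them 1, so a "keep rank 1" delete leaves both behind, whereas ROW_NUMBER() numbers them 1 and 2. The sketch below shows the difference on invented rows. It assumes SQLite 3.25 or later (window function support) accessed through Python's sqlite3 module instead of SQL Server, and it adds rowID as a final ORDER BY tie-breaker, which the answer above does not have, so that the surviving row is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE restaurant ("
    "rowID INTEGER PRIMARY KEY, long INTEGER, lat INTEGER, "
    "BusinessName TEXT, url TEXT, caption TEXT)"
)
# Hypothetical rows: 1 and 2 tie exactly (same url, no caption); 3 has only a caption.
conn.executemany(
    "INSERT INTO restaurant VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 20, -20, "Pizza Hut", "yum.com", None),
        (2, 20, -20, "Pizza Hut", "yum.com", None),
        (3, 20, -20, "Pizza Hut", None, "Pizza"),
    ],
)

# rowID in the ROW_NUMBER ordering is an added tie-breaker (an assumption,
# not part of the original answer).
COMPARE = """
SELECT rowID,
       RANK() OVER (PARTITION BY BusinessName, lat, long
                    ORDER BY url DESC, caption DESC) AS rnk,
       ROW_NUMBER() OVER (PARTITION BY BusinessName, lat, long
                          ORDER BY url DESC, caption DESC, rowID) AS rowNum
FROM restaurant
ORDER BY rowID
"""

results = conn.execute(COMPARE).fetchall()
kept_by_rank = [row_id for row_id, rnk, _ in results if rnk == 1]
kept_by_row_number = [row_id for row_id, _, row_num in results if row_num == 1]

print(kept_by_rank)        # [1, 2]  (the tie survives a rank-based delete)
print(kept_by_row_number)  # [1]
```

Both SQLite and SQL Server sort NULLs last under DESC, which is what lets `ORDER BY url DESC, caption DESC` put rows with a URL first.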