Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow original URL: http://stackoverflow.com/questions/6683788/

Date: 2020-09-01 11:19:25  Source: igfitidea

SQL query to find duplicate rows, in any table

sql tsql

Asked by Richard Pianka

I'm looking for a schema-independent query. That is, if I have a users table or a purchases table, the query should be equally capable of catching duplicate rows in either table without any modification (other than the from clause, of course).

I'm using T-SQL, but I'm guessing there should be a general solution.

Answered by Tom H

I believe that this should work for you. Keep in mind that CHECKSUM() isn't 100% perfect: it is a hash, so two different rows can produce the same value, and it's theoretically possible to get a false positive here (I think). Otherwise you can just change the table name and this should work:

;WITH cte AS (
    SELECT
        *,
        CHECKSUM(*) AS chksum,
        ROW_NUMBER() OVER(ORDER BY GETDATE()) AS row_num
    FROM
        My_Table
)
SELECT
    *
FROM
    CTE T1
INNER JOIN CTE T2 ON
    T2.chksum = T1.chksum AND
    T2.row_num <> T1.row_num

The ROW_NUMBER() is needed so that you have some way of distinguishing rows. It requires an ORDER BY and that can't be a constant, so GETDATE() was my workaround for that.

Simply change the table name in the CTE and it should work without spelling out the columns.
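To try the approach above on throwaway data, here is a minimal sketch. The temp table #Demo and its columns id and name are assumptions made up for illustration, and ORDER BY (SELECT NULL) is swapped in as a common alternative to the GETDATE() workaround. Because CHECKSUM() can collide and the row numbering is arbitrary, treat whatever comes back as candidate duplicates rather than proof.

```sql
-- Hypothetical demo table; #Demo, id, and name are not from the original answer
CREATE TABLE #Demo (id int, name varchar(20));
INSERT INTO #Demo (id, name) VALUES (1, 'ann'), (2, 'bob'), (1, 'ann');

;WITH cte AS (
    SELECT
        *,
        CHECKSUM(*) AS chksum,
        -- A subquery satisfies the ORDER BY requirement without imposing a real order
        ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS row_num
    FROM
        #Demo
)
SELECT
    T1.id,
    T1.name,
    T1.row_num,
    T2.row_num AS matching_row_num
FROM
    cte T1
INNER JOIN cte T2 ON
    T2.chksum = T1.chksum AND
    T2.row_num <> T1.row_num;

DROP TABLE #Demo;
```

The final SELECT names the demo columns only to keep the output readable; SELECT * works there as well, which is what keeps the original query schema-independent.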

Answered by Conrad Frix

I'm still confused about what "detecting them might be" but I'll give it a shot.

Excluding them is easy

e.g.

SELECT DISTINCT * FROM USERS

However, if you want to return only the duplicates, and a duplicate means a match on all the fields, then you have to do

SELECT 
   [Each and every field]
FROM
   USERS
GROUP BY
   [Each and every field]
HAVING COUNT(*) > 1  
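As a concrete instance of the template above, here is a small sketch against a made-up table. The temp table #Users and the columns first_name, last_name, and email are assumptions for illustration only; substitute the real column list of your table.

```sql
-- Hypothetical table and columns, invented for this example
CREATE TABLE #Users (
    first_name varchar(50),
    last_name  varchar(50),
    email      varchar(100)
);

INSERT INTO #Users (first_name, last_name, email) VALUES
    ('Ann', 'Lee', 'ann@example.com'),
    ('Bob', 'Ray', 'bob@example.com'),
    ('Ann', 'Lee', 'ann@example.com');

-- Grouping by every column puts identical rows in the same group;
-- HAVING keeps only the groups that contain more than one row
SELECT
    first_name, last_name, email,
    COUNT(*) AS copies
FROM
    #Users
GROUP BY
    first_name, last_name, email
HAVING COUNT(*) > 1;

DROP TABLE #Users;
```

With this data the query returns a single row for Ann Lee with copies = 2. Note that GROUP BY places NULLs in the same group, so rows that are identical including their NULL columns are also reported as duplicates.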

You can't get away with just using (*) because you can't GROUP BY *, so this requirement from your comments is difficult:

"a schema-independent means I don't want to specify all of the columns in the query"

Unless, that is, you want to use dynamic SQL and read the columns from sys.columns or information_schema.columns

For example

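DECLARE @columns nvarchar(max)
SET  @columns = ''

-- Build a comma-separated, bracket-quoted list of every column in the table
SELECT @columns = @columns + QUOTENAME(COLUMN_NAME) + ', '
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'USERS'

-- Strip the trailing comma (LEN ignores the trailing space)
SET  @columns = left(@columns, len(@columns) - 1)


-- Note the spaces around FROM and GROUP BY so the keywords don't run together
DECLARE @SQL nvarchar(max)
SET @SQL = 'SELECT ' + @columns
          + ' FROM USERS' + ' GROUP BY '
          + @columns
          + ' HAVING COUNT(*) > 1'


exec sp_executesql @SQL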

Please note you should read The Curse and Blessings of Dynamic SQL if you haven't already.
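The same dynamic SQL idea can be sketched with the table name held in variables, so that only those values change from table to table. This is a hedged variation, not the answer's own code: the variables @schema, @table, @cols, and @sql and the output column copies are names invented here, and STRING_AGG assumes SQL Server 2017 or later.

```sql
-- Assumed inputs: change these two values to target another table
DECLARE @schema sysname = N'dbo';
DECLARE @table  sysname = N'USERS';

DECLARE @cols nvarchar(max);

-- Build the quoted column list in table order.
-- The CAST to nvarchar(max) avoids the length limit on STRING_AGG results.
SELECT @cols = STRING_AGG(CAST(QUOTENAME(COLUMN_NAME) AS nvarchar(max)), N', ')
               WITHIN GROUP (ORDER BY ORDINAL_POSITION)
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = @schema
  AND TABLE_NAME = @table;

-- @cols stays NULL when no matching table is found, so nothing is executed
IF @cols IS NOT NULL
BEGIN
    DECLARE @sql nvarchar(max) =
          N'SELECT ' + @cols + N', COUNT(*) AS copies'
        + N' FROM ' + QUOTENAME(@schema) + N'.' + QUOTENAME(@table)
        + N' GROUP BY ' + @cols
        + N' HAVING COUNT(*) > 1;';

    EXEC sp_executesql @sql;
END
```

QUOTENAME is used on every identifier because the names are concatenated into the statement text. Tables with columns of types that cannot be grouped (such as text, ntext, image, or xml) will still fail with this approach.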

Answered by JohnD

I have done this using CTEs in SQL Server.

Here is a sample on how to delete dupes but you should be able to adapt it easily to find dupes:

WITH CTE (COl1, Col2, DuplicateCount)
AS
(
    SELECT COl1,Col2,
    ROW_NUMBER() OVER(PARTITION BY COl1,Col2 ORDER BY Col1) AS DuplicateCount
    FROM DuplicateRcordTable
)
DELETE
FROM CTE
WHERE DuplicateCount > 1
GO

Here is a link to an article where I got the SQL:

http://blog.sqlauthority.com/2009/06/23/sql-server-2005-2008-delete-duplicate-rows/
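One way to adapt the DELETE example so that it finds duplicates instead is to keep the CTE and swap the DELETE for a SELECT. The sketch below is an illustration under assumptions: the temp table #Dupes and its sample rows are made up, standing in for the DuplicateRcordTable used in the answer.

```sql
-- Hypothetical stand-in for the table in the answer
CREATE TABLE #Dupes (Col1 int, Col2 varchar(20));
INSERT INTO #Dupes (Col1, Col2) VALUES (1, 'a'), (1, 'a'), (2, 'b');

WITH CTE (Col1, Col2, DuplicateCount)
AS
(
    SELECT Col1, Col2,
    -- Numbers the rows within each group of identical (Col1, Col2) values
    ROW_NUMBER() OVER(PARTITION BY Col1, Col2 ORDER BY Col1) AS DuplicateCount
    FROM #Dupes
)
-- Rows numbered 2 and up are the extra copies
SELECT Col1, Col2, DuplicateCount
FROM CTE
WHERE DuplicateCount > 1;

DROP TABLE #Dupes;
```

With this data the query returns one row, (1, 'a') with DuplicateCount = 2. It lists only the extra copies, not the first row of each group; like the original, it still requires the columns to be named in PARTITION BY.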

Answered by ppijnenburg

I was recently looking into the same issue and noticed this question. I managed to solve it using a stored procedure with some dynamic SQL. This way you only need to specify the table name, and it will get all the other relevant data from the sys tables.

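/*
This SP returns all duplicate rows (1 line for each duplicate) for any given table.

to use the SP:
exec [database].[dbo].[sp_duplicates] 
    @table = '[database].[schema].[table]'  

@table is declared wide enough to hold a bracketed three-part name.
*/
create proc dbo.sp_duplicates @table nvarchar(400) as

declare @query nvarchar(max)
declare @groupby nvarchar(max)

-- Comma-separated, bracket-quoted list of every column in the table
set @groupby =  stuff((select ',' + quotename([name])
                FROM sys.columns
                WHERE object_id = OBJECT_ID(@table)
                FOR xml path('')), 1, 1, '')

-- Grouping by every column collapses identical rows into one group;
-- only groups with more than one row are returned
set @query = 'select *, count(*) as duplicate_count
                from '+@table+'
                group by '+@groupby+'
                having count(*) > 1'

exec (@query)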
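A quick way to try the procedure is to run it against a small throwaway table. The sketch below assumes the procedure has already been created in the current database; the table dbo.DemoUsers and its columns are hypothetical names chosen for illustration.

```sql
-- Hypothetical demo table in the current database
CREATE TABLE dbo.DemoUsers (id int, name varchar(20));
INSERT INTO dbo.DemoUsers (id, name) VALUES (1, 'ann'), (2, 'bob'), (1, 'ann');

-- Only the table name is passed; the column list is read from sys.columns
EXEC dbo.sp_duplicates @table = N'dbo.DemoUsers';

DROP TABLE dbo.DemoUsers;
```

Since the table name is concatenated straight into the query text, the procedure should only be called with trusted input.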