SQL: Best self join technique when checking for duplicates

Notice: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/5859191/

Time: 2020-09-01 10:22:32  Source: igfitidea

Best self join technique when checking for duplicates

sql sql-server-2008

Asked by Dustin Davis

I'm trying to optimize a production query that takes a long time. The goal is to find duplicate records based on matching field values and then delete them. The current query uses a self join (an inner join on t1.col1 = t2.col1) and then a where clause to compare the remaining values.

select * from table t1 
inner join table t2 on t1.col1 = t2.col1
where t1.col2 = t2.col2 ...

What would be a better way to do this? Or is it all the same based on indexes? Maybe

select * from table t1, table t2
where t1.col1 = t2.col1 and t1.col2 = t2.col2 ...

This table has 100m+ rows.

MS SQL, SQL Server 2008 Enterprise

select distinct t2.id
    from table1 t1 with (nolock)
    inner join table1 t2 with (nolock) on  t1.ckid=t2.ckid
    left join table2 t3 on t1.cid = t3.cid and t1.typeid = t3.typeid
    where 
    t2.id > @Max_id and
    t2.timestamp > t1.timestamp and
    t2.rid = 2 and
    isnull(t1.col1,'') = isnull(t2.col1,'') and 
    isnull(t1.cid,-1) = isnull(t2.cid,-1) and
    isnull(t1.rid,-1) = isnull(t2.rid,-1) and
    isnull(t1.typeid,-1) = isnull(t2.typeid,-1) and
    isnull(t1.cktypeid,-1) = isnull(t2.cktypeid,-1) and
    isnull(t1.oid,'') = isnull(t2.oid,'') and
    isnull(t1.stypeid,-1) = isnull(t2.stypeid,-1)  

    and (
            (
                t3.uniqueoid = 1
            )
            or
            (
                t3.uniqueoid is null and 
                isnull(t1.col1,'') = isnull(t2.col1,'') and 
                isnull(t1.col2,'') = isnull(t2.col2,'') and
                isnull(t1.rdid,-1) = isnull(t2.rdid,-1) and 
                isnull(t1.stid,-1) = isnull(t2.stid,-1) and
                isnull(t1.huaid,-1) = isnull(t2.huaid,-1) and
                isnull(t1.lpid,-1) = isnull(t2.lpid,-1) and
                isnull(t1.col3,-1) = isnull(t2.col3,-1) 
            )
    )

Answered by gbn

Why a self join? This is an aggregation problem.

Hope you have an index on col1, col2, ...
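
A minimal sketch of such an index, assuming a placeholder table name dbo.SomeTable (the question does not give the real one) and that col1 and col2 are the columns that define a duplicate. With the key order matching the GROUP BY columns, the optimizer can often aggregate over the ordered index instead of sorting or hashing the whole table, although the actual plan should be checked.

```sql
-- Hypothetical names: dbo.SomeTable and the index name are placeholders.
-- The key columns follow the GROUP BY columns used to detect duplicates.
-- On Enterprise edition, WITH (ONLINE = ON) can be appended to keep the
-- table available while the index is built.
CREATE NONCLUSTERED INDEX IX_SomeTable_col1_col2
    ON dbo.SomeTable (col1, col2);
```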

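-- Lists each duplicated (col1, col2) group and the one row to keep.
-- To turn this into the DELETE, select only MIN(KeyCol) and drop the
-- HAVING clause; otherwise rows without duplicates would be deleted too.
--DELETE table
--WHERE KeyCol NOT IN (
select
    MIN(KeyCol) AS RowToKeep,
    col1, col2
from
    table
GROUP BY
    col1, col2
HAVING
   COUNT(*) > 1
--)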

However, this will take some time. Have a look at bulk delete techniques.
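
One common way to keep a large delete manageable is to work in batches. The following is only a sketch under assumptions: dbo.SomeTable is a placeholder name, KeyCol is taken to be a unique, non-nullable key, and the batch size is an arbitrary starting value. It reuses the MIN(KeyCol) idea above to pick the row to keep in each group.

```sql
-- Assumptions: dbo.SomeTable, KeyCol and the batch size are placeholders.
-- KeyCol must be unique and NOT NULL for the NOT IN test to be safe.
DECLARE @BatchSize int = 10000;
DECLARE @Rows int = 1;

-- Collect the keys to delete once: every row that is not the lowest
-- KeyCol of its (col1, col2) group. GROUP BY treats NULLs as equal.
SELECT t.KeyCol
INTO #ToDelete
FROM dbo.SomeTable AS t
WHERE t.KeyCol NOT IN (
    SELECT MIN(KeyCol)
    FROM dbo.SomeTable
    GROUP BY col1, col2
);

CREATE UNIQUE CLUSTERED INDEX IX_ToDelete ON #ToDelete (KeyCol);

-- Delete in small batches so each transaction stays short.
WHILE @Rows > 0
BEGIN
    DELETE TOP (@BatchSize) t
    FROM dbo.SomeTable AS t
    INNER JOIN #ToDelete AS x ON x.KeyCol = t.KeyCol;

    SET @Rows = @@ROWCOUNT;
END;

DROP TABLE #ToDelete;
```

Smaller batches hold locks for less time and keep log growth per transaction low, at the cost of more round trips through the loop.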

Answered by Bruno Costa

You can use ROW_NUMBER() to find duplicate rows in one table.

You can check here
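
As an illustration of the ROW_NUMBER() approach (not necessarily what the linked page showed), here is a minimal sketch. It assumes a placeholder table dbo.SomeTable with a unique KeyCol, and that col1 and col2 define a duplicate; extend the PARTITION BY list to the real set of columns.

```sql
-- Assumptions: dbo.SomeTable and KeyCol are placeholder names.
WITH Ranked AS (
    SELECT
        KeyCol,
        -- Number the rows inside each group of identical (col1, col2) values.
        ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY KeyCol) AS rn
    FROM dbo.SomeTable
)
SELECT KeyCol      -- every row with rn > 1 repeats an earlier row
FROM Ranked
WHERE rn > 1;
```

PARTITION BY groups NULL values together, so no isnull() wrappers are needed. In SQL Server the final SELECT can also be swapped for DELETE FROM Ranked WHERE rn > 1 to remove the repeats through the CTE.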

Answered by Pravin

For a table with 100m+ rows, using GROUP BY together with a holding table should perform better, even though it takes four queries.

STEP 1: create a holding key:

SELECT col1, col2, col3=count(*)
INTO holdkey
FROM t1
GROUP BY col1, col2
HAVING count(*) > 1

STEP 2: Push all the duplicate entries into the holddups table. This is required for Step 4.

SELECT DISTINCT t1.*
INTO holddups
FROM t1, holdkey
WHERE t1.col1 = holdkey.col1
AND t1.col2 = holdkey.col2

STEP 3: Delete the duplicate rows from the original table.

DELETE t1
FROM t1, holdkey
WHERE t1.col1 = holdkey.col1
AND t1.col2 = holdkey.col2

STEP 4: Put the unique rows back in the original table. For example:

INSERT t1 SELECT * FROM holddups
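
Between steps 3 and 4 the table is missing every duplicated group, so it may be worth running the two steps as one unit. The sketch below assumes the t1, holdkey and holddups tables from the steps above, and that the duplicate rows are identical in every column; if they differ in some other column (an identity key, for instance), SELECT DISTINCT t1.* in step 2 keeps every copy and step 4 would put the duplicates back.

```sql
-- Sketch: steps 3 and 4 in a single transaction (SQL Server 2008 syntax).
BEGIN TRY
    BEGIN TRANSACTION;

    -- Step 3: remove every row that belongs to a duplicated group.
    DELETE t1
    FROM t1
    INNER JOIN holdkey
        ON  t1.col1 = holdkey.col1
        AND t1.col2 = holdkey.col2;

    -- Step 4: put one copy of each row back.
    INSERT t1 SELECT * FROM holddups;

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    -- Undo both steps if either one fails, then report the error.
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
    DECLARE @msg nvarchar(2048) = ERROR_MESSAGE();
    RAISERROR('%s', 16, 1, @msg);
END CATCH;
```

The holding tables are deliberately not dropped here, so they remain available for inspection if the transaction rolls back.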

Answered by Jay

The two methods you give should be equivalent. I think most SQL engines would do exactly the same thing in both cases.

And, by the way, this won't work as written. You need at least one field that is different, or every record will match itself.
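
To illustrate the point about a differing field, here is a minimal sketch of a self join that cannot match a row with itself. It assumes a placeholder table dbo.SomeTable with a unique key called KeyCol; neither name comes from the question.

```sql
-- Assumptions: dbo.SomeTable and KeyCol are placeholder names.
SELECT
    t2.KeyCol AS DuplicateKey,
    t1.KeyCol AS EarlierKey
FROM dbo.SomeTable AS t1
INNER JOIN dbo.SomeTable AS t2
    ON  t1.col1 = t2.col1
    AND t1.col2 = t2.col2
    AND t2.KeyCol > t1.KeyCol;   -- excludes self-matches and mirrored pairs
```

A group of n identical rows still produces n(n-1)/2 pairs, which is why a self join of this kind usually needs DISTINCT and becomes expensive when groups are large.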

You might want to try something more like:

select col1, col2, col3
from table
group by col1, col2, col3
having count(*)>1

Answered by FrankPl

In my experience, SQL Server performance is really bad with OR conditions. It is probably not the self join but the join to t3 (table2 in your query) that causes the bad performance. But without seeing the plan, I cannot be sure.

In this case, it might help to split your query into two: one with the WHERE condition t3.uniqueoid = 1 and one with the WHERE conditions of the other branch on t3, and then use UNION ALL to append one to the other.
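
A sketch of how that split could look, using the table and column names from the question. The type of @Max_id is an assumption, and whether this actually beats the OR version can only be confirmed by comparing execution plans.

```sql
-- Sketch only: names come from the question; the @Max_id type is assumed.
DECLARE @Max_id bigint = 0;

SELECT DISTINCT d.id
FROM (
    -- Branch 1: the table2 row exists and has uniqueoid = 1.
    -- Filtering on t3 makes the left join behave as an inner join.
    SELECT t2.id
    FROM table1 t1 WITH (NOLOCK)
    INNER JOIN table1 t2 WITH (NOLOCK) ON t1.ckid = t2.ckid
    INNER JOIN table2 t3 ON t1.cid = t3.cid AND t1.typeid = t3.typeid
    WHERE t3.uniqueoid = 1
      AND t2.id > @Max_id
      AND t2.timestamp > t1.timestamp
      AND t2.rid = 2
      AND isnull(t1.col1,'') = isnull(t2.col1,'')
      AND isnull(t1.cid,-1) = isnull(t2.cid,-1)
      AND isnull(t1.rid,-1) = isnull(t2.rid,-1)
      AND isnull(t1.typeid,-1) = isnull(t2.typeid,-1)
      AND isnull(t1.cktypeid,-1) = isnull(t2.cktypeid,-1)
      AND isnull(t1.oid,'') = isnull(t2.oid,'')
      AND isnull(t1.stypeid,-1) = isnull(t2.stypeid,-1)

    UNION ALL

    -- Branch 2: no table2 row (or a NULL uniqueoid), plus the extra checks.
    SELECT t2.id
    FROM table1 t1 WITH (NOLOCK)
    INNER JOIN table1 t2 WITH (NOLOCK) ON t1.ckid = t2.ckid
    LEFT JOIN table2 t3 ON t1.cid = t3.cid AND t1.typeid = t3.typeid
    WHERE t3.uniqueoid IS NULL
      AND t2.id > @Max_id
      AND t2.timestamp > t1.timestamp
      AND t2.rid = 2
      AND isnull(t1.col1,'') = isnull(t2.col1,'')
      AND isnull(t1.cid,-1) = isnull(t2.cid,-1)
      AND isnull(t1.rid,-1) = isnull(t2.rid,-1)
      AND isnull(t1.typeid,-1) = isnull(t2.typeid,-1)
      AND isnull(t1.cktypeid,-1) = isnull(t2.cktypeid,-1)
      AND isnull(t1.oid,'') = isnull(t2.oid,'')
      AND isnull(t1.stypeid,-1) = isnull(t2.stypeid,-1)
      AND isnull(t1.col2,'') = isnull(t2.col2,'')
      AND isnull(t1.rdid,-1) = isnull(t2.rdid,-1)
      AND isnull(t1.stid,-1) = isnull(t2.stid,-1)
      AND isnull(t1.huaid,-1) = isnull(t2.huaid,-1)
      AND isnull(t1.lpid,-1) = isnull(t2.lpid,-1)
      AND isnull(t1.col3,-1) = isnull(t2.col3,-1)
) AS d;
```

The outer SELECT DISTINCT is kept because the same t2.id can appear in both branches through different t1 rows, and UNION ALL does not remove those repeats.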

Answered by Christoph Walesch

To detect duplicates, you don't need to join:

SELECT col1, col2
FROM table
GROUP BY col1, col2
HAVING COUNT(*) > 1

That should be much faster.
