How to avoid inserting duplicate records when using a T-SQL Merge statement

Notice: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/6592643/

Time: 2020-09-01 11:14:07  Source: igfitidea

How to avoid inserting duplicate records when using a T-SQL Merge statement

Tags: sql, tsql, merge

Asked by Jed

I am attempting to insert many records using T-SQL's MERGE statement, but my query fails to INSERT when there are duplicate records in the source table. The failure is caused by:

  1. The target table has a Primary Key based on two columns
  2. The source table may contain duplicate records that violate the target table's Primary Key constraint ("Violation of PRIMARY KEY constraint" is thrown)

I'm looking for a way to change my MERGE statement so that it either ignores duplicate records within the source table or wraps the INSERT in a try/catch to handle any exceptions that occur (i.e. all other INSERTs still run regardless of the few bad eggs). Or maybe there's a better way to go about this problem?

Here's a query example of what I'm trying to explain. The example below adds 100k records to a temp table and then attempts to insert those records into the target table:

EDIT: In my original post I only included two fields in the example tables, which led SO friends to suggest a DISTINCT solution to avoid duplicates in the MERGE statement. I should have mentioned that in my real-world problem the tables have 15 fields, and two of those 15 fields form a CLUSTERED PRIMARY KEY. So the DISTINCT keyword doesn't work, because I need to SELECT all 15 fields and ignore duplicates based on two of the fields.

I have updated the query below to include one more field, col4. I need to include col4 in the MERGE, but I only need to make sure that ONLY col2 and col3 are unique.

-- Create the source table
CREATE TABLE #tmp (
col2 datetime NOT NULL,
col3 int NOT NULL,
col4 int
)
GO

-- Add a bunch of test data to the source table
-- For testing purposes, allow duplicate records to be added to this table
DECLARE @loopCount int = 100000
DECLARE @loopCounter int = 0
DECLARE @randDateOffset int
DECLARE @col2 datetime
DECLARE @col3 int
DECLARE @col4 int

WHILE (@loopCounter) < @loopCount
BEGIN
    SET @randDateOffset = RAND() * 100000
    SET @col2 = DATEADD(MI,@randDateOffset,GETDATE())
    SET @col3 = RAND() * 1000
    SET @col4 = RAND() * 10
    INSERT INTO #tmp
    (col2,col3,col4)
    VALUES
    (@col2,@col3,@col4);

    SET @loopCounter = @loopCounter + 1
END

-- Insert the source data into the target table
-- How do we make sure we don't attempt to INSERT a duplicate record? Or how can we 
-- catch exceptions? Or?
MERGE INTO dbo.tbl1 AS tbl
    USING (SELECT * FROM #tmp) AS src
    ON (tbl.col2 = src.col2 AND tbl.col3 = src.col3)
    WHEN NOT MATCHED THEN 
        INSERT (col2,col3,col4)
        VALUES (src.col2,src.col3,src.col4);
GO

Answered by t-clausen.dk

Solved to your new specification, inserting only the highest value of col4. This time I used a GROUP BY to prevent duplicate rows.

MERGE INTO dbo.tbl1 AS tbl 
USING (SELECT col2,col3, max(col4) col4 FROM #tmp group by col2,col3) AS src 
ON (tbl.col2 = src.col2 AND tbl.col3 = src.col3) 
WHEN NOT MATCHED THEN  
    INSERT (col2,col3,col4) 
    VALUES (src.col2,src.col3,src.col4); 
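
One thing worth checking before applying the GROUP BY approach to a table with many non-key columns: each MAX is computed per column, so the merged row can combine values from different source rows. A minimal sketch of that effect, assuming a hypothetical table variable @src and a hypothetical extra column col5 (neither is from the original post):

```sql
-- Hypothetical table variable; col5 is an assumed extra non-key column
DECLARE @src TABLE (
    col2 datetime NOT NULL,
    col3 int NOT NULL,
    col4 int,
    col5 int
);

-- Two rows with the same key (col2, col3) but different payloads
INSERT INTO @src (col2, col3, col4, col5)
VALUES ('2020-01-01T00:00:00', 1, 10, 2),
       ('2020-01-01T00:00:00', 1, 5, 9);

-- Each aggregate is evaluated independently per column
SELECT col2, col3, MAX(col4) AS col4, MAX(col5) AS col5
FROM @src
GROUP BY col2, col3;
-- Returns one row with col4 = 10 and col5 = 9,
-- a combination that appears in neither source row
```

With a single non-key column, as in the example tables, this does not matter. With 15 fields it may, depending on whether a stitched-together row is acceptable.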

Answered by gbn

Given the source has duplicates and you aren't using MERGE fully, I'd use an INSERT.

 INSERT dbo.tbl1 (col2,col3) 
 SELECT DISTINCT col2,col3
 FROM #tmp src
 WHERE NOT EXISTS (
       SELECT *
       FROM dbo.tbl1 tbl
       WHERE tbl.col2 = src.col2 AND tbl.col3 = src.col3)

The reason MERGE fails is that it isn't checked row by row. All non-matches are found, then it tries to INSERT all these. It doesn't check for rows in the same batch that already match.

This reminds me a bit of the "Halloween problem", where early data changes of an atomic operation affect later data changes, which isn't correct.
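
The failure described in this answer can be reproduced with a few rows. The sketch below assumes hypothetical temp tables #target and #source (stand-ins for dbo.tbl1 and #tmp, not names from the original post). It also wraps the MERGE in TRY...CATCH, as the question suggests, to show what that does and does not buy you.

```sql
-- Hypothetical stand-ins for dbo.tbl1 and #tmp
CREATE TABLE #target (
    col2 datetime NOT NULL,
    col3 int NOT NULL,
    col4 int,
    PRIMARY KEY (col2, col3)
);
CREATE TABLE #source (
    col2 datetime NOT NULL,
    col3 int NOT NULL,
    col4 int
);

-- Two source rows share the key (col2, col3); the third row is unique
INSERT INTO #source (col2, col3, col4)
VALUES ('2020-01-01T00:00:00', 1, 10),
       ('2020-01-01T00:00:00', 1, 20),
       ('2020-01-02T00:00:00', 2, 30);

BEGIN TRY
    -- None of the three rows matches the empty target,
    -- so MERGE tries to insert all of them in one statement
    MERGE INTO #target AS tbl
    USING #source AS src
        ON (tbl.col2 = src.col2 AND tbl.col3 = src.col3)
    WHEN NOT MATCHED THEN
        INSERT (col2, col3, col4)
        VALUES (src.col2, src.col3, src.col4);
END TRY
BEGIN CATCH
    -- Error 2627 is the PRIMARY KEY constraint violation
    SELECT ERROR_NUMBER() AS ErrorNumber, ERROR_MESSAGE() AS ErrorMessage;
END CATCH;

-- The failed statement is undone as a unit, so the unique row is absent too
SELECT COUNT(*) AS RowsInTarget FROM #target;

DROP TABLE #source;
DROP TABLE #target;
```

TRY...CATCH keeps the batch running, but it does not let the "good" rows through: the MERGE is a single statement and fails as a whole. That is why the answers here deduplicate the source before inserting.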

Answered by Hai Phan

Instead of GROUP BY you can use an analytic function, allowing you to select a specific record in the set of duplicate records to merge.

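MERGE INTO dbo.tbl1 AS tbl
USING (
    SELECT *
    FROM (
        -- ModifiedDate is not in the question's #tmp; it is assumed to exist in the real source table
        SELECT *, ROW_NUMBER() OVER (PARTITION BY col2, col3 ORDER BY ModifiedDate DESC) AS Rn
        FROM #tmp
    ) t
    WHERE Rn = 1    --choose the most recently modified record
) AS src
ON (tbl.col2 = src.col2 AND tbl.col3 = src.col3)
WHEN NOT MATCHED THEN
    INSERT (col2,col3,col4)
    VALUES (src.col2,src.col3,src.col4);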
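
The question's #tmp has no ModifiedDate column, so the query above presumably assumes one exists in the real table. Here is a self-contained sketch of the same ROW_NUMBER technique under a different assumption: it orders by col4 DESC instead, and uses hypothetical temp tables #target and #source in place of dbo.tbl1 and #tmp.

```sql
-- Hypothetical stand-ins for dbo.tbl1 and #tmp
CREATE TABLE #target (
    col2 datetime NOT NULL,
    col3 int NOT NULL,
    col4 int,
    PRIMARY KEY (col2, col3)
);
CREATE TABLE #source (
    col2 datetime NOT NULL,
    col3 int NOT NULL,
    col4 int
);

-- The first two rows share the key (col2, col3)
INSERT INTO #source (col2, col3, col4)
VALUES ('2020-01-01T00:00:00', 1, 10),
       ('2020-01-01T00:00:00', 1, 20),
       ('2020-01-02T00:00:00', 2, 30);

MERGE INTO #target AS tbl
USING (
    SELECT col2, col3, col4
    FROM (
        -- Number the rows within each key; Rn = 1 is the row with the largest col4
        -- (ordering by col4 is an assumption made for this sketch)
        SELECT col2, col3, col4,
               ROW_NUMBER() OVER (PARTITION BY col2, col3 ORDER BY col4 DESC) AS Rn
        FROM #source
    ) AS t
    WHERE t.Rn = 1
) AS src
ON (tbl.col2 = src.col2 AND tbl.col3 = src.col3)
WHEN NOT MATCHED THEN
    INSERT (col2, col3, col4)
    VALUES (src.col2, src.col3, src.col4);

-- One row per key survives: (2020-01-01, 1, 20) and (2020-01-02, 2, 30)
SELECT col2, col3, col4 FROM #target ORDER BY col2, col3;

DROP TABLE #source;
DROP TABLE #target;
```

Unlike per-column aggregates, this keeps one whole source row per key, so all non-key columns come from the same record. The ORDER BY inside OVER decides which duplicate wins.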