C# SqlBulkCopy Error Handling / Continue on Error
Disclaimer: this page is a Chinese-English translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, link to the original question, and attribute it to the original authors (not me): StackOverFlow
Original question: http://stackoverflow.com/questions/1004525/
SqlBulkCopy Error handling / continue on error
Asked by Paladin
I am trying to insert a huge amount of data into SQL Server. My destination table has a unique index called "Hash".
I would like to replace my SqlDataAdapter implementation with SqlBulkCopy. SqlDataAdapter has a property called "ContinueUpdateOnError"; when it is set to true, adapter.Update(table) will insert all the rows it can and tag the failed rows with the RowError property.
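For context, a rough sketch of that SqlDataAdapter pattern (the connection string, query, and column names are placeholders, not part of the original question):

```csharp
using System;
using System.Data;
using System.Data.SqlClient;

// Sketch of the adapter-based approach: insert what we can, then inspect RowError.
void InsertWithAdapter(string connectionString, DataTable table)
{
    using (var connection = new SqlConnection(connectionString))
    using (var adapter = new SqlDataAdapter("SELECT Hash, Payload FROM dbo.Destination", connection))
    using (var builder = new SqlCommandBuilder(adapter))   // supplies the generated InsertCommand
    {
        adapter.ContinueUpdateOnError = true;               // don't stop at the first failed row

        adapter.Update(table);                              // inserts every row it can

        foreach (DataRow row in table.Rows)
        {
            if (row.HasErrors)
            {
                // RowError holds the error text for rows that failed, e.g. a duplicate "Hash".
                Console.WriteLine($"Rejected {row["Hash"]}: {row.RowError}");
            }
        }
    }
}
```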
The question is: how can I use SqlBulkCopy to insert the data as quickly as possible while keeping track of which rows were inserted and which rows were not (due to the unique index)?
Here is the additional information:
- The process is iterative, often set on a schedule to repeat.
- The source and destination tables can be huge, sometimes millions of rows.
- Even though it is possible to check for the hash values first, it requires two transactions per row (first selecting the hash from the destination table, then performing the insertion). I think in the adapter.Update(table) case it is faster to check RowError than to check for hash hits per row.
Accepted answer by Sam Saffron
SqlBulkCopy has very limited error-handling facilities; by default it doesn't even check constraints.
However, it's fast, really really fast.
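For reference, the core SqlBulkCopy call is only a few lines (the names below are placeholders). One detail worth noting: SqlBulkCopyOptions.CheckConstraints only opts in to CHECK and FOREIGN KEY constraint checking; UNIQUE and PRIMARY KEY constraints are still enforced during the copy, so a duplicate "Hash" makes the whole WriteToServer call fail:

```csharp
using System.Data;
using System.Data.SqlClient;

// Minimal SqlBulkCopy sketch; table name and batch size are placeholders.
void BulkInsert(string connectionString, DataTable table)
{
    using (var bulkCopy = new SqlBulkCopy(connectionString, SqlBulkCopyOptions.CheckConstraints))
    {
        bulkCopy.DestinationTableName = "dbo.Destination";
        bulkCopy.BatchSize = 10_000;       // tune for millions of rows
        bulkCopy.WriteToServer(table);     // throws a SqlException when the server rejects a row,
                                           // e.g. on a duplicate "Hash"
    }
}
```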
If you want to work around the duplicate key issue and identify which rows in a batch are duplicates, one option is (sketched in code after the list):
- start tran
- grab a TABLOCKX on the table, select all current "Hash" values and chuck them in a HashSet
- filter out the duplicates and report
- insert the data
- commit tran
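A hedged sketch of that recipe (table and column names are assumptions, and "Hash" is treated as a string column):

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

void BulkInsertSkippingDuplicates(string connectionString, DataTable source)
{
    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();
        using (var tran = connection.BeginTransaction())
        {
            // Lock the table and read every existing Hash into memory.
            var existing = new HashSet<string>();
            using (var cmd = new SqlCommand(
                "SELECT Hash FROM dbo.Destination WITH (TABLOCKX)", connection, tran))
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    existing.Add(reader.GetString(0));
            }

            // Split the incoming rows into duplicates (to report) and new rows (to insert).
            var fresh = source.Clone();
            foreach (DataRow row in source.Rows)
            {
                if (existing.Contains((string)row["Hash"]))
                    Console.WriteLine($"Duplicate skipped: {row["Hash"]}");   // "report"
                else
                    fresh.ImportRow(row);
            }

            using (var bulkCopy = new SqlBulkCopy(connection, SqlBulkCopyOptions.Default, tran))
            {
                bulkCopy.DestinationTableName = "dbo.Destination";
                bulkCopy.WriteToServer(fresh);
            }

            tran.Commit();
        }
    }
}
```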
This process will work effectively if you are inserting huge sets and the amount of data already in the table is not too large.
Can you please expand your question to include the rest of the context of the problem?
EDIT
Now that I have some more context, here is another way you can go about it (see the sketch after the list):
- Do the bulk insert into a temp table.
- start serializable tran
- Select all temp rows that are already in the destination table ... report on them
- Insert the data in the temp table into the real table, performing a left join on hash and including all the new rows.
- commit the tran
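A sketch of how that could look, using the same placeholder table and column names as above (the #Staging temp table and its creation via SELECT TOP 0 ... INTO are my additions, not from the answer):

```csharp
using System;
using System.Data;
using System.Data.SqlClient;

void BulkInsertViaTempTable(string connectionString, DataTable source)
{
    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();

        // 1. Stage everything in a session-scoped temp table mirroring the destination.
        using (var create = new SqlCommand(
            "SELECT TOP 0 * INTO #Staging FROM dbo.Destination", connection))
        {
            create.ExecuteNonQuery();
        }
        using (var bulkCopy = new SqlBulkCopy(connection))
        {
            bulkCopy.DestinationTableName = "#Staging";
            bulkCopy.WriteToServer(source);
        }

        // 2. In a serializable transaction: report duplicates, then insert only the new rows.
        using (var tran = connection.BeginTransaction(IsolationLevel.Serializable))
        {
            using (var report = new SqlCommand(
                "SELECT s.Hash FROM #Staging s JOIN dbo.Destination d ON d.Hash = s.Hash",
                connection, tran))
            using (var reader = report.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine($"Already present: {reader.GetString(0)}");
            }

            using (var insert = new SqlCommand(
                @"INSERT INTO dbo.Destination
                  SELECT s.*
                  FROM #Staging s
                  LEFT JOIN dbo.Destination d ON d.Hash = s.Hash
                  WHERE d.Hash IS NULL", connection, tran))
            {
                insert.ExecuteNonQuery();
            }

            tran.Commit();
        }
    }
}
```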
That process is very light on round trips, and considering your specs it should end up being really fast.
Answered by Paladin
Note: This is a recap of Sam's answer with slightly more details.
Thanks to Sam for the answer. I have put this in an answer due to the space constraints on comments.
Deriving from your answer, I see two possible approaches:
Solution 1:
- start tran
- grab all possible "hash" hits by doing "select hash from destinationTable where hash in (val1, val2, ...)"
- filter out duplicates and report
- insert data
- commit tran
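A sketch of the probe query for this solution, assuming the hashes are strings and the batch is small enough for a parameterized IN list (for very large batches you would chunk the list or use a table-valued parameter):

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Linq;

// Returns the subset of the batch's hashes that already exist in dbo.Destination.
HashSet<string> FindExistingHashes(SqlConnection connection, SqlTransaction tran,
                                   IReadOnlyList<string> hashes)
{
    var found = new HashSet<string>();
    if (hashes.Count == 0)
        return found;

    // Build "@h0, @h1, ..." and bind one parameter per candidate hash.
    var names = hashes.Select((_, i) => "@h" + i).ToArray();
    var sql = $"SELECT Hash FROM dbo.Destination WHERE Hash IN ({string.Join(", ", names)})";

    using (var cmd = new SqlCommand(sql, connection, tran))
    {
        for (int i = 0; i < hashes.Count; i++)
            cmd.Parameters.AddWithValue(names[i], hashes[i]);

        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
                found.Add(reader.GetString(0));
        }
    }
    return found;
}
```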
Solution 2:
- Create a temp table that mirrors the schema of the destination table
- bulk insert into the temp table
- start serializable transaction
- Get the duplicate rows: "select tempTable.hash from tempTable inner join destinationTable on tempTable.hash = destinationTable.hash"
- report on the duplicate rows
- Insert the data in the temp table into the destination table: "insert into destinationTable select tempTable.* from tempTable left join destinationTable on tempTable.hash = destinationTable.hash where destinationTable.hash is null"
- commit the tran
Since we have two approaches, it comes down to which approach is more optimal. Both approaches have to retrieve the duplicate rows and report them, while the second approach additionally requires:
- temp table creation and deletion
- one more SQL command to move the data from the temp table to the destination table
- depending on the percentage of hash collisions, it also transfers a lot of unnecessary data across the wire
If these are the only solutions, it seems to me that the first approach wins. What do you guys think? Thanks!
Answered by Oli B
A slightly different approach than those already suggested: perform the SqlBulkCopy and catch the SqlException thrown:
Violation of PRIMARY KEY constraint 'PK_MyPK'. Cannot insert duplicate
key in object 'dbo.MyTable'. **The duplicate key value is (17)**.
You can then remove all items from your source starting from ID 17, the first record that was duplicated. I'm making assumptions here that apply to my circumstances and possibly not yours; i.e. that the duplication is caused by the exact same data from a previously failed SqlBulkCopy due to SQL/network errors during the upload.
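A sketch of that idea, assuming C# 6+ exception filters; SQL Server error numbers 2627 (constraint violation) and 2601 (duplicate row in a unique index) are standard, but the regex for pulling the key out of the message and what you do with it afterwards are assumptions:

```csharp
using System;
using System.Data;
using System.Data.SqlClient;
using System.Text.RegularExpressions;

void BulkCopyWithDuplicateDetection(string connectionString, DataTable table)
{
    try
    {
        using (var bulkCopy = new SqlBulkCopy(connectionString))
        {
            bulkCopy.DestinationTableName = "dbo.MyTable";
            bulkCopy.WriteToServer(table);
        }
    }
    catch (SqlException ex) when (ex.Number == 2627 || ex.Number == 2601)
    {
        // Message ends with "... The duplicate key value is (17)."
        var match = Regex.Match(ex.Message, @"duplicate key value is \((.+?)\)");
        if (match.Success)
        {
            var duplicateKey = match.Groups[1].Value;
            Console.WriteLine($"Duplicate key reported by SQL Server: {duplicateKey}");
            // From here the poster removes the already-inserted rows (ID 17 onwards
            // in his example) from the source and retries the copy.
        }
    }
}
```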