在 SQL Server 2008 中插入/更新大量数据的最佳实践
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/2342006/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Best practices for inserting/updating large amount of data in SQL Server 2008
提问by markvpc
I'm building a system for updating large amounts of data through various CSV feeds. Normally I would just loop though each row in the feed, do a select query to check if the item already exists and insert/update an item depending if it exists or not.
我正在构建一个系统,用于通过各种 CSV 提要更新大量数据。通常我只会遍历提要中的每一行,执行选择查询以检查该项目是否已存在并根据它是否存在插入/更新项目。
I feel this method isn't very scalable and could hammer the server on larger feeds. My solution is to loop through the items as normal but store them in memory. Then for every 100 or so items do a select on those 100 items and get a list of existing items in the database that match. Then concatenate the insert/update statements together and run them into the database. This would essentially cut down on the trips to the database.
我觉得这种方法的可扩展性不是很强,并且可能会在更大的提要上打击服务器。我的解决方案是像往常一样循环遍历这些项目,但将它们存储在内存中。然后对于每 100 个左右的项目,对这 100 个项目进行一次选择,并获取数据库中匹配的现有项目的列表。然后将插入/更新语句连接在一起并将它们运行到数据库中。这将从本质上减少到数据库的行程。
Is this a scalable enough solution and are there any example tutorials on importing large feeds into a productive environment?
这是一个足够可扩展的解决方案吗?是否有任何关于将大型提要导入生产环境的示例教程?
Thanks
谢谢
回答by marc_s
Seeing that you're using SQL Server 2008, I would recommend this approach:
看到您使用的是 SQL Server 2008,我会推荐这种方法:
- first bulkcopy your CSV files into a staging table
- update your target table from that staging table using the MERGE command
- 首先将您的 CSV 文件批量复制到临时表中
- 使用 MERGE 命令从该临时表更新您的目标表
Check out the MSDN docsand a great blog poston how to use the MERGE command.
查看MSDN 文档和有关如何使用 MERGE 命令的精彩博客文章。
Basically, you create a link between your actual data table and the staging table on a common criteria (e.g. a common primary key), and then you can define what to do when
基本上,您可以根据通用标准(例如通用主键)在实际数据表和暂存表之间创建链接,然后您可以定义何时执行什么操作
- the rows match, e.g. the row exists in both the source and the target table --> typically you'd either update some fields, or just ignore it all together
- the row from the source doesn't exist in the target --> typically a case for an INSERT
- 行匹配,例如该行同时存在于源表和目标表中 --> 通常你要么更新某些字段,要么一起忽略它
- 来自源的行在目标中不存在 --> 通常是 INSERT 的情况
You would have a MERGE
statement something like this:
你会有这样的MERGE
声明:
MERGE TargetTable AS t
USING SourceTable AS src
ON t.PrimaryKey = src.PrimaryKey
WHEN NOT MATCHED THEN
INSERT (list OF fields)
VALUES (list OF values)
WHEN MATCHED THEN
UPDATE
SET (list OF SET statements)
;
Of course, the ON
clause can be much more involved if needed. And of course, your WHEN
statements can also be more complex, e.g.
当然,ON
如果需要,该条款可以涉及更多。当然,你的WHEN
陈述也可以更复杂,例如
WHEN MATCHED AND (some other condition) THEN ......
and so forth.
等等。
MERGE
is a very powerful and very useful new command in SQL Server 2008 - use it, if you can!
MERGE
是 SQL Server 2008 中一个非常强大且非常有用的新命令 - 如果可以,请使用它!
回答by HLGEM
Your way is the worst possible solution. In general, you should not think in terms of looping through records individually. We used to have a company built import tool that loops through records, it would take 18-20 hours to load a file with over a million records (something that wasn't a frequent occurrence when it was built but which is a many times a day occurrence now).
你的方式是最糟糕的解决方案。通常,您不应该考虑单独循环遍历记录。我们曾经有一个公司构建的导入工具可以循环遍历记录,加载一个超过一百万条记录的文件需要 18 到 20 个小时(这在构建时并不经常发生,但这是很多次天发生现在)。
I see two options: First use bulk insert to load to a staging table and do whatever clean up you need to do on that table. How are you determining if the record already exists? You should be able to build a set-based update by joining to the staging table on those fields which determine the update. Often I have a added a column to my staging table for the id of the record it matches to and populated that through a query then done the update. Then you do an insert of the records which don't have a corresponding id. If you have too many records to do all at once, you may want to run in batches (which yes is a loop), but make the batches considerably larger than 1 record at a time (I usually start with 2000 and then based on the time it takes for that determine if I can do more or less in the batch).
我看到两个选项:首先使用批量插入加载到暂存表,然后在该表上执行您需要执行的任何清理操作。你如何确定记录是否已经存在?您应该能够通过在确定更新的那些字段上加入临时表来构建基于集合的更新。通常我会在我的临时表中添加一列,用于匹配它的记录的 ID,并通过查询填充它,然后完成更新。然后插入没有相应 id 的记录。如果您有太多的记录要一次完成,您可能希望分批运行(是的,这是一个循环),但每次使批次远大于 1 个记录(我通常从 2000 开始,然后基于确定我是否可以在批次中做更多或更少的时间所需的时间)。
I think 2008 also has a merge statement but I have not yet had a chance to use it. Look it up in books online.
我认为 2008 也有一个合并语句,但我还没有机会使用它。在网上的书上查一下。
The alternative is to use SSIS which is optimized for speed. SSIS is a complex thing though and the learning curve is steep.
另一种方法是使用针对速度进行了优化的 SSIS。SSIS 是一件复杂的事情,学习曲线很陡峭。
回答by Kris Krause
One way is load your CSV into a DataTable (or more likely a DataReader) and then batch slam in the results using SqlBulkCopy -
一种方法是将您的 CSV 加载到 DataTable(或更可能是 DataReader)中,然后使用 SqlBulkCopy 批量处理结果 -
http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy.aspx
http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy.aspx
Its pretty efficient and you can do some column mapping. Tip - when you map columns using SqlBulkCopy they are case sensitive.
它非常有效,您可以进行一些列映射。提示 - 当您使用 SqlBulkCopy 映射列时,它们区分大小写。
回答by Martin Milan
Another approach would be to write a .Net stored procedure on server on the server to operate on the entire file...
另一种方法是在服务器上的服务器上编写一个 .Net 存储过程来操作整个文件......
Only if you need more control than Kris Krause's solution though - I'm a big fan of keeping it simple (and reusable) where we can...
仅当您需要比 Kris Krause 的解决方案更多的控制时 - 我非常喜欢在我们可以的地方保持简单(和可重用)......
回答by glenatron
Do you need to be rolling your own here at all? Would it be possible to provide the data in such a way that the SQL Server can use Bulk Import to load it in and then deal with duplicates in the database once the import is complete?
你需要在这里滚动你自己的吗?是否有可能以这样一种方式提供数据,即 SQL Server 可以使用批量导入来加载它,然后在导入完成后处理数据库中的重复项?
When it comes to heavy lifting with a lot of data my experience tends to be that working in the database as much as possible is much quicker and less resource intensive.
当涉及到大量数据的繁重工作时,我的经验往往是,尽可能多地在数据库中工作会更快,资源占用更少。