Original URL: http://stackoverflow.com/questions/5011335/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Recommended method to import a .csv file into Microsoft SQL Server 2008 R2?
Asked by Contango
What is your recommended way to import .csv files into Microsoft SQL Server 2008 R2?
I'd like something fast, as I have a directory with a lot of .csv files (>500MB spread across 500 .csv files).
I'm using SQL Server 2008 R2 on Win 7 x64.
Update: Solution
Here's how I solved the problem in the end:
- I abandoned trying to use LINQ to Entities to do the job. It works, but it doesn't support bulk insert, so it's about 20x slower. Maybe the next version of LINQ to Entities will support this.
- Took the advice given in this thread and used BULK INSERT.
- I created a T-SQL stored procedure that uses BULK INSERT. Data goes into a staging table, is normalized, then copied into the target tables (a minimal sketch follows this list).
- I mapped the stored procedure into C# using the LINQ to Entities framework (there is a video on www.learnvisualstudio.net showing how to do this).
- I wrote all the code to cycle through files, etc., in C#.
- This method eliminates the biggest bottleneck, which is reading tons of data off the drive and inserting it into the database.
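For illustration, a minimal sketch of that stored-procedure pattern. The table names, columns, and CSV layout here are hypothetical placeholders, not the originals from the project; adjust FIELDTERMINATOR/ROWTERMINATOR to match your files.

CREATE PROCEDURE dbo.ImportCsvFile
    @FilePath NVARCHAR(260)
AS
BEGIN
    SET NOCOUNT ON;

    -- Start from an empty staging table.
    TRUNCATE TABLE dbo.StagingTable;

    -- BULK INSERT does not accept a variable for the file name,
    -- so build the statement dynamically.
    DECLARE @sql NVARCHAR(MAX) =
        N'BULK INSERT dbo.StagingTable
          FROM ''' + @FilePath + N'''
          WITH (FIELDTERMINATOR = '','', ROWTERMINATOR = ''\n'', FIRSTROW = 2);';
    EXEC sp_executesql @sql;

    -- Normalize and copy from staging into the real target table.
    INSERT INTO dbo.TargetTable (Col1, Col2)
    SELECT DISTINCT Col1, Col2
    FROM dbo.StagingTable;
END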
Why is this method so quick at reading .csv files? Microsoft SQL Server imports the files straight from the hard drive into the database using its own highly optimized routines. Most of the other C#-based solutions require much more code, and some (like LINQ to Entities) end up having to pipe the data slowly into the database over the C#-to-SQL-Server link.
Yes, I know it'd be nicer to have 100% pure C# code to do the job, but in the end:
- (a) For this particular problem, using T-SQL requires much less code compared to C#, about 1/10th, especially for the logic to denormalize the data from the staging table. This is simpler and more maintainable.
- (b) Using T-SQL means you can take advantage of the native bulk insert procedures, which speeds things up from a 20-minute wait to a 30-second pause.
Accepted answer by Jason
Using BULK INSERT in a T-SQL script seems to be a good solution.
You can get the list of files in your directory with xp_cmdshell and the dir command (with a bit of cleanup). In the past, I tried to do something like this with sp_OAMethod and VBScript functions and had to use the dir method because I had trouble getting the list of files with the FSO object.
http://www.sqlusa.com/bestpractices2008/list-files-in-directory/
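A rough sketch of that approach, with placeholder paths and table names (xp_cmdshell must be enabled via sp_configure first, and BULK INSERT needs dynamic SQL because it does not accept a variable file name):

-- Collect the bare file names from the import directory.
CREATE TABLE #Files (FileName NVARCHAR(260));

INSERT INTO #Files (FileName)
EXEC xp_cmdshell 'dir /b C:\Import\*.csv';

-- xp_cmdshell appends a NULL row; clean it up.
DELETE FROM #Files WHERE FileName IS NULL;

DECLARE @FileName NVARCHAR(260), @sql NVARCHAR(MAX);
DECLARE file_cursor CURSOR FOR SELECT FileName FROM #Files;
OPEN file_cursor;
FETCH NEXT FROM file_cursor INTO @FileName;
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @sql = N'BULK INSERT dbo.StagingTable FROM ''C:\Import\' + @FileName +
               N''' WITH (FIELDTERMINATOR = '','', ROWTERMINATOR = ''\n'');';
    EXEC sp_executesql @sql;
    FETCH NEXT FROM file_cursor INTO @FileName;
END
CLOSE file_cursor;
DEALLOCATE file_cursor;
DROP TABLE #Files;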
Answered by John Saunders
If you have to do anything with the data in the files other than insert it, then I would recommend using SSIS. It can not only insert and/or update, it can also clean the data for you.
Answered by globalheap
The first officially supported way of importing large text files is the command-line tool called "bcp" (Bulk Copy Utility), which is very useful for huge amounts of binary data.
Please check out this link: http://msdn.microsoft.com/en-us/library/ms162802.aspx
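For reference, a typical bcp import invocation looks something like the following (server, database, and table names are placeholders; -T uses a trusted connection, -c selects character mode, and -t sets the field terminator):

bcp YourDatabase.dbo.YourTable in "C:\Import\data.csv" -S YourServer -T -c -t,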
However, in SQL Server 2008 I presume the BULK INSERT command would be your number one choice, because for one thing it is part of the standard command set. If for any reason you have to maintain backward compatibility, I'd stick with the bcp utility, which is available for SQL Server 2000 too.
HTH :)
EDITED LATER: Googling around, I recalled that SQL Server 2000 had the BULK INSERT command too... however, there was obviously some reason I stuck with bcp.exe, and I cannot recall why... perhaps some limitation, I guess.
Answered by kombsh
I would recommend this:
using System;
using System.Data;
using Microsoft.VisualBasic.FileIO;

namespace ReadDataFromCSVFile
{
    static class Program
    {
        static void Main()
        {
            string csv_file_path = @"C:\Users\Administrator\Desktop\test.csv";
            DataTable csvData = GetDataTabletFromCSVFile(csv_file_path);
            Console.WriteLine("Rows count:" + csvData.Rows.Count);
            Console.ReadLine();
        }

        private static DataTable GetDataTabletFromCSVFile(string csv_file_path)
        {
            DataTable csvData = new DataTable();
            try
            {
                using (TextFieldParser csvReader = new TextFieldParser(csv_file_path))
                {
                    csvReader.SetDelimiters(new string[] { "," });
                    csvReader.HasFieldsEnclosedInQuotes = true;

                    // Use the first row as column names.
                    string[] colFields = csvReader.ReadFields();
                    foreach (string column in colFields)
                    {
                        DataColumn datecolumn = new DataColumn(column);
                        datecolumn.AllowDBNull = true;
                        csvData.Columns.Add(datecolumn);
                    }

                    while (!csvReader.EndOfData)
                    {
                        string[] fieldData = csvReader.ReadFields();
                        // Making empty value as null
                        for (int i = 0; i < fieldData.Length; i++)
                        {
                            if (fieldData[i] == "")
                            {
                                fieldData[i] = null;
                            }
                        }
                        csvData.Rows.Add(fieldData);
                    }
                }
            }
            catch (Exception ex)
            {
                // Errors are silently swallowed here; log ex in real code.
            }
            return csvData;
        }
    }
}
// Copy the DataTable to SQL Server using SqlBulkCopy
// (requires using System.Data.SqlClient;)
static void InsertDataIntoSQLServerUsingSQLBulkCopy(DataTable csvFileData)
{
    using (SqlConnection dbConnection = new SqlConnection("Data Source=ProductHost;Initial Catalog=yourDB;Integrated Security=SSPI;"))
    {
        dbConnection.Open();
        using (SqlBulkCopy s = new SqlBulkCopy(dbConnection))
        {
            s.DestinationTableName = "Your table name";
            // Map each source column to the destination column of the same name.
            foreach (DataColumn column in csvFileData.Columns)
                s.ColumnMappings.Add(column.ToString(), column.ToString());
            s.WriteToServer(csvFileData);
        }
    }
}
Answered by pcofre
If the structure of all your CSVs is the same, I recommend you use Integration Services (SSIS) to loop over them and insert all of them into the same table.
Answered by paparazzo
I understand this is not exactly your question. But if you get into a situation where you use a straight INSERT, use TABLOCK and insert multiple rows. It depends on the row size, but I usually go for 600-800 rows at a time. If it is a load into an empty table, then sometimes dropping the indexes and recreating them after the load is faster. Sort the data on the clustered index before it is loaded, if you can. Use IGNORE_CONSTRAINTS and IGNORE_TRIGGERS if you can. Put the database in single-user mode if you can.
USE AdventureWorks2008R2;
GO
INSERT INTO Production.UnitMeasure WITH (TABLOCK)
VALUES (N'FT2', N'Square Feet ', '20080923'),
       (N'Y', N'Yards', '20080923'),
       (N'Y3', N'Cubic Yards', '20080923');
GO
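And a rough T-SQL sketch of the bigger load pattern described above, with placeholder database, table, and index names: drop the index, load with TABLOCK, rebuild, then return to multi-user mode.

-- Keep other sessions out during the load.
ALTER DATABASE YourDb SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

-- Drop the index before a big load into a large table.
DROP INDEX IX_Target_Col1 ON dbo.TargetTable;

-- TABLOCK helps SQL Server minimize logging for the bulk load.
INSERT INTO dbo.TargetTable WITH (TABLOCK)
SELECT Col1, Col2
FROM dbo.StagingTable
ORDER BY Col1;  -- pre-sort on the clustered key if possible

-- Recreate the index once the data is in.
CREATE INDEX IX_Target_Col1 ON dbo.TargetTable (Col1);

ALTER DATABASE YourDb SET MULTI_USER;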