Original question: http://stackoverflow.com/questions/736629/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Parse Delimited CSV in .NET
Asked by hacker
I have a text file in a comma-separated format, with most fields delimited by " characters. I am trying to get it into something I can enumerate through (a generic collection, for example). I have no control over how the file is output or over the character it uses as the delimiter.
In this case, the fields are separated by commas and text fields are enclosed in " marks. The problem I am running into is that some fields contain quotation marks (e.g. 8" Tray), and the text after the quote is mistakenly picked up as the next field. Numeric fields have no quotes around them, but they do start with a + or - sign (indicating a positive or negative number).
I was thinking of a RegEx, but my skills aren't that great so hopefully someone can come up with some ideas I can try. There are about 19,000 records in this file, so I am trying to do it as efficiently as possible. Here are a couple of example rows of data:
"00","000000112260 ","Pie Pumpkin ","RET","6.99 "," ","ea ",+0000000006.99000
"00","000000304078 ","Pie Apple caramel ","RET","9.99 "," ","ea ",+0000000009.99000
"00","StringValue here","8" Tray of Food ","RET","6.99 "," ","ea ",-00000000005.3200
There are a lot more fields, but you get the picture.
I am using VB.NET and I have a generic List set up to accept the data. I have tried using CSVReader, and it seems to work well until it hits a record like the third one (with a quote in the text field). If I could somehow get it to handle the additional quotes, then the CSVReader option would work great.
Thanks!
Accepted answer by Mitch Wheat
From here:
Encoding fileEncoding = GetFileEncoding(csvFile); // GetFileEncoding is a helper that is not shown here
// get rid of all double quotes except those used as field delimiters
string fileContents = File.ReadAllText(csvFile, fileEncoding);
// "$1$2" keeps the characters on either side of the stray quote
string fixedContents = Regex.Replace(fileContents, @"([^\^,\r\n])""([^$,\r\n])", @"$1$2");
using (CsvReader csv =
    new CsvReader(new StringReader(fixedContents), true))
{
    // ... parse the CSV
}
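To see what this kind of replacement does, here is a minimal sketch that runs the pattern against a shortened copy of the third sample row. It assumes the replacement string is "$1$2" (so the characters around the stray quote are kept), and it is written as C# top-level statements purely for brevity. Note that a stray quote sitting directly next to a comma cannot be told apart from a real delimiter by this pattern.

```csharp
using System;
using System.Text.RegularExpressions;

// Shortened copy of the third sample row; the stray quote follows the 8.
string row = "\"00\",\"StringValue here\",\"8\" Tray of Food \",\"RET\",-00000000005.3200";

// A quote is removed only when neither neighbour is a comma or a line break.
string cleaned = Regex.Replace(row, @"([^\^,\r\n])""([^$,\r\n])", "$1$2");

Console.WriteLine(cleaned);
```

The delimiter quotes survive because each of them touches a comma, the start of the line, or the end of the line.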
Answer by Avi
I recommend looking at the TextFieldParser class in .NET. You need to include
Imports Microsoft.VisualBasic.FileIO
Here's a quick sample:
Dim afile As New TextFieldParser(FileName)
Dim CurrentRecord As String() ' this array will hold the fields of each line
afile.TextFieldType = FieldType.Delimited
afile.Delimiters = New String() {","}
afile.HasFieldsEnclosedInQuotes = True
' parse the actual file
Do While Not afile.EndOfData
    Try
        CurrentRecord = afile.ReadFields()
    Catch ex As MalformedLineException
        ' the line could not be parsed with the current settings; break into the debugger
        Stop
    End Try
Loop
afile.Close()
Answer by Middletone
Try this site: http://kbcsv.codeplex.com/
I've looked for a good utility, and this is hands down the best I've found; it works correctly. Don't waste your time trying other stuff: this is free and it works.
Answer by stone
As this link says... Don't roll your own CSV parser!
Use TextFieldParser as Avi suggested. Microsoft has already done this for you. If you have ended up writing your own parser and you find a bug in it, consider replacing it instead of fixing the bug. I did just that recently and it saved me a lot of time.
Answer by CMS
Take a look at the FileHelpers library.
Answer by Josh Close
You could give CsvHelper (a library I maintain) a try; it's available via NuGet. It follows the RFC 4180 standard for CSV. It will be able to handle any content inside a field, including commas, quotes, and new lines.
CsvHelper is simple to use, but it's also easy to configure it to work with many different types of delimited files.
CsvReader csv = new CsvReader( streamToFile );
IEnumerable<MyObject> myObjects = csv.GetRecords<MyObject>();
If you want to read CSV files on a lower level, you can use the parser directly, which will return each row as a string array.
var parser = new CsvParser( myTextReader );
while( true )
{
string[] line = parser.ReadLine();
if( line == null )
{
break;
}
}
Answer by mariob
A RegEx that excludes the first and last quote of each line would be (?<!^)(?<!,)("")(?!,)(?!$). Of course, you need to use RegexOptions.Multiline.
That way there is no need for an evaluator function. My code replaces the undesired double quotes with single quotes.
The complete C# code is below.
string fixedCSV = Regex.Replace(
    File.ReadAllText(fileName),
    // comma is the field separator here, matching the pattern given above
    @"(?<!^)(?<!,)("")(?!,)(?!$)", "'", RegexOptions.Multiline);
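As a rough check of the lookaround pattern, the sketch below applies it to two made-up rows whose last field is quoted, using a comma as the separator. It assumes lines end in a bare "\n"; in .NET the multiline `$` only matches before "\n", so with "\r\n" line endings a closing quote at the end of a line would likely be replaced as well.

```csharp
using System;
using System.Text.RegularExpressions;

// Two illustrative rows (not taken verbatim from the question), separated by "\n".
string csv =
    "\"00\",\"Pie Pumpkin \",\"ea \"\n" +
    "\"00\",\"8\" Tray of Food \",\"ea \"";

// Replace quotes that are not at a line start, not next to a comma, and not at a line end.
string fixedCsv = Regex.Replace(
    csv,
    @"(?<!^)(?<!,)("")(?!,)(?!$)", "'", RegexOptions.Multiline);

Console.WriteLine(fixedCsv);
```

Only the quote after the 8 is turned into a single quote; the delimiter quotes at the start and end of each line are left alone.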
Answer by hacker
I am posting this as an answer so I can explain how I did it and why. The answer from Mitch Wheat gave me the best solution for this case; I just had to modify it slightly because of the format this data was exported in.
Here is the VB code:
Dim fixedContents As String = Regex.Replace(
File.ReadAllText(csvFile, fileEncoding),
"(?<!,)("")(?!,)",
AddressOf ReplaceQuotes)
The RegEx is the part I needed to change, because certain fields had unescaped quotes in them and the RegEx provided didn't seem to work on all examples. This one uses a negative lookbehind and a negative lookahead to check whether the quote comes just after or just before a comma. Both are negative, meaning "show me where the double quote is neither immediately after nor immediately before a comma". This should mean that the quote is in the middle of a string.
In this case, instead of doing a direct replacement, I am using the function ReplaceQuotes to handle it for me. I am using it because I needed a little extra logic to detect whether the quote was at the beginning of a line. If I had spent even more time on it, I am sure I could have tweaked the RegEx to take the beginning of the line into account (using MultiLine, etc.), but when I tried it quickly, it didn't seem to work at all.
With this in place, using CSV reader on a 32MB CSV file (about 19000 rows), it takes about 2 seconds to read the file, perform the regex, load it into the CSV Reader, add all the data to my generic class and finish. Real quick!!
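The ReplaceQuotes function itself is not shown in the answer. The following C# sketch is a hypothetical stand-in, assuming it keeps a quote that opens a line and drops every other matched quote; the real function may well have done something different. It also assumes the last field of each row is unquoted, as in the question's data, since a closing quote at the end of a line would match this pattern too.

```csharp
using System;
using System.Text.RegularExpressions;

// Shortened copies of two sample rows from the question.
string input =
    "\"00\",\"8\" Tray of Food \",\"RET\",-00000000005.3200\n" +
    "\"00\",\"Pie Apple caramel \",\"RET\",+0000000009.99000\n";

// Hypothetical evaluator: keep the quote if it starts a line, otherwise remove it.
string ReplaceQuotes(Match m)
{
    bool atLineStart = m.Index == 0 || input[m.Index - 1] == '\n';
    return atLineStart ? m.Value : "";
}

// Same pattern as the VB code: a quote neither preceded nor followed by a comma.
string fixedContents = Regex.Replace(input, "(?<!,)(\")(?!,)", ReplaceQuotes);

Console.WriteLine(fixedContents);
```

Doing the line-start check in the evaluator keeps the pattern simple, at the cost of one callback per matched quote.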
Answer by rvarcher
The logic of this custom approach is: read through the file one line at a time, split each line on the comma, remove the first and last character of each item (removing the outer quotes but not affecting any inner quotes), then add the data to your generic list. It's short and very easy to read and work with.
Dim fr As StreamReader = Nothing
Dim FileString As String = ""
Dim LineItemsArr() as String
Dim FilePath As String = HttpContext.Current.Request.MapPath("YourFile.csv")
fr = New System.IO.StreamReader(FilePath)
While fr.Peek <> -1
FileString = fr.ReadLine.Trim
If String.IsNullOrEmpty(FileString) Then Continue While 'Empty Line
LineItemsArr = FileString.Split(",")
For Each Item as String In LineItemsArr
'If every item will have a beginning and closing " (quote) then you can just
'cut the first and last characters of the string here.
'i.e. UpdatedItems = Item. remove first and last character
'Then stick the data into your Generic List (Of String()?)
Next
End While
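The step left as comments in the loop above could look roughly like the following C# sketch. It works on an in-memory array of lines instead of a file, and the variable names are illustrative. It only strips the outer quotes when both are present, so unquoted numeric fields pass through unchanged. Be aware that this approach breaks as soon as a field contains a comma.

```csharp
using System;
using System.Collections.Generic;

// Illustrative input: one shortened sample row plus an empty line.
string[] lines =
{
    "\"00\",\"8\" Tray of Food \",\"RET\",-00000000005.3200",
    ""
};

var records = new List<string[]>();
foreach (string line in lines)
{
    string trimmed = line.Trim();
    if (trimmed.Length == 0) continue; // skip empty lines

    string[] items = trimmed.Split(',');
    for (int i = 0; i < items.Length; i++)
    {
        string item = items[i];
        // Strip the outer quotes only; quotes inside the value are left as they are.
        if (item.Length >= 2 && item[0] == '"' && item[item.Length - 1] == '"')
            items[i] = item.Substring(1, item.Length - 2);
    }
    records.Add(items);
}

Console.WriteLine(string.Join("|", records[0]));
```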
Answer by llamaoo7
Your problem with CSVReader is that the quote in the third record isn't escaped with another quote (a.k.a. double quoting). If you don't escape them, then how would you expect to handle ", in the middle of a text field?
http://en.wikipedia.org/wiki/Comma-separated_values
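For reference, double quoting as described in RFC 4180 means that every quote inside a quoted field is written twice. A minimal C# sketch of how a well-behaved exporter would write such a field follows; QuoteField is an illustrative helper name, not part of any library.

```csharp
using System;

// Wrap a value in quotes and double any quote it contains.
string QuoteField(string value) => "\"" + value.Replace("\"", "\"\"") + "\"";

string quoted = QuoteField("8\" Tray of Food");
string withComma = QuoteField("Tray, large");

Console.WriteLine(quoted);
Console.WriteLine(withComma);
```

A file written this way can be read by a standard CSV parser without any regex preprocessing, because a doubled quote is never mistaken for the end of the field.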
(I once had to work with files (with different delimiters) in which the quote characters inside a text value weren't escaped, and I ended up writing my own custom parser. I do not know whether this was absolutely necessary.)

