C# .NET 的 CSV 解析选项

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/9642055/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-09 08:17:04  来源:igfitidea点击:

CSV Parsing Options with .NET

c#.netparsing

提问by Brent Arias

I'm looking at my delimited-file (e.g. CSV, tab seperated, etc.) parsing options based on MS stack in general, and .net specifically. The only technology I'm excluding is SSIS, because I already know it will not meet my needs.

我正在查看基于 MS 堆栈的分隔文件(例如 CSV、制表符分隔等)解析选项,尤其是 .net。我唯一排除的技术是 SSIS,因为我已经知道它不能满足我的需求。

So my options appear to be:

所以我的选择似乎是:

  1. Regex.Split
  2. TextFieldParser
  3. OLEDB CSV Parser
  1. 正则表达式拆分
  2. 文本字段解析器
  3. OLEDB CSV 解析器

I have two criteria I must meet. First, given the following file which contains two logical rows of data (and five physical rows altogether):

我必须满足两个标准。首先,给定以下文件,其中包含两个逻辑数据行(总共五个物理行):

101, Bob, "Keeps his house ""clean"".
Needs to work on laundry."
102, Amy, "Brilliant.
Driven.
Diligent."

101, Bob, "Keeps his house ""clean"".
Needs to work on laundry."
102, Amy, "Brilliant.
Driven.
Diligent."

The parsed results must yield two logical "rows," consisting of three strings (or columns) each. The third row/column string must preserve the newlines! Said differently, the parser must recognize when lines are "continuing" onto the next physical row, due to the "unclosed" text qualifier.

解析结果必须产生两个逻辑“行”,每行由三个字符串(或列)组成。第三行/列字符串必须保留换行符!换句话说,由于“未关闭”文本限定符,解析器必须识别行何时“继续”到下一个物理行。

The second criteria is that the delimiter and text qualifier must be configurable, per file. Here are two strings, taken from different files, that I must be able to parse:

第二个标准是每个文件的分隔符和文本限定符必须是可配置的。这里有两个字符串,取自不同的文件,我必须能够解析:

var first = @"""This"",""Is,A,Record"",""That """"Cannot"""", they say,"","""",,""be"",rightly,""parsed"",at all";
var second = @"~This~|~Is|A|Record~|~ThatCannot~|~be~|~parsed~|at all";

A proper parsing of string "first" would be:

字符串“first”的正确解析是:

  • This
  • Is,A,Record
  • That "Cannot", they say,
  • _
  • _
  • be
  • rightly
  • parsed
  • at all
  • 这个
  • 是,A,记录
  • 那个“不能”,他们说,
  • _
  • _
  • 正确地
  • 解析
  • 根本

The '_' simply means that a blank was captured - I don't want a literal underbar to appear.

'_' 仅表示捕获了一个空白 - 我不希望出现文字下划线。

One important assumption can be made about the flat-files to be parsed: there will be a fixed number of columns per file.

可以对要解析的平面文件做出一个重要假设:每个文件将有固定数量的列。

Now for a dive into the technical options.

现在深入了解技术选项。

REGEX

正则表达式

First, many responders comment that regex "is not the best way" to achieve the goal. I did, however, find a commenter who offered an excellent CSV regex:

首先,许多响应者评论正则表达式“不是实现目标的最佳方式”。但是,我确实找到了一位提供出色 CSV 正则表达式评论者

var regex = @",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))";
var Regex.Split(first, regex).Dump();

The results, applied to string "first," are quite wonderful:

应用于字符串“first”的结果非常棒:

  • "This"
  • "Is,A,Record"
  • "That ""Cannot"", they say,"
  • ""
  • _
  • "be"
  • rightly
  • "parsed"
  • at all
  • “这个”
  • “是,A,记录”
  • “那个“不能”,他们说,“
  • “”
  • _
  • “是”
  • 正确地
  • “解析”
  • 根本

It would be nice if the quotes were cleaned up, but I can easily deal with that as a post-process step. Otherwise, this approach can be used to parse both sample strings "first" and "second," provided the regex is modified for tilde and pipe symbols accordingly. Excellent!

如果引号被清理干净会很好,但我可以轻松地将其作为后处理步骤处理。否则,此方法可用于解析示例字符串“first”和“second”,前提是相应地针对波浪号和管道符号修改了正则表达式。优秀!

But the real problem pertains to the multi-line criteria. Before a regex can be applied to a string, I must read the full logical "row" from the file. Unfortunately, I don't know how many physical rows to read to complete the logical row, unless I've got a regex / state machine.

但真正的问题与多行标准有关。在将正则表达式应用于字符串之前,我必须从文件中读取完整的逻辑“行”。不幸的是,除非我有正则表达式/状态机,否则我不知道要读取多少物理行才能完成逻辑行。

So this becomes a "chicken and the egg" problem. My best option would be to read the entire file into memory as one giant string, and let the regex sort-out the multiple lines (I didn't check if the above regex could handle that). If I've got a 10 gig file, this could be a bit precarious.

所以这就变成了一个“先有鸡还是先有蛋”的问题。我最好的选择是将整个文件作为一个巨大的字符串读入内存,然后让正则表达式整理出多行(我没有检查上面的正则表达式是否可以处理)。如果我有一个 10 gig 文件,这可能有点不稳定。

On to the next option.

进入下一个选项。

TextFieldParser

文本字段解析器

Three lines of code will make the problem with this option apparent:

三行代码将使此选项的问题显而易见:

var reader = new Microsoft.VisualBasic.FileIO.TextFieldParser(stream);
reader.Delimiters = new string[] { @"|" };
reader.HasFieldsEnclosedInQuotes = true;

The Delimiters configuration looks good. However, the "HasFieldsEnclosedInQuotes" is "game over." I'm stunned that the delimiters are arbitrarily configurable, but in contrast I have no other qualifier option other than quotations. Remember, I need configurability over the text qualifier. So again, unless someone knows a TextFieldParser configuration trick, this is game over.

分隔符配置看起来不错。但是,“HasFieldsEnclosedInQuotes”是“游戏结束”。我很惊讶分隔符可以任意配置,但相比之下,除了引号之外,我没有其他限定符选项。请记住,我需要文本限定符的可配置性。再说一次,除非有人知道 TextFieldParser 配置技巧,否则游戏就结束了。

OLEDB

有机发光二极管

A colleague tells me this option has two major failings. First, it has terrible performance for large (e.g. 10 gig) files. Second, so I'm told, it guesses data types of input data rather than letting you specify. Not good.

一位同事告诉我这个选项有两个主要缺点。首先,它对于大型(例如 10 gig)文件的性能很差。其次,我被告知,它猜测输入数据的数据类型,而不是让您指定。不好。

HELP

帮助

So I'd like to know the facts I got wrong (if any), and the other options that I missed. Perhaps someone knows a way to jury-rig TextFieldParser to use an arbitrary delimiter. And maybe OLEDB has resolved the stated issues (or perhaps never had them?).

所以我想知道我错了的事实(如果有的话),以及我错过的其他选项。也许有人知道一种方法可以让 TextFieldParser 使用任意分隔符。也许 OLEDB 已经解决了上述问题(或者也许从未解决过这些问题?)。

What say ye?

你们怎么说?

采纳答案by Dour High Arch

Did you try searching for an already-existing .NET CSV parser? This oneclaims to handle multi-line records significantly faster than OLEDB.

您是否尝试搜索已经存在的 .NET CSV 解析器这个声称处理多行记录的速度比 OLEDB 快得多。

回答by Joel Coehoorn

Take a look at the code I posted to this question:

看看我发布到这个问题的代码:

https://stackoverflow.com/a/1544743/3043

https://stackoverflow.com/a/1544743/3043

It covers mostof your requirements, and it wouldn't take much to update it to support alternate delimiters or text qualifiers.

它涵盖了您的大部分需求,更新它以支持备用分隔符或文本限定符并不需要太多时间。

回答by Tim Medora

I wrote this a while back as a lightweight, standalone CSV parser. I believe it meets all of your requirements. Give it a try with the knowledge that it probably isn't bulletproof.

不久前,我将它写为一个轻量级的独立 CSV 解析器。我相信它可以满足您的所有要求。试一试,知道它可能不是防弹的。

If it does work for you, feel free to change the namespace and use without restriction.

如果它确实适合您,请随意更改命名空间并不受限制地使用。

namespace NFC.Portability
{
    using System;
    using System.Collections.Generic;
    using System.Data;
    using System.IO;
    using System.Linq;
    using System.Text;

    /// <summary>
    /// Loads and reads a file with comma-separated values into a tabular format.
    /// </summary>
    /// <remarks>
    /// Parsing assumes that the first line will always contain headers and that values will be double-quoted to escape double quotes and commas.
    /// </remarks>
    public unsafe class CsvReader
    {
        private const char SEGMENT_DELIMITER = ',';
        private const char DOUBLE_QUOTE = '"';
        private const char CARRIAGE_RETURN = '\r';
        private const char NEW_LINE = '\n';

        private DataTable _table = new DataTable();

        /// <summary>
        /// Gets the data contained by the instance in a tabular format.
        /// </summary>
        public DataTable Table
        {
            get
            {
                // validation logic could be added here to ensure that the object isn't in an invalid state

                return _table;
            }
        }

        /// <summary>
        /// Creates a new instance of <c>CsvReader</c>.
        /// </summary>
        /// <param name="path">The fully-qualified path to the file from which the instance will be populated.</param>
        public CsvReader( string path )
        {
            if( path == null )
            {
                throw new ArgumentNullException( "path" );
            }

            FileStream fs = new FileStream( path, FileMode.Open );
            Read( fs );
        }

        /// <summary>
        /// Creates a new instance of <c>CsvReader</c>.
        /// </summary>
        /// <param name="stream">The stream from which the instance will be populated.</param>
        public CsvReader( Stream stream )
        {
            if( stream == null )
            {
                throw new ArgumentNullException( "stream" );
            }

            Read( stream );
        }

        /// <summary>
        /// Creates a new instance of <c>CsvReader</c>.
        /// </summary>
        /// <param name="bytes">The array of bytes from which the instance will be populated.</param>
        public CsvReader( byte[] bytes )
        {
            if( bytes == null )
            {
                throw new ArgumentNullException( "bytes" );
            }

            MemoryStream ms = new MemoryStream();
            ms.Write( bytes, 0, bytes.Length );
            ms.Position = 0;

            Read( ms );
        }

        private void Read( Stream s )
        {
            string lines;

            using( StreamReader sr = new StreamReader( s ) )
            {
                lines = sr.ReadToEnd();
            }

            if( string.IsNullOrWhiteSpace( lines ) )
            {
                throw new InvalidOperationException( "Data source cannot be empty." );
            }

            bool inQuotes = false;
            int lineNumber = 0;
            StringBuilder buffer = new StringBuilder( 128 );
            List<string> values = new List<string>();

            Action endSegment = () =>
            {
                values.Add( buffer.ToString() );
                buffer.Clear();
            };

            Action endLine = () =>
            {
                if( lineNumber == 0 )
                {
                    CreateColumns( values );
                    values.Clear();
                }
                else
                {
                    CreateRow( values );
                    values.Clear();
                }

                values.Clear();
                lineNumber++;
            };

            fixed( char* pStart = lines )
            {
                char* pChar = pStart;
                char* pEnd = pStart + lines.Length;

                while( pChar < pEnd ) // leave null terminator out
                {
                    if( *pChar == DOUBLE_QUOTE )
                    {
                        if( inQuotes )
                        {
                            if( Peek( pChar, pEnd ) == SEGMENT_DELIMITER )
                            {
                                endSegment();
                                pChar++;
                            }
                            else if( !ApproachingNewLine( pChar, pEnd ) )
                            {
                                buffer.Append( DOUBLE_QUOTE );
                            }
                        }

                        inQuotes = !inQuotes;
                    }
                    else if( *pChar == SEGMENT_DELIMITER )
                    {
                        if( !inQuotes )
                        {
                            endSegment();
                        }
                        else
                        {
                            buffer.Append( SEGMENT_DELIMITER );
                        }
                    }
                    else if( AtNewLine( pChar, pEnd ) )
                    {
                        if( !inQuotes )
                        {
                            endSegment();
                            endLine();
                            //pChar++;
                        }
                        else
                        {
                            buffer.Append( *pChar );
                        }
                    }
                    else
                    {
                        buffer.Append( *pChar );
                    }

                    pChar++;
                }
            }

            // append trailing values at the end of the file
            if( values.Count > 0 )
            {
                endSegment();
                endLine();
            }
        }

        /// <summary>
        /// Returns the next character in the sequence but does not advance the pointer. Checks bounds.
        /// </summary>
        /// <param name="pChar">Pointer to current character.</param>
        /// <param name="pEnd">End of range to check.</param>
        /// <returns>
        /// Returns the next character in the sequence, or char.MinValue if range is exceeded.
        /// </returns>
        private char Peek( char* pChar, char* pEnd )
        {
            if( pChar < pEnd )
            {
                return *( pChar + 1 );
            }

            return char.MinValue;
        }

        /// <summary>
        /// Determines if the current character represents a newline. This includes lookahead for two character newline delimiters.
        /// </summary>
        /// <param name="pChar"></param>
        /// <param name="pEnd"></param>
        /// <returns></returns>
        private bool AtNewLine( char* pChar, char* pEnd )
        {
            if( *pChar == NEW_LINE )
            {
                return true;
            }

            if( *pChar == CARRIAGE_RETURN && Peek( pChar, pEnd ) == NEW_LINE )
            {
                return true;
            }

            return false;
        }

        /// <summary>
        /// Determines if the next character represents a newline, or the start of a newline.
        /// </summary>
        /// <param name="pChar"></param>
        /// <param name="pEnd"></param>
        /// <returns></returns>
        private bool ApproachingNewLine( char* pChar, char* pEnd )
        {
            if( Peek( pChar, pEnd ) == CARRIAGE_RETURN || Peek( pChar, pEnd ) == NEW_LINE )
            {
                // technically this cheats a little to avoid a two char peek by only checking for a carriage return or new line, not both in sequence
                return true;
            }

            return false;
        }

        private void CreateColumns( List<string> columns )
        {
            foreach( string column in columns )
            {
                DataColumn dc = new DataColumn( column );
                _table.Columns.Add( dc );
            }
        }

        private void CreateRow( List<string> values )
        {
            if( values.Where( (o) => !string.IsNullOrWhiteSpace( o ) ).Count() == 0 )
            {
                return; // ignore rows which have no content
            }

            DataRow dr = _table.NewRow();
            _table.Rows.Add( dr );

            for( int i = 0; i < values.Count; i++ )
            {
                dr[i] = values[i];
            }
        }
    }
}