C#: Need to get the line terminator when using StreamReader.ReadLine()

Question

Asked by Tony Trozzo

I wrote a C# program to read an Excel .xls/.xlsx file and output to CSV and Unicode text. I wrote a separate program to remove blank records. This is accomplished by reading each line with StreamReader.ReadLine(), and then going character by character through the string and not writing the line to output if it contains all commas (for the CSV) or all tabs (for the Unicode text).

The problem occurs when the Excel file contains embedded newlines (\x0A) inside the cells. I changed my XLS to CSV converter to find these new lines (since it goes cell by cell) and write them as \x0A, and normal lines just use StreamWriter.WriteLine().

The problem occurs in the separate program to remove blank records. When I read in with StreamReader.ReadLine(), by definition it only returns the string with the line, not the terminator. Since the embedded newlines show up as two separate lines, I can't tell which is a full record and which is an embedded newline for when I write them to the final file.

I'm not even sure I can read in the \x0A because everything on the input registers as '\n'. I could go character by character, but this destroys my logic to remove blank lines.

Answer 1

Accepted answer by Scott Wisniewski

I would recommend that you change your architecture to work more like a parser in a compiler.

You want to create a lexer that returns a sequence of tokens, and then a parser that reads the sequence of tokens and does stuff with them.

In your case the tokens would be:

Column data
Comma
End of Line

You would treat a '\n' ('\x0a') by itself as an embedded newline, and therefore include it as part of a column data token. A '\r\n' would constitute an End of Line token.

This has the advantages of:

Doing only 1 pass over the data
Only storing at most 1 line's worth of data
Reusing as much memory as possible (for the string builder and the list)
It's easy to change should your requirements change

Here's a sample of what the Lexer would look like:

Disclaimer: I haven't even compiled this code, let alone tested it, so you'll need to clean it up and make sure it works.

using System.Collections.Generic;
using System.IO;
using System.Text;

enum TokenType
{
    ColumnData,
    Comma,
    LineTerminator
}

class Token
{
    public TokenType Type { get; private set;}
    public string Data { get; private set;}

    public Token(TokenType type)
    {
        Type = type;
    }

    public Token(TokenType type, string data)
    {
        Type = type;
        Data = data;
    }
}

private IEnumerable<Token> GetTokens(TextReader s)
{
   var builder = new StringBuilder();

   while (s.Peek() >= 0)
   {
       var c = (char)s.Read();
       switch (c)
       {
           case ',':
           {
               if (builder.Length > 0)
               {
                   yield return new Token(TokenType.ColumnData, ExtractText(builder));
               }
               yield return new Token(TokenType.Comma);
               break;
           }
           case '\r':
           {
                var next = s.Peek();
                if (next == '\n')
                {
                    s.Read();
                }

                if (builder.Length > 0)
                {
                    yield return new Token(TokenType.ColumnData, ExtractText(builder));
                }
                yield return new Token(TokenType.LineTerminator);
                break;
           }
           default:
               builder.Append(c);
               break;
       }

   }

   if (builder.Length > 0)
   {
       yield return new Token(TokenType.ColumnData, ExtractText(builder));
   }
}

private string ExtractText(StringBuilder b)
{
    var ret = b.ToString();
    b.Remove(0, b.Length);
    return ret;
}

Your "parser" code would then look like this:

public void ConvertXLS(TextReader s)
{
    var columnData = new List<string>();
    bool lastWasColumnData = false;
    bool seenAnyData = false;

    foreach (var token in GetTokens(s))
    {
        switch (token.Type)
        {
            case TokenType.ColumnData:
            {
                 seenAnyData = true;
                 if (lastWasColumnData)
                 {
                     //TODO: do some error reporting
                 }
                 else
                 {
                     lastWasColumnData = true;
                     columnData.Add(token.Data);
                 }
                 break;
            }
            case TokenType.Comma:
            {
                if (!lastWasColumnData)
                {
                    columnData.Add(null);
                }
                lastWasColumnData = false;
                break;
            }
            case TokenType.LineTerminator:
            {
                // Only write the record if at least one column held data.
                if (seenAnyData)
                {
                    OutputLine(columnData);
                }
                seenAnyData = false;
                lastWasColumnData = false;
                columnData.Clear();
                break;
            }
        }
    }

    if (seenAnyData)
    {
        OutputLine(columnData);
    }
}
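
The parser above calls an OutputLine helper that the answer never defines. A minimal sketch of what it might look like follows, assuming the output goes to a TextWriter, that null entries stand for empty columns, and that records end with CRLF. The CsvOutput class name and the extra writer parameter are hypothetical and not part of the original answer.

```csharp
using System.Collections.Generic;
using System.IO;

public static class CsvOutput
{
    // Hypothetical helper: writes one record as comma-separated columns.
    // A null entry (an empty column in the parser above) is written as nothing.
    public static void OutputLine(TextWriter writer, List<string> columnData)
    {
        for (int i = 0; i < columnData.Count; i++)
        {
            if (i > 0)
            {
                writer.Write(',');
            }
            writer.Write(columnData[i] ?? "");
        }
        // End the record with CRLF explicitly, so that a bare '\n' kept
        // inside a column stays distinguishable from the record terminator.
        writer.Write("\r\n");
    }
}
```

Writing "\r\n" explicitly rather than calling WriteLine() keeps the record terminator independent of the platform's default newline.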

Answer 2

Answer by Jon Skeet

You can't change StreamReader to return the line terminators, and you can't change what it uses for line termination.

I'm not entirely clear about the problem in terms of what escaping you're doing, particularly in terms of "and write them as \x0A". A sample of the file would probably help.

It sounds like you may need to work character by character, or possibly load the whole file first and do a global replace, e.g.

x.Replace("\r\n", "\u0000") // Or some other unused character
 .Replace("\n", "\\x0A") // Or whatever escaping you need
 .Replace("\u0000", "\r\n") // Replace the real line breaks

I'm sure you could do that with a regex and it would probably be more efficient, but I find the long way easier to understand :) It's a bit of a hack having to do a global replace though - hopefully with more information we'll come up with a better solution.
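
The regex alternative mentioned above could look roughly like this. It is a sketch only: it assumes real record breaks are always "\r\n", and it uses the literal text \x0A as the escape purely for illustration. The NewlineEscaper class and method name are made up for this example.

```csharp
using System.Text.RegularExpressions;

public static class NewlineEscaper
{
    // Replaces every '\n' that is NOT preceded by '\r' with the four
    // literal characters \x0A, leaving real "\r\n" line breaks untouched.
    public static string EscapeBareLineFeeds(string text)
    {
        // (?<!\r) is a negative lookbehind: it matches only when the
        // previous character is not a carriage return.
        return Regex.Replace(text, @"(?<!\r)\n", @"\x0A");
    }
}
```

Like the chained Replace calls, this still needs the whole file in memory as one string, so it may not suit very large inputs.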

Answer 3

Answer by Tony Trozzo

Essentially, a hard-return in Excel (shift+enter or alt+enter, I can't remember) puts a newline that is equivalent to \x0A in the default encoding I use to write my CSV. When I write to CSV, I use StreamWriter.WriteLine(), which outputs the line plus a newline (which I believe is \r\n).

The CSV is fine and comes out exactly how Excel would save it. The problem is that when I read it into the blank record remover, I'm using ReadLine(), which treats an embedded newline inside a record the same way as a CRLF.

Here's an example of the file after I convert to CSV...

Reference,Name of Individual or Entity,Type,Name Type,Date of Birth,Place of Birth,Citizenship,Address,Additional Information,Listing Information,Control Date,Committees
1050,"Aziz Salih al-Numan
",Individual,Primary Name,1941 or 1945,An Nasiriyah,Iraqi,,Ba'th Party Regional Command Chairman; Former Governor of Karbala and An Najaf Former Minister of Agriculture and Agrarian Reform (1986-1987),Resolution 1483 (2003),6/27/2003,1518 (Iraq)
1050a,???? ???? ???????,Individual,Original script,1941 or 1945,An Nasiriyah,Iraqi,,Ba'th Party Regional Command Chairman; Former Governor of Karbala and An Najaf Former Minister of Agriculture and Agrarian Reform (1986-1987),Resolution 1483 (2003),6/27/2003,1518 (Iraq)

As you can see, the first record has an embedded new-line after al-Numan. When I use ReadLine(), I get '1050,"Aziz Salih al-Numan' and when I write that out, WriteLine() ends that line with a CRLF. I lose the original line terminator. When I use ReadLine() again, I get the line starting with '1050a'.

I could read the entire file in and replace them, but then I'd have to replace them back afterwards. Basically, what I want to do is get the line terminator to determine whether it's \x0A or a CRLF, and then if it's \x0A, I'll use Write() and insert that terminator.
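
One way to get at the terminator is to read character by character and report which terminator ended each line. The sketch below is one possible approach, not a built-in API: LineReader and TryReadLine are hypothetical names, and the code assumes only "\n", "\r" and "\r\n" count as terminators.

```csharp
using System.IO;
using System.Text;

public static class LineReader
{
    // Reads one line and reports the terminator that ended it:
    // "\n", "\r", "\r\n", or "" when the input ended without one.
    // Returns false once the input is exhausted and nothing was read.
    public static bool TryReadLine(TextReader reader, out string line, out string terminator)
    {
        var builder = new StringBuilder();
        terminator = "";
        int c;
        while ((c = reader.Read()) >= 0)
        {
            if (c == '\n')
            {
                terminator = "\n";
                break;
            }
            if (c == '\r')
            {
                // Consume the '\n' of a CRLF pair, if there is one.
                if (reader.Peek() == '\n')
                {
                    reader.Read();
                    terminator = "\r\n";
                }
                else
                {
                    terminator = "\r";
                }
                break;
            }
            builder.Append((char)c);
        }
        line = builder.ToString();
        return line.Length > 0 || terminator.Length > 0;
    }
}
```

The caller can then write line + terminator with Write() to preserve the original ending, or treat a "\n" terminator as meaning the record continues on the next line.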

Answer 4

Answer by curmil

I know I'm a little late to the game here, but I was having the same problem and my solution was a lot simpler than most given.

If you can determine the column count, which should be easy since the first line usually holds the column titles, you can check each line's column count against the expected count. If it doesn't match, you simply concatenate the current line with the previous unmatched lines. For example:

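string sep = "\",\"";
int columnCount = 0;
int lineCount = 0;
string currentLine;
string lastLine = null;
string[] lineData;
while ((currentLine = sr.ReadLine()) != null)
{
    if (lineCount == 0)
    {
        // The first line holds the column titles, so it defines the expected column count.
        lineData = currentLine.Split(new string[] { sep }, StringSplitOptions.None);
        columnCount = lineData.Length;
        ++lineCount;
        continue;
    }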
    string thisLine = lastLine + currentLine;

    lineData = thisLine.Split(new string[] { sep }, StringSplitOptions.None);
    if (lineData.Length < columnCount)
    {
        lastLine += currentLine;
        continue;
    }
    else
    {
        lastLine = null;
    }
    ......
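
Since the snippet above is cut off, here is a self-contained sketch of the same column-counting idea. It is an illustration under assumptions rather than curmil's actual code: RecordJoiner and ReadRecords are hypothetical names, the first line is assumed to be a complete header, and a "\n" is re-inserted between joined lines (the original simply concatenates them).

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class RecordJoiner
{
    // Yields logical records, joining physical lines until the number of
    // columns reaches the count found on the first (header) line.
    public static IEnumerable<string> ReadRecords(TextReader reader, string separator)
    {
        int expected = 0;
        string pending = null;
        string current;
        while ((current = reader.ReadLine()) != null)
        {
            // Glue the current line onto any unfinished record.
            string candidate = pending == null ? current : pending + "\n" + current;
            int columns = candidate.Split(new[] { separator }, StringSplitOptions.None).Length;
            if (expected == 0)
            {
                expected = columns; // the header defines the expected count
                yield return candidate;
                continue;
            }
            if (columns < expected)
            {
                pending = candidate; // record continues on the next line
                continue;
            }
            pending = null;
            yield return candidate;
        }
        if (pending != null)
        {
            yield return pending; // incomplete trailing record
        }
    }
}
```

This approach has limits: a separator inside a quoted field inflates the column count, and an embedded newline in the last column cannot be detected because the count is already complete when the line breaks.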

Answer 5

Answer by TheZachHill

Thank you so much! With your code and some others, I came up with the following solution. I have added a link at the bottom to some code I wrote that uses some of the logic from this page. I figured I'd give honor where honor was due. Thanks!

Below is an explanation of what I needed. I wrote this because I have some very large '|'-delimited files that have \r\n inside some of the columns, and I needed to use \r\n as the end-of-line delimiter. I was trying to import some files using SSIS packages, but because of some corrupted data in the files I was unable to. The file was over 5 GB, so it was too large to open and fix manually. I found the answer by looking through lots of forums to understand how streams work, and ended up with a solution that reads each character in a file and spits out the line based on the definitions I added to it. This is for use in a command-line application, complete with help :). I hope this helps some other people out; I haven't found a solution quite like it anywhere else, although the ideas were inspired by this forum and others.

https://stackoverflow.com/a/12640862/1582188
