C# CSV Parsing

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/316649/

Time: 2020-08-03 22:56:00  Source: igfitidea

CSV Parsing

Tags: c#, parsing, csv

Asked by

I am trying to parse CSV in C#. I used a regular expression to find "," and read the strings if my header count was equal to my match count.

Now this will not work if I have a value like:

"a",""b","x","y"","c"

then my output is:

'a'
'"b'
'x'
'y"'
'c'

but what I want is:

'a'
'"b","x","y"'
'c'

Is there any regex or any other logic I can use for this?

Answered by gtd

In order to have a parseable CSV file, any double quotes inside a value need to be properly escaped somehow. The two standard ways to do this are by representing a double quote either as two double quotes back to back, or a backslash double quote. That is one of the following two forms:

""

\"

In the second form your initial string would look like this:

"a","\"b\",\"x\",\"y\"","c"

If your input string does not follow some rigorous format like this, then you have very little chance of parsing it successfully in an automated environment.
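
To illustrate why consistent escaping makes the input parseable, here is a minimal sketch of a line parser that accepts either escape form. The name ParseEscapedCsvLine is hypothetical and not from the answer. The sketch assumes one record per line with no line breaks inside values, and it is written as C# 9+ top-level statements.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper (not from the answer): splits one CSV line whose quoted
// values escape embedded quotes either as "" or as \".
List<string> ParseEscapedCsvLine(string line)
{
    var fields = new List<string>();
    var current = new StringBuilder();
    bool inQuotes = false;

    for (int i = 0; i < line.Length; i++)
    {
        char c = line[i];
        bool nextIsQuote = i + 1 < line.Length && line[i + 1] == '"';

        if (inQuotes)
        {
            if ((c == '"' || c == '\\') && nextIsQuote)
            {
                current.Append('"'); // escaped quote: keep one literal quote
                i++;                 // skip the second character of the escape
            }
            else if (c == '"')
            {
                inQuotes = false;    // closing quote of the value
            }
            else
            {
                current.Append(c);   // ordinary character, commas included
            }
        }
        else if (c == '"')
        {
            inQuotes = true;         // opening quote of a value
        }
        else if (c == ',')
        {
            fields.Add(current.ToString()); // a comma outside quotes ends the value
            current.Clear();
        }
        else
        {
            current.Append(c);
        }
    }

    fields.Add(current.ToString()); // last value (may be empty)
    return fields;
}

// The value from the question, rewritten in each of the two escaped forms.
string doubled = "\"a\",\"\"\"b\"\",\"\"x\"\",\"\"y\"\"\",\"c\"";
string backslashed = "\"a\",\"\\\"b\\\",\\\"x\\\",\\\"y\\\"\",\"c\"";

foreach (string field in ParseEscapedCsvLine(doubled))
    Console.WriteLine(field);
foreach (string field in ParseEscapedCsvLine(backslashed))
    Console.WriteLine(field);
```

Both inputs should come out as a, "b","x","y" and c. Accepting both escape forms at once is a simplification: a value that ends in a literal backslash would be misread, so a real parser would normally commit to one form.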

Answered by Adam Davis

Well, I'm no regex wiz, but I'm certain they have an answer for this.

Procedurally, you go through the string character by character. Set a variable, say dontMatch, to FALSE.

Each time you run into a quote, toggle dontMatch.

Each time you run into a comma, check dontMatch. If it's TRUE, ignore the comma. If it's FALSE, split at the comma.

This works for the example you give, but the logic you use for quotation marks is fundamentally faulty - you must escape them or use another delimiter (single quotes, for instance) to set major quotations apart from minor quotations.

For instance,

"a", ""b", ""c", "d"", "e""

will yield bad results.
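
The toggle described above can be sketched roughly as follows. SplitWithToggle is a hypothetical name, and the code is written as C# 9+ top-level statements; it is only an illustration of the procedure, not a robust parser.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper illustrating the dontMatch toggle described above.
List<string> SplitWithToggle(string line)
{
    var parts = new List<string>();
    bool dontMatch = false; // true while between an opening and a closing quote
    int start = 0;          // where the current part begins

    for (int i = 0; i < line.Length; i++)
    {
        if (line[i] == '"')
        {
            dontMatch = !dontMatch; // every quote flips the flag
        }
        else if (line[i] == ',' && !dontMatch)
        {
            parts.Add(line.Substring(start, i - start)); // split at this comma
            start = i + 1;
        }
    }

    parts.Add(line.Substring(start)); // whatever follows the last split
    return parts;
}

// The value from the question: three parts, with the quotes still attached.
foreach (string part in SplitWithToggle("\"a\",\"\"b\",\"x\",\"y\"\",\"c\""))
    Console.WriteLine(part);

// The second example: the splits land in the wrong places.
foreach (string part in SplitWithToggle("\"a\", \"\"b\", \"\"c\", \"d\"\", \"e\"\""))
    Console.WriteLine(part);
```

For the question's value the middle part comes out as ""b","x","y"", which is the wanted value plus its enclosing quotes. For the second value the toggle groups b with c and d with e, which is the bad result mentioned above.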

This can be fixed with another patch. Rather than simply keeping a true/false flag, you have to match quotes.

To match quotes you have to know what was last seen, which gets into pretty deep parsing territory. You'll probably, at that point, want to make sure your language is designed well, and if it is you can use a compiler tool to create a parser for you.

-Adam

Answered by Tomalak

If all your values are guaranteed to be in quotes, look for values, not for commas:

("".*?""|"[^"]*")

This takes advantage of the fact that "the earliest longest match wins" - it looks for double quoted values first, and with a lower priority for normal quoted values.

If you don't want the enclosing quote to be part of the match, use:

"(".*?"|[^"]*)"

and go for the value in match group 1.

As I said: the prerequisite for this to work is well-formed input with guaranteed quotes or double quotes around each value. Empty values must be quoted as well! A nice side effect is that it does not care about the separator character. Commas, TABs, semicolons, spaces, you name it. All will work.
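
As a small sketch of how the second pattern might be applied in C#, the following uses Regex.Matches and reads match group 1. ExtractQuotedValues is a hypothetical wrapper name, and the code is written as C# 9+ top-level statements.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the second pattern above: "(".*?"|[^"]*)"
List<string> ExtractQuotedValues(string line)
{
    var values = new List<string>();
    foreach (Match m in Regex.Matches(line, "\"(\".*?\"|[^\"]*)\""))
        values.Add(m.Groups[1].Value); // group 1 excludes the enclosing quotes
    return values;
}

// The value from the question.
foreach (string value in ExtractQuotedValues("\"a\",\"\"b\",\"x\",\"y\"\",\"c\""))
    Console.WriteLine(value);
// Prints:
// a
// "b","x","y"
// c

// The separator is never inspected, so semicolons behave the same way.
foreach (string value in ExtractQuotedValues("\"1\";\"two words\";\"3\""))
    Console.WriteLine(value);
```

The sketch relies on the same prerequisite as the pattern itself: every value must be quoted, and the input must be well-formed.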

Answered by Marc Gravell

CSV can get trickier than you might think once you deal with things like multi-line values, quoted values, different delimiters*, etc. Perhaps consider a pre-rolled answer? I use this, and it works very well.

*=remember that some locales use [tab] as the C in CSV...
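
To give an idea of what a pre-rolled parser looks like in use, here is a sketch based on TextFieldParser from the Microsoft.VisualBasic.FileIO namespace that ships with .NET. This is an assumption chosen for illustration: it is not the library the answer links to. ReadFirstRecord is a hypothetical helper name, and the code is written as C# 9+ top-level statements.

```csharp
using System;
using System.IO;
using Microsoft.VisualBasic.FileIO;

// Assumption: TextFieldParser stands in for "a pre-rolled answer";
// it is not the library the answer refers to.
string[] ReadFirstRecord(string text, string delimiter)
{
    using var parser = new TextFieldParser(new StringReader(text));
    parser.TextFieldType = FieldType.Delimited;
    parser.SetDelimiters(delimiter);         // the delimiter is configurable
    parser.HasFieldsEnclosedInQuotes = true; // delimiters inside quotes are kept
    return parser.ReadFields() ?? Array.Empty<string>();
}

// A comma inside a quoted value does not split the value.
foreach (string field in ReadFirstRecord("\"a\",\"b,c\",\"d\"", ","))
    Console.WriteLine(field);

// The same call with a different delimiter.
foreach (string field in ReadFirstRecord("1;\"two;parts\";3", ";"))
    Console.WriteLine(field);
```

Because the delimiter is just an argument, a file that uses another separator only needs a different second argument.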

Answered by mlarsen

FileHelpers for .NET is your friend.

Answered by Bevan

There's an oft-quoted saying:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. (Jamie Zawinski)

Given that there's no official standard for CSV files (instead there are a large number of slightly incompatible styles), you need to make sure that what you implement suits the files you will be receiving. No point in implementing anything fancier than what you need - and I'm pretty sure you don't need Regular Expressions.

Here's my stab at a simple method to extract the terms - basically, it loops through the line looking for commas, keeping track of whether the current index is within a string or not:

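    // Requires: using System.Collections.Generic;
    // Splits one CSV line at every comma that is not inside double quotes.
    // The quotes themselves are left in the returned values.
    public IEnumerable<string> SplitCSV(string line)
    {
        int index = 0;          // position of the character being examined
        int start = 0;          // position where the current value begins
        bool inString = false;  // true while between an opening and a closing quote

        foreach (char c in line)
        {
            switch (c)
            {
                case '"':
                    // Every quote toggles the state.
                    inString = !inString;
                    break;

                case ',':
                    if (!inString)
                    {
                        yield return line.Substring(start, index - start);
                        start = index + 1;
                    }
                    break;
            }
            index++;
        }

        // Return whatever follows the last separator; an empty final value is skipped.
        if (start < index)
            yield return line.Substring(start, index - start);
    }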

Standard caveat - untested code, there may be off-by-one errors.

Limitations

  • The quotes around a value aren't removed automatically.
    To do this, add a check just before the yield return statement near the end.

  • Single quotes aren't supported in the same way as double quotes.
    You could add a separate boolean inSingleQuotedString, renaming the existing boolean to inDoubleQuotedString and treating both the same way. (You can't make the existing boolean do double work because you need the string to end with the same quote that started it.)

  • Whitespace isn't automatically removed.
    Some tools introduce whitespace around the commas in CSV files to "prettify" the file; it then becomes difficult to tell intentional whitespace from formatting whitespace.
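
The first and third limitations could be handled by post-processing each returned value. The sketch below shows one possible approach; StripEnclosingQuotes and CleanField are hypothetical helper names, and treating whitespace outside the quotes as formatting is an assumption. It is written as C# 9+ top-level statements.

```csharp
using System;

// Hypothetical helper: removes one pair of enclosing double quotes, if present.
string StripEnclosingQuotes(string field)
{
    bool quoted = field.Length >= 2
                  && field[0] == '"'
                  && field[field.Length - 1] == '"';
    return quoted ? field.Substring(1, field.Length - 2) : field;
}

// Hypothetical helper: trims whitespace outside the quotes (assumed to be
// formatting), then strips the quotes. Whitespace inside the quotes is kept.
string CleanField(string field) => StripEnclosingQuotes(field.Trim());

Console.WriteLine(CleanField("\"a\""));                      // a
Console.WriteLine(CleanField("plain"));                      // plain
Console.WriteLine("[" + CleanField("  \" padded \"  ") + "]"); // [ padded ]
```

Only the outermost pair of quotes is removed, so a value such as ""b","x","y"" keeps its inner quotes.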

Answered by Bevan

See the link "Regex fun with CSV" at:

http://snippets.dzone.com/posts/show/4430

Answered by Chris S

The Lumenworks CSV parser (open source, free, but needs a CodeProject login) is by far the best one I've used. It'll save you having to write the regex and is intuitive to use.

Answered by saku

I would use FileHelpers if I were you. Regular expressions are fine but hard to read, especially if you go back after a while for a quick fix.

Just for the sake of exercising my mind, here is a quick & dirty working C# procedure:

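using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Splits a line in which every value is quoted. Pieces with fewer than two
// quotes (for example unquoted values) match none of the cases and are skipped.
public static List<string> SplitCSV(string line)
{
    if (string.IsNullOrEmpty(line))
        throw new ArgumentException();

    List<string> result = new List<string>();

    bool inQuote = false;
    StringBuilder val = new StringBuilder();

    // parse line
    foreach (var t in line.Split(','))
    {
        int count = t.Count(c => c == '"');

        // more than two quotes: this piece opens a value that contains commas
        if (count > 2 && !inQuote)
        {
            inQuote = true;
            val.Append(t);
            val.Append(',');
            continue;
        }

        // more than two quotes while collecting: this piece closes the value
        if (count > 2 && inQuote)
        {
            inQuote = false;
            val.Append(t);
            result.Add(val.ToString());
            val.Clear(); // reset the buffer so the next value starts empty
            continue;
        }

        // exactly two quotes: a complete, simple quoted value
        if (count == 2 && !inQuote)
        {
            result.Add(t);
            continue;
        }

        // exactly two quotes while collecting: a piece in the middle of a value
        if (count == 2 && inQuote)
        {
            val.Append(t);
            val.Append(',');
            continue;
        }
    }

    // remove quotation
    for (int i = 0; i < result.Count; i++)
    {
        string t = result[i];
        result[i] = t.Substring(1, t.Length - 2);
    }

    return result;
}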

Answered by saku

I have just tried your regular expression in my code. It works fine for formatted text with quotes.

But I am wondering whether we can parse the value below with a regex:

"First_Bat7679",""NAME","ENAME","FILE"","","","From: "DDD,_Ala%as"@sib.com"

I am looking for this result:

'First_Bat7679'
'"NAME","ENAME","FILE"'
''
''
'From: "DDD,_Ala%as"@sib.com'

Thanks