C#: How should I detect which delimiter is used in a text file?

Notice: this page is a Chinese-English translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow. Original URL: http://stackoverflow.com/questions/761932/

Time: 2020-08-05 00:13:03  Source: igfitidea

How should I detect which delimiter is used in a text file?

Tags: c#, asp.net, csv, text-parsing

Asked by samiz

I need to be able to parse both CSV and TSV files. I can't rely on the users to know the difference, so I would like to avoid asking the user to select the type. Is there a simple way to detect which delimiter is in use?

One way would be to read in every line and count both tabs and commas and find out which is most consistently used in every line. Of course, the data could include commas or tabs, so that may be easier said than done.

Edit: Another fun aspect of this project is that I will also need to detect the schema of the file when I read it in, because it could be one of many. This means that I won't know how many fields I have until I can parse it.

Accepted answer by dommer

You could show them the results in a preview window, similar to the way Excel does it. It's pretty clear when the wrong delimiter is being used in that case. You could then allow them to select from a range of delimiters and have the preview update in real time.

Then you could just make a simple guess as to the delimiter to start with (e.g. does a comma or a tab come first).
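
A minimal sketch of that starting guess plus a preview, written in Python for brevity (the same logic ports directly to C#). The function names `first_guess` and `preview` are hypothetical, and the preview splits naively without handling quoted fields.

```python
def first_guess(text, candidates=(",", "\t")):
    """Return whichever candidate delimiter appears earliest in text, or None."""
    positions = {d: text.find(d) for d in candidates}
    found = {d: pos for d, pos in positions.items() if pos != -1}
    return min(found, key=found.get) if found else None


def preview(text, delimiter, max_rows=5):
    """Split the first few lines on the delimiter so the user can eyeball the result."""
    return [line.split(delimiter) for line in text.splitlines()[:max_rows]]


sample = "name\tcity, country\nAlice\tParis, France\n"
guess = first_guess(sample)
for row in preview(sample, guess):
    print(row)
```

If the preview looks wrong, the user picks another delimiter and `preview` is simply called again with it.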

Answer by Jon Skeet

Do you know how many fields should be present per line? If so, I'd read the first few lines of the file and check based on that.

In my experience, "normal" data quite often contains commas but rarely contains tab characters. This would suggest that you should check for a consistent number of tabs in the first few lines, and go with that choice as a preferred guess. Of course, it depends on exactly what data you've got.
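
One way to read that suggestion, as a rough Python sketch: look at the first few non-empty lines, prefer tabs if every line has the same non-zero number of them, and only then consider commas. The names `consistent_count` and `guess_from_head` and the default of five lines are assumptions made for illustration.

```python
def consistent_count(lines, delimiter):
    """Return the per-line count if it is identical and non-zero on every line, else None."""
    counts = {line.count(delimiter) for line in lines}
    if len(counts) == 1:
        (n,) = counts
        return n if n > 0 else None
    return None


def guess_from_head(text, head=5):
    """Check the first few non-empty lines, trying tab before comma."""
    lines = [ln for ln in text.splitlines() if ln][:head]
    if not lines:
        return None
    for delimiter in ("\t", ","):  # tab first: it rarely occurs inside real data
        if consistent_count(lines, delimiter):
            return delimiter
    return None
```

A file that is consistent for both characters comes back as tab-delimited here, which matches the "preferred guess" idea but remains only a best effort.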

Ultimately, it would be quite possible to have a file which is completely valid for both formats - so you can't make it absolutely foolproof. It'll have to be a "best effort" job.

Answer by Humphrey Bogart

There is no "efficient" way.

Answer by Noldorin

I'd imagine that your suggested solution would be the best way to go. In a well-formed CSV or TSV file, the number of commas or tabs respectively per line should be constant (no variation at all). Do a count of each for every line of the file, and check which one is constant for all lines. It would seem quite unlikely that the counts of both delimiters are constant across all lines, but in this inconceivably rare case, you could of course prompt the user.

If neither the number of tabs nor the number of commas is constant, display a message telling the user that the file is malformed, but that the program thinks it is a CSV or TSV file (whichever format has the lowest standard deviation of delimiters per line).
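
The approach above could be sketched as follows in Python. The function name `classify` and its return shape (delimiter, confident flag) are hypothetical; the flag is False whenever the caller should warn or prompt the user.

```python
from statistics import pstdev


def classify(lines, candidates=(",", "\t")):
    """Return (delimiter, confident) based on per-line delimiter counts."""
    lines = [ln for ln in lines if ln]
    spread = {}
    for d in candidates:
        counts = [ln.count(d) for ln in lines]
        if any(counts):  # ignore characters that never occur
            spread[d] = pstdev(counts)
    if not spread:
        return None, False
    constant = [d for d, s in spread.items() if s == 0]
    if len(constant) == 1:
        return constant[0], True
    # Either no count is constant (malformed file) or several are (ambiguous):
    # fall back to the lowest standard deviation and flag the result as uncertain.
    return min(spread, key=spread.get), False
```

Note that this counts raw characters, so a quoted field containing a comma would still make a valid CSV file look malformed.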

Answer by Jeff Yates

Assuming that there is a fixed number of fields per line and that any commas or tabs within values are enclosed by quotes ("), you should be able to work it out from the frequency of each character in each line. If the number of fields isn't fixed, this is harder, and if quotes aren't used to enclose otherwise delimiting characters, it will be, I suspect, near impossible (and depending on the data, locale-specific).
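
Under the quoting assumption above, the per-line frequency can ignore anything inside quotes. A small Python sketch (the helper name `count_outside_quotes` is made up here, and it assumes quoted fields never span multiple lines):

```python
def count_outside_quotes(line, delimiter, quote='"'):
    """Count occurrences of delimiter that are not inside a quoted field."""
    in_quotes = False
    count = 0
    for ch in line:
        if ch == quote:
            # An escaped quote ("") toggles twice, so the state is unchanged afterwards.
            in_quotes = not in_quotes
        elif ch == delimiter and not in_quotes:
            count += 1
    return count


print(count_outside_quotes('1,"Smith, John",42', ","))
```

Feeding these counts into a per-line consistency check avoids being misled by delimiters that are part of the data.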

Answer by Reed Copsey

In my experience, data rarely contains tabs, so a line of tab delimited fields would (generally) be fairly obvious.

Commas are more difficult, though, especially if you're reading data from non-US locales. Numerical data can contain huge numbers of commas if the files were generated in another country, since many locales use the comma as the decimal separator in floating point numbers.

In the end, the only safe thing, though, is usually to try, then present it to the user and allow them to adjust, especially if your data will contain commas and/or tabs.

Answer by rmeador

I would assume that in normal text, tabs are very rare except as the first character(s) on a line -- think indented paragraphs or source code. I think if you find embedded tabs (i.e. ones that don't follow commas), you can assume that the tabs are being used as the delimiters and be correct most of the time. This is just a hunch, not verified with any research. I'd of course give the user the option to override the auto-calculated mode.
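
A possible Python reading of this hunch: treat a tab as "embedded" when it directly follows a character that is neither a comma nor whitespace, which excludes leading indentation and tabs used as padding after a comma. The function name `looks_tab_delimited` and the 50% threshold are assumptions, not part of the answer.

```python
import re

# A tab preceded by a character that is neither a comma nor whitespace.
_EMBEDDED_TAB = re.compile(r"[^,\s]\t")


def looks_tab_delimited(lines, min_share=0.5):
    """True if at least min_share of the non-blank lines contain an embedded tab."""
    lines = [ln for ln in lines if ln.strip()]
    if not lines:
        return False
    hits = sum(1 for ln in lines if _EMBEDDED_TAB.search(ln))
    return hits / len(lines) >= min_share
```

As the answer says, this is a heuristic, so the result should only pre-select a mode that the user can override.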

Answer by Andrew Ensley

Just read a few lines, count the number of commas and the number of tabs and compare them. If there are 20 commas and no tabs, it's CSV. If there are 20 tabs and 2 commas (maybe in the data), it's TSV.
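
In Python this comparison is only a few lines. The name `guess_by_totals` and the five-line sample size are illustrative choices; a tie returns None so the caller can fall back to asking the user.

```python
def guess_by_totals(text, sample_lines=5):
    """Compare total comma and tab counts over the first few lines."""
    sample = "\n".join(text.splitlines()[:sample_lines])
    commas, tabs = sample.count(","), sample.count("\t")
    if commas == tabs:
        return None  # no evidence either way (including an empty file)
    return "," if commas > tabs else "\t"
```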

Answer by Chris Brandsma

Assuming you have a standard set of columns you are going to expect...

I would use FileHelpers (an open source project on SourceForge): http://filehelpers.sourceforge.net/

Define two reader templates, one for commas, one for tabs.

If the first one fails, try the second.
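
The same try-then-fall-back idea, sketched with Python's standard csv module rather than FileHelpers (whose API is not shown here). The functions `parse_with` and `parse_csv_or_tsv` are hypothetical, and "fails" is taken to mean that some row does not have the expected number of columns.

```python
import csv
import io


def parse_with(text, delimiter, expected_columns):
    """Parse text with one delimiter; raise ValueError if the shape is wrong."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    rows = [row for row in reader if row]  # skip blank lines
    if not rows or any(len(row) != expected_columns for row in rows):
        raise ValueError(f"not {expected_columns} columns with {delimiter!r}")
    return rows


def parse_csv_or_tsv(text, expected_columns):
    """Try the comma reader first, then the tab reader."""
    try:
        return parse_with(text, ",", expected_columns)
    except ValueError:
        return parse_with(text, "\t", expected_columns)
```

If both attempts fail, the second ValueError propagates to the caller, which is the point at which to report a malformed file.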

Answer by Tim Pietzcker

In Python, there is a Sniffer class in the csv module that can be used to guess a given file's delimiter and quote characters. Its strategy is (quoted from csv.py's docstrings):



[First, look] for text enclosed between two identical quotes (the probable quotechar) which are preceded and followed by the same character (the probable delimiter). For example:

         ,'some text',

The quote with the most wins, same with the delimiter. If there is no quotechar the delimiter can't be determined this way.

In that case, try the following:

The delimiter should occur the same number of times on each row. However, due to malformed data, it may not. We don't want an all or nothing approach, so we allow for small variations in this number.

  1. build a table of the frequency of each character on every line.
  2. build a table of frequencies of this frequency (meta-frequency?), e.g. 'x occurred 5 times in 10 rows, 6 times in 1000 rows, 7 times in 2 rows'
  3. use the mode of the meta-frequency to determine the expected frequency for that character
  4. find out how often the character actually meets that goal
  5. the character that best meets its goal is the delimiter
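
The five steps above can be sketched in a simplified form. This is not the csv.py source: it skips the quote-based first pass and the chunking, only considers a fixed candidate set, and scores a character by the share of lines that hit its most common count. The name `guess_delimiter` is hypothetical.

```python
from collections import Counter


def guess_delimiter(lines, candidates=",\t;"):
    """Return (delimiter, score), where score is the share of lines matching the mode."""
    lines = [ln for ln in lines if ln]
    if not lines:
        return None, 0.0
    best, best_score = None, 0.0
    for ch in candidates:
        per_line = [ln.count(ch) for ln in lines]   # step 1: frequency on every line
        meta = Counter(per_line)                    # step 2: frequency of each frequency
        expected, hits = meta.most_common(1)[0]     # step 3: the mode is the expected count
        if expected == 0:
            continue                                # mostly absent: not a delimiter
        score = hits / len(lines)                   # step 4: how often the goal is met
        if score > best_score:                      # step 5: best match wins
            best, best_score = ch, score
    return best, best_score
```

A caller could accept the guess only above some score (the quoted docstring speaks of "small variations") and otherwise ask the user.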

For performance reasons, the data is evaluated in chunks, so it can try and evaluate the smallest portion of the data possible, evaluating additional chunks as necessary.



I'm not going to quote the source code here - it's in the Lib directory of every Python installation.
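
For reference, using the class looks roughly like this. `csv.Sniffer().sniff()` and its `delimiters` argument are part of the standard library; the wrapper `sniff_rows` and the sample data are made up for illustration.

```python
import csv
import io


def sniff_rows(text, candidates=",\t"):
    """Guess the dialect from the text, then parse the text with it."""
    dialect = csv.Sniffer().sniff(text, delimiters=candidates)
    rows = list(csv.reader(io.StringIO(text, newline=""), dialect))
    return dialect.delimiter, rows


sample = "name\tage\tcity\nAlice\t30\tParis\nBob\t25\tRome\n"
delimiter, rows = sniff_rows(sample)
print(repr(delimiter), rows[0])
```

`sniff()` raises `csv.Error` when it cannot decide, so real code would catch that and fall back to asking the user.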

Remember that CSV can also use semicolons instead of commas as delimiters (e.g. in German versions of Excel, CSVs are semicolon-delimited because commas are used as decimal separators in Germany...)
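
With the Sniffer, covering that case is a matter of adding the semicolon to the candidate set. A small sketch, where the wrapper name `sniff_delimiter` and the sample (decimal commas inside semicolon-separated fields) are assumptions for illustration:

```python
import csv


def sniff_delimiter(sample, candidates=",;\t"):
    """Return the sniffed delimiter, or None if the Sniffer cannot decide."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        return None


german_style = "name;price;qty\nwidget;1,50;3\ngadget;2,75;10\n"
print(sniff_delimiter(german_style))
```

Here the semicolon appears the same number of times on every line while the comma does not, which is what lets the frequency-based strategy tell them apart.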