C#: How to detect the encoding/codepage of a text file

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/90838/

Time: 2020-08-03 11:31:13  Source: igfitidea

How can I detect the encoding/codepage of a text file

Asked by GvS

In our application, we receive text files (.txt, .csv, etc.) from diverse sources. When reading, these files sometimes contain garbage, because the files were created in a different/unknown codepage.

Is there a way to (automatically) detect the codepage of a text file?

The detectEncodingFromByteOrderMarks parameter on the StreamReader constructor works for UTF-8 and other Unicode files marked with a BOM, but I'm looking for a way to detect code pages like ibm850 and windows1252.

Thanks for your answers, this is what I've done.

The files we receive are from end-users, who do not have a clue about codepages. The receivers are also end-users; by now, this is what they know about codepages: codepages exist, and are annoying.

Solution:

  • Open the received file in Notepad and look at a garbled piece of text. If somebody is called "Fran?ois" or something similar, with your human intelligence you can guess what the name should be.
  • I've created a small app that the user can use to open the file and enter a piece of text that they know will appear in the file when the correct codepage is used.
  • Loop through all codepages, and display the ones that produce the user-provided text.
  • If more than one codepage pops up, ask the user to specify more text.
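
The loop described in the steps above can be sketched in a few lines of C#. This is only an illustration, not the asker's actual app: the names CodePageFinder and FindCandidates are made up, and the sketch assumes .NET 5 or later, where legacy code pages such as ibm850 and windows-1252 only show up in Encoding.GetEncodings() after CodePagesEncodingProvider has been registered (on .NET Framework that line should not be needed and can be removed).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, named here for illustration only.
public static class CodePageFinder
{
    // Returns the names of all encodings under which the decoded
    // file bytes contain the text the user expects to see.
    public static List<string> FindCandidates(byte[] fileBytes, string expectedText)
    {
        // Assumption: .NET 5+; makes the legacy code pages available.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        var candidates = new List<string>();
        foreach (EncodingInfo info in Encoding.GetEncodings())
        {
            string decoded;
            try
            {
                decoded = info.GetEncoding().GetString(fileBytes);
            }
            catch (Exception)
            {
                // Skip encodings that are listed but cannot be used here.
                continue;
            }

            // Ordinal substring match against the user-provided text.
            if (decoded.Contains(expectedText))
                candidates.Add(info.Name);
        }
        return candidates;
    }
}
```

A call such as CodePageFinder.FindCandidates(File.ReadAllBytes(path), "François") will often return several names, because many Western code pages map the common accented letters to the same bytes. That is why the last step asks the user for more text.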

Accepted answer by JV.

You can't detect the codepage, you need to be told it. You can analyse the bytes and guess it, but that can give some bizarre (sometimes amusing) results. I can't find it now, but I'm sure Notepad can be tricked into displaying English text in Chinese.
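
A small sketch of why guessing can go wrong: the same bytes are equally valid under more than one encoding. One well-known instance (possibly the one this answer has in mind, though that is an assumption) is the ASCII phrase "Bush hid the facts", which old versions of Notepad guessed to be UTF-16 and displayed as Chinese characters. The class and method names below are made up for illustration.

```csharp
using System;
using System.Text;

// Hypothetical demo class, named here for illustration only.
public static class MisreadDemo
{
    // Encodes the text as ASCII, then decodes the same bytes as
    // little-endian UTF-16, the way a wrong guess would.
    public static string ReadAsciiAsUtf16(string text)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(text);
        return Encoding.Unicode.GetString(bytes);
    }
}
```

MisreadDemo.ReadAsciiAsUtf16("Bush hid the facts") turns the 18 ASCII bytes into nine CJK characters, because every pair of Latin letters happens to form a valid UTF-16 code unit in the CJK range.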

Anyway, this is what you need to read: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Specifically Joel says:

The Single Most Important Fact About Encodings

If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that "plain" text is ASCII. There Ain't No Such Thing As Plain Text.

If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.

Answer by leppie

The StreamReader class's constructor takes a 'detect encoding' parameter.
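
As a minimal sketch of that parameter in use: the StreamReader constructor accepts a fallback encoding plus a detectEncodingFromByteOrderMarks flag, and CurrentEncoding reports what was chosen once the first read has happened. The wrapper class BomSniffer is a made-up name, and the approach only recognizes files that actually start with a Unicode BOM; for anything else it simply returns the fallback.

```csharp
using System.IO;
using System.Text;

// Hypothetical wrapper, named here for illustration only.
public static class BomSniffer
{
    // Returns the encoding indicated by a BOM at the start of the stream,
    // or the supplied fallback when no BOM is present.
    // Note: disposing the reader also closes the stream.
    public static Encoding Detect(Stream stream, Encoding fallback)
    {
        using (var reader = new StreamReader(stream, fallback,
                                             detectEncodingFromByteOrderMarks: true))
        {
            // CurrentEncoding is only updated after the first read,
            // so peek at the first character before asking for it.
            reader.Peek();
            return reader.CurrentEncoding;
        }
    }
}
```

For example, BomSniffer.Detect(File.OpenRead(path), Encoding.GetEncoding(28591)) yields UTF-8 for a file with a UTF-8 BOM and Latin-1 (the fallback) for a file without any BOM.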

Answer by Tomer Gabel

If you're looking to detect non-UTF encodings (i.e. no BOM), you're basically down to heuristics and statistical analysis of the text. You might want to take a look at the Mozilla paper on universal charset detection (same link, with better formatting via Wayback Machine).

Answer by DeeCee

I had the same problem but haven't found a good solution yet for detecting it automatically. For now I'm using PsPad (www.pspad.com) for that ;) Works fine.

Answer by tzot

I've done something similar in Python. Basically, you need lots of sample data from various encodings, which are broken down by a sliding two-byte window and stored in a dictionary (hash), keyed on byte-pairs providing values of lists of encodings.

Given that dictionary (hash), you take your input text and:

  • if it starts with any BOM character ('\xfe\xff' for UTF-16-BE, '\xff\xfe' for UTF-16-LE, '\xef\xbb\xbf' for UTF-8 etc), I treat it as suggested
  • if not, then take a large enough sample of the text, take all byte-pairs of the sample and choose the encoding that is the least common suggested from the dictionary.

If you've also sampled UTF-encoded texts that do not start with any BOM, the second step will cover those that slipped through the first step.
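
A rough C# sketch of this byte-pair approach follows (the answer's own implementation was in Python and is not shown, so everything here is an assumption). In particular, the phrase "least common suggested" is ambiguous; this sketch assumes the intent is to pick the encoding suggested by the largest number of byte pairs in the input. The class name BytePairGuesser and its methods are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical class, named here for illustration only.
public class BytePairGuesser
{
    // Byte pair (packed into one int) -> encodings whose samples contained it.
    private readonly Dictionary<int, HashSet<string>> _pairs =
        new Dictionary<int, HashSet<string>>();

    // Slides a two-byte window over a sample known to be in encodingName.
    public void Train(string encodingName, byte[] sample)
    {
        for (int i = 0; i + 1 < sample.Length; i++)
        {
            int key = (sample[i] << 8) | sample[i + 1];
            if (!_pairs.TryGetValue(key, out var names))
                _pairs[key] = names = new HashSet<string>();
            names.Add(encodingName);
        }
    }

    // Returns the guessed encoding name, or null if nothing matched.
    public string Guess(byte[] input)
    {
        // Step 1: trust a BOM if there is one (UTF-32 BOMs are ignored here).
        if (StartsWith(input, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (StartsWith(input, 0xFE, 0xFF)) return "UTF-16-BE";
        if (StartsWith(input, 0xFF, 0xFE)) return "UTF-16-LE";

        // Step 2: every byte pair votes for the encodings it was seen in.
        var votes = new Dictionary<string, int>();
        for (int i = 0; i + 1 < input.Length; i++)
        {
            int key = (input[i] << 8) | input[i + 1];
            if (!_pairs.TryGetValue(key, out var names)) continue;
            foreach (string name in names)
            {
                votes.TryGetValue(name, out int count);
                votes[name] = count + 1;
            }
        }
        // Assumed interpretation: the encoding with the most votes wins.
        return votes.Count == 0
            ? null
            : votes.OrderByDescending(v => v.Value).First().Key;
    }

    private static bool StartsWith(byte[] data, params byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
            if (data[i] != prefix[i]) return false;
        return true;
    }
}
```

In practice Train would be called with large samples per encoding, as the answer stresses; with only a handful of bytes per encoding the votes are close to meaningless.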

So far, it works for me (the sample data and subsequent input data are subtitles in various languages) with diminishing error rates.

Answer by shoosh

You can't detect the codepage

This is clearly false. Every web browser has some kind of universal charset detector to deal with pages which have no indication whatsoever of an encoding. Firefox has one. You can download the code and see how it does it. See some documentation here. Basically, it is a heuristic, but one that works really well.

Given a reasonable amount of text, it is even possible to detect the language.

Here's another one I just found using Google:

Answer by devstuff

Since it basically comes down to heuristics, it may help to use the encoding of previously received files from the same source as a first hint.

Most people (or applications) do stuff in pretty much the same order every time, often on the same machine, so it's quite likely that when Bob creates a .csv file and sends it to Mary it'll always be using Windows-1252 or whatever his machine defaults to.

Where possible a bit of customer training never hurts either :-)

Answer by hegearon

Notepad++ has this feature out-of-the-box. It also supports changing the encoding.

Answer by Intraday Tips

I was actually looking for a generic, non-programming way of detecting the file encoding, but I haven't found one yet. What I did find by testing with different encodings was that my text was UTF-7.

So where I was first doing: StreamReader file = File.OpenText(fullfilename);

I had to change it to: StreamReader file = new StreamReader(fullfilename, System.Text.Encoding.UTF7);

OpenText assumes it's UTF-8.

You can also create the StreamReader like this: new StreamReader(fullfilename, true). The second parameter means that it should try to detect the encoding from the byte order mark of the file, but that didn't work in my case.

Answer by Tao

I know it's very late for this question and this solution won't appeal to some (because of its English-centric bias and its lack of statistical/empirical testing), but it's worked very well for me, especially for processing uploaded CSV data:

http://www.architectshack.com/TextFileEncodingDetector.ashx

Advantages:

  • BOM detection built in
  • Default/fallback encoding customizable
  • Pretty reliable (in my experience) for Western-European-based files containing some exotic data (e.g. French names) with a mixture of UTF-8 and Latin-1-style files, basically the bulk of US and Western European environments.

Note: I'm the one who wrote this class, so obviously take it with a grain of salt! :)
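
For readers who want a feel for the general idea without following the link, here is a much simpler sketch of "BOM first, then a UTF-8 validity check, then a configurable fallback". It is not the code of the linked class; the name SimpleEncodingDetector and its members are made up, and the heuristic is an assumption about what works for the kind of Western European data this answer describes.

```csharp
using System;
using System.Text;

// Hypothetical class, named here for illustration only;
// NOT the TextFileEncodingDetector class linked above.
public static class SimpleEncodingDetector
{
    public static Encoding Detect(byte[] data, Encoding fallback)
    {
        // 1. A BOM, when present, settles the question.
        //    UTF-32 must be checked before UTF-16: its BOM starts with FF FE too.
        if (HasPrefix(data, 0xEF, 0xBB, 0xBF)) return Encoding.UTF8;
        if (HasPrefix(data, 0xFF, 0xFE, 0x00, 0x00)) return Encoding.UTF32;
        if (HasPrefix(data, 0x00, 0x00, 0xFE, 0xFF))
            return new UTF32Encoding(bigEndian: true, byteOrderMark: true);
        if (HasPrefix(data, 0xFE, 0xFF)) return Encoding.BigEndianUnicode;
        if (HasPrefix(data, 0xFF, 0xFE)) return Encoding.Unicode;

        // 2. No BOM: text in a legacy code page that contains accented
        //    letters is very rarely valid UTF-8, so a strict decode is a
        //    useful test. Pure ASCII also passes, which is harmless.
        try
        {
            new UTF8Encoding(false, throwOnInvalidBytes: true).GetString(data);
            return Encoding.UTF8;
        }
        catch (DecoderFallbackException)
        {
            // 3. Not valid UTF-8: use the caller's default.
            return fallback;
        }
    }

    private static bool HasPrefix(byte[] data, params byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
            if (data[i] != prefix[i]) return false;
        return true;
    }
}
```

A typical call would pass Encoding.GetEncoding(1252) as the fallback; on .NET Core and .NET 5+ that requires registering CodePagesEncodingProvider first, while Encoding.GetEncoding(28591) (Latin-1) works without it.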
