C# StreamReader and foreign characters

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/591273/

Date: 2020-08-04 09:28:41  Source: igfitidea

StreamReader and foreign characters

c# encoding

Asked by

Which encoding should I use to read accented characters such as ü?

Answer by leppie

Encoding.UTF8 or Encoding.Unicode.

The StreamReader class has a bool parameter in its constructor (detectEncodingFromByteOrderMarks) that allows it to auto-detect the encoding from a byte order mark at the start of the data.
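As a minimal sketch of that constructor flag, the example below feeds StreamReader a UTF-16 payload that begins with a byte order mark while naming UTF-8 as the fallback. The class and method names (BomDetectionDemo, ReadWithDetection) are made up for illustration; only the StreamReader(Stream, Encoding, bool) overload and CurrentEncoding come from the framework.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomDetectionDemo
{
    // Reads all text, letting StreamReader switch encodings if it finds a BOM.
    public static string ReadWithDetection(Stream stream, out Encoding detected)
    {
        // UTF-8 is only the fallback here; 'true' turns on BOM detection.
        using (var reader = new StreamReader(stream, Encoding.UTF8, true))
        {
            string text = reader.ReadToEnd();
            // CurrentEncoding is only meaningful after the first read.
            detected = reader.CurrentEncoding;
            return text;
        }
    }

    public static void Main()
    {
        // Build a little-endian UTF-16 payload that starts with its BOM (FF FE).
        byte[] bom = Encoding.Unicode.GetPreamble();
        byte[] body = Encoding.Unicode.GetBytes("f\u00FCr"); // "für"
        byte[] data = new byte[bom.Length + body.Length];
        bom.CopyTo(data, 0);
        body.CopyTo(data, bom.Length);

        string text = ReadWithDetection(new MemoryStream(data), out Encoding detected);
        Console.WriteLine(text + " was read as " + detected.WebName);
    }
}
```

Detection of this kind only helps when the file actually carries a BOM; without one, the reader stays on the encoding passed to the constructor.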

Answer by cwap

Unicode => UTF-8/UTF-16 ? :)

Answer by Jon Skeet

You should use whatever the encoding of the original data is. Where are you getting the data from, and do you have information as to which encoding it's in? If you try to read it with the wrong encoding, you'll get the wrong answer: even if your encoding can handle the characters, it's going to misinterpret the binary data.
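To make the misinterpretation concrete, here is a small sketch that encodes "ü" as UTF-8 and then decodes the same bytes as ISO-8859-1. Both encodings can represent "ü", yet the wrong one turns the two UTF-8 bytes into two unrelated characters. The class and method names are illustrative only.

```csharp
using System;
using System.Text;

public static class WrongEncodingDemo
{
    // Decodes the same bytes under whichever encoding the caller claims.
    public static string DecodeAs(byte[] data, Encoding encoding)
    {
        return encoding.GetString(data);
    }

    public static void Main()
    {
        // "ü" (U+00FC) is the two bytes C3 BC in UTF-8.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes("\u00FC");
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        string right = DecodeAs(utf8Bytes, Encoding.UTF8); // "ü"
        string wrong = DecodeAs(utf8Bytes, latin1);        // "Ã¼": one char per byte

        Console.WriteLine(right.Length); // 1
        Console.WriteLine(wrong.Length); // 2
    }
}
```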

If you get to pick the encoding, then UTF-8 is usually a good bet. It's bad in terms of size if you've got a lot of far eastern characters, but otherwise good. In particular, ASCII still comes out at one byte per character.
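The size trade-off can be checked directly with GetByteCount. This is a rough sketch with made-up names (EncodingSizeDemo, Sizes); the sample strings are arbitrary.

```csharp
using System;
using System.Text;

public static class EncodingSizeDemo
{
    // Returns the encoded size of the text in UTF-8 and in UTF-16, in bytes.
    public static (int Utf8, int Utf16) Sizes(string text)
    {
        return (Encoding.UTF8.GetByteCount(text), Encoding.Unicode.GetByteCount(text));
    }

    public static void Main()
    {
        // ASCII text: one byte per character in UTF-8, two in UTF-16.
        Console.WriteLine(Sizes("hello")); // (5, 10)

        // Three Japanese characters ("日本語"): three bytes each in UTF-8, two each in UTF-16.
        Console.WriteLine(Sizes("\u65E5\u672C\u8A9E")); // (9, 6)
    }
}
```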

Answer by Mike

Encodings all boil down to the fact that if you use 8 bits for a character, you can only handle 256 distinct characters. Seeing as the UK and US set up the conventions, the 128 standard ASCII characters are unaccented western characters, and the various 8-bit code pages disagree about what the remaining 128 values mean.

That's where UTF8 and UTF16 come into play. UTF8 is a lot like ASCII: it uses one byte for every ASCII character. A character outside the ASCII range starts with a special lead byte whose high bits say how many continuation bytes follow (one to three), and the lead byte and continuation bytes together indicate the true character.

UTF16 (called Unicode in .NET, as in Encoding.Unicode) does away with the special indicator byte and uses 16 bits for most characters. As we all know, 16 bits gives you 65536 distinct values, which isn't quite enough to cover all the world's written characters, so the rarer ones are encoded as a pair of 16-bit units (a surrogate pair).
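One caveat is easy to check in code: a character outside the first 65536 code points occupies two 16-bit units in UTF16, known as a surrogate pair, so string.Length counts units rather than characters. The sketch below uses a hypothetical helper, CountCodePoints, to show the difference.

```csharp
using System;
using System.Text;

public static class SurrogatePairDemo
{
    // Counts Unicode code points, treating each surrogate pair as one.
    public static int CountCodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            // A high surrogate followed by a low surrogate is a single code point.
            if (char.IsSurrogatePair(s, i)) i++;
            count++;
        }
        return count;
    }

    public static void Main()
    {
        string s = "a\U0001F600"; // 'a' followed by U+1F600, which lies beyond 16 bits

        Console.WriteLine(s.Length);                          // 3 UTF-16 units
        Console.WriteLine(CountCodePoints(s));                // 2 code points
        Console.WriteLine(Encoding.Unicode.GetByteCount(s));  // 6 bytes
    }
}
```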

So to answer your question: if most of your characters are unaccented western characters, UTF8 will be the most compact representation for you (and most readable in many editors). If the bulk of your characters are non-western (say, Chinese), you'll probably want to use Unicode (aka UTF16).

Good luck!

Answer by Franci Penov

You need to use the proper encoding, as all the other answers mentioned.

The problem is how to discover the encoding. That depends on the source of your file:

  1. If it is an XML file, there should be an <?xml ... ?> declaration at the beginning of the file that specifies the encoding. If there isn't one, you should assume it's UTF8.
  2. If it is a text file, you can try UTF8 encoding, or if that fails, you should try the system locale of the machine you're running on. If that fails too, you are pretty much on your own, unless you know someone who can tell you the system locale of the machine the file was created on.
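The "try UTF8, then fall back" step for text files could look roughly like the sketch below. It relies on a strict UTF8Encoding that throws on invalid bytes instead of silently substituting replacement characters. The names Utf8WithFallback and Decode are invented for this example, and the fallback encoding is passed in explicitly: on .NET Framework the system ANSI code page is available as Encoding.Default, but on .NET Core and later Encoding.Default is always UTF-8, so it cannot serve as the locale fallback there.

```csharp
using System;
using System.Text;

public static class Utf8WithFallback
{
    // Strict UTF-8: no BOM on output, throw on invalid byte sequences.
    private static readonly Encoding StrictUtf8 = new UTF8Encoding(false, true);

    // Tries UTF-8 first; if the bytes are not valid UTF-8, uses the fallback.
    public static string Decode(byte[] data, Encoding fallback)
    {
        try
        {
            return StrictUtf8.GetString(data);
        }
        catch (DecoderFallbackException)
        {
            return fallback.GetString(data);
        }
    }

    public static void Main()
    {
        // "für" in ISO-8859-1; the lone byte FC is not valid UTF-8.
        byte[] latin1Bytes = { 0x66, 0xFC, 0x72 };
        Encoding fallback = Encoding.GetEncoding("iso-8859-1");

        Console.WriteLine(Decode(latin1Bytes, fallback).Length); // 3
    }
}
```

A successful UTF-8 decode is strong evidence but not proof, since pure ASCII data is valid under almost every encoding.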

In any case, you should be able to cover about 90% of all files by using UTF8 with a fallback to UTF16. Almost every program or language from the last five years supports Unicode. However, if you are going to consume a lot of files from China, you might try UTF16 first, and keep in mind that such files may instead use GB18030, which is a separate encoding rather than a form of UTF16.

Answer by Ishmael

There is no completely reliable method, but you can use some heuristics to guess the encoding.

  1. Look for a byte order mark.
  2. If you don't find a BOM, assume the file is UTF-8 and try to parse it. If it's an XML file, the declaration may contain an encoding. Similarly, an HTML file may contain a meta encoding tag.
  3. Failing all the above, assume it's UTF-8 (or ANSI -- your choice).
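Step 1 of the heuristic above can be sketched with Encoding.GetPreamble, which returns the BOM bytes for an encoding. The BomSniffer class, its candidate list, and the DetectFromBom method are assumptions made for this example, not part of the framework. Longer BOMs are tested first because the UTF-32 little-endian BOM (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE).

```csharp
using System;
using System.Text;

public static class BomSniffer
{
    // Ordered so that the 4-byte UTF-32 BOMs are tested before the 2-byte UTF-16 ones.
    private static readonly Encoding[] Candidates =
    {
        new UTF32Encoding(false, true), // UTF-32 LE: FF FE 00 00
        new UTF32Encoding(true, true),  // UTF-32 BE: 00 00 FE FF
        Encoding.UTF8,                  // EF BB BF
        Encoding.Unicode,               // UTF-16 LE: FF FE
        Encoding.BigEndianUnicode       // UTF-16 BE: FE FF
    };

    // Returns the encoding whose BOM starts the data, or defaultEncoding if none matches.
    public static Encoding DetectFromBom(byte[] data, Encoding defaultEncoding)
    {
        foreach (Encoding candidate in Candidates)
        {
            byte[] preamble = candidate.GetPreamble();
            if (preamble.Length > 0 && StartsWith(data, preamble))
            {
                return candidate;
            }
        }
        return defaultEncoding;
    }

    private static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void Main()
    {
        byte[] utf8WithBom = { 0xEF, 0xBB, 0xBF, 0x68, 0x69 }; // BOM + "hi"
        Console.WriteLine(DetectFromBom(utf8WithBom, Encoding.ASCII).WebName);
    }
}
```

The ordering leaves one ambiguity in place: a UTF-16 little-endian file whose first character after the BOM is U+0000 would be reported as UTF-32, which is rare in text files but worth knowing about.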

Rick Strahl has a handy article on detecting encodings via the BOM. It's a bit dated: System.Text.Encoding now has a GetPreamble method, and StreamReader has an overload that will try to detect the encoding for you.

Answer by Vagner

You can also use a culture's ANSI code page to read accented characters such as á.

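// Requires: using System.Globalization; using System.IO; using System.Text;
// The path is a verbatim string (@) so that \t is not read as a tab escape.
CultureInfo pt = CultureInfo.GetCultureInfo("pt-BR");
StreamReader fileReader = new StreamReader(@"C:\temp\test.txt", Encoding.GetEncoding(pt.TextInfo.ANSICodePage), true);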