Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/6302544/

Time: 2020-09-06 14:47:41  Source: igfitidea

Is the default encoding for XML UTF-8 or UTF-16?

Tags: xml, xml-serialization

Asked by Pacerier

The OpenTag FAQ states:

If no encoding declaration is present in the XML document (and no external encoding declaration mechanism such as the HTTP header is available), the assumed encoding of an XML document depends on the presence of the Byte-Order-Mark (BOM).

The BOM is a Unicode special marker placed at the top of the file that indicates its encoding. The BOM is optional for UTF-8.

First bytes        Encoding assumed
-----------------------------------------
EF BB BF           UTF-8
FE FF              UTF-16 (big-endian)
FF FE              UTF-16 (little-endian)
00 00 FE FF        UTF-32 (big-endian)
FF FE 00 00        UTF-32 (little-endian)
None of the above  UTF-8
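
The lookup in the table above can be sketched in a few lines of Python. This is only an illustration of the BOM rule, not how any particular XML parser is implemented; the name `sniff_bom` is hypothetical, and the sketch ignores encoding declarations and external hints such as HTTP headers.

```python
import codecs

# Order matters: the UTF-32 LE BOM (FF FE 00 00) begins with the
# UTF-16 LE BOM (FF FE), so the longer signatures are checked first.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def sniff_bom(data: bytes) -> str:
    """Return the encoding implied by the first bytes (hypothetical helper)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # "None of the above" in the table: assume UTF-8.
    return "utf-8"
```

For example, `sniff_bom(b"\xff\xfe<\x00")` yields `"utf-16-le"`, while a file with no BOM at all falls through to `"utf-8"`.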

Is there a dumbed-down explanation of the above paragraph?

Answered by wimh

You can use a line like

<?xml version="1.0" encoding="iso-8859-1" ?>

to specify which encoding is used. If no encoding is specified, a byte order mark (BOM) may be present. If a BOM for either UTF-16 or UTF-32 is present, that encoding is used. Otherwise the encoding is UTF-8. (The BOM for UTF-8 is optional.)
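
To see the declaration and the UTF-8 default side by side, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The sample documents and the helper name `text_of` are made up for illustration; the only thing taken from the answer is the `encoding="iso-8859-1"` declaration.

```python
import xml.etree.ElementTree as ET

# With a declaration: byte E9 is "é" in ISO-8859-1.
declared = b'<?xml version="1.0" encoding="iso-8859-1" ?><word>caf\xe9</word>'

# No declaration and no BOM: the parser assumes UTF-8,
# so "é" has to be stored as the two bytes C3 A9.
undeclared = b"<word>caf\xc3\xa9</word>"


def text_of(document: bytes) -> str:
    """Parse raw bytes and return the root element's text (hypothetical helper)."""
    return ET.fromstring(document).text
```

Both documents should parse to the same string, "café". If the ISO-8859-1 bytes were supplied without the declaration, the parser would be expected to reject them, because a lone E9 byte is not valid UTF-8.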

Edit

The BOM is an invisible character, but there is no need to see it: applications take care of it automatically. When you use Windows Notepad, you can select the encoding when you save the file, and Notepad will automatically insert the BOM at the start of the file. When you later reopen the file, Notepad will recognise the BOM and use the proper encoding to read it. There is never any need to modify the BOM yourself; if you did, characters could take on a different meaning, and the text would no longer be the same.

I will try to explain with an example. Consider a text file containing just the characters "test". By default, Notepad uses ANSI encoding, so the text file looks like this when you view it in hex mode:

C:\>C:\gnuwin32\bin\hexdump -C test-ansi.txt
00000000  74 65 73 74                                       |test|
00000004

(As you can see, I am using hexdump from gnuwin32, but you can also use a hex editor like Frhed to see this.)

There is no BOM at the front of this file. It would not be possible, because the character used for the BOM does not exist in ANSI encoding. (Because there is no BOM, editors that don't support ANSI encoding would treat this file as UTF-8.)
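
A quick way to check the claim that the BOM character cannot be written in ANSI is to try encoding it. This sketch assumes "ANSI" means Windows-1252 (`cp1252`), which is the usual ANSI code page on Western European Windows installs; other code pages would need a different codec name. `can_encode` is a hypothetical helper.

```python
# The BOM is the Unicode character U+FEFF.
BOM = "\ufeff"


def can_encode(text: str, encoding: str) -> bool:
    """Report whether `text` is representable in `encoding` (hypothetical helper)."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(can_encode(BOM, "cp1252"))  # assumption: cp1252 stands in for "ANSI"
print(can_encode(BOM, "utf-8"))
print(BOM.encode("utf-8").hex())
```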

When I now save the file as UTF-8, you will see 3 extra bytes (the BOM) in front of "test":

C:\>C:\gnuwin32\bin\hexdump -C test-utf8.txt
00000000  ef bb bf 74 65 73 74                              |ï»¿test|
00000007

(If you open this file with a text editor that does not support UTF-8, you would actually see the BOM as three stray characters, "ï»¿".)

Notepad can also save the file as "Unicode", which means UTF-16 little-endian (UTF-16LE):

C:\>C:\gnuwin32\bin\hexdump -C test-unicode.txt
00000000  ff fe 74 00 65 00 73 00  74 00                    |ÿþt.e.s.t.|
0000000a

And here is the version saved as unicode (big endian) (UTF-16BE):

C:\>C:\gnuwin32\bin\hexdump -C test-unicode-big-endian.txt
00000000  fe ff 00 74 00 65 00 73  00 74                    |þÿ.t.e.s.t|
0000000a
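
The byte sequences in the hex dumps above can be reproduced without Notepad. The following is a rough sketch in Python; it assumes "ANSI" is Windows-1252 and writes the BOM explicitly so the result does not depend on the machine's byte order. The helper name `with_bom` is made up for this example.

```python
import codecs


def with_bom(text: str, encoding: str, bom: bytes) -> bytes:
    """Encode `text` and put the given BOM in front (hypothetical helper)."""
    return bom + text.encode(encoding)


ansi = "test".encode("cp1252")  # assumption: ANSI == cp1252, no BOM possible
utf8 = with_bom("test", "utf-8", codecs.BOM_UTF8)
utf16le = with_bom("test", "utf-16-le", codecs.BOM_UTF16_LE)
utf16be = with_bom("test", "utf-16-be", codecs.BOM_UTF16_BE)

# Print each variant as space-separated hex bytes (needs Python 3.8+).
for name, data in [("ansi", ansi), ("utf-8", utf8),
                   ("utf-16le", utf16le), ("utf-16be", utf16be)]:
    print(f"{name:9} {data.hex(' ')}")
```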

Now consider a text file with the 4 Chinese characters "琀攀猀琀". When I save that as unicode (big endian), the result looks like this:

C:\>C:\gnuwin32\bin\hexdump -C test2-unicode-big-endian.txt
00000000  fe ff 74 00 65 00 73 00  74 00                    |þÿt.e.s.t.|
0000000a

As you can see, the word "test" in UTF-16LE is stored the same way as the word "琀攀猀琀" in UTF-16BE. But because the BOM is stored differently, you can tell whether the file contains "test" or "琀攀猀琀". Without a BOM you would have to guess.
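
The ambiguity can be demonstrated directly: the same eight payload bytes decode to two different strings depending on the byte order, and only the BOM settles which one is meant. This is a small sketch using Python's built-in codecs; the variable names are illustrative.

```python
# The 8 bytes that follow the BOM in both hex dumps above.
payload = bytes.fromhex("7400650073007400")

# Without a BOM the reader has to pick a byte order.
as_le = payload.decode("utf-16-le")  # "test"
as_be = payload.decode("utf-16-be")  # U+7400 U+6500 U+7300 U+7400

# With a BOM, the generic "utf-16" codec reads it and picks the order itself.
with_le_bom = (b"\xff\xfe" + payload).decode("utf-16")
with_be_bom = (b"\xfe\xff" + payload).decode("utf-16")

print(as_le, as_be, with_le_bom, with_be_bom)
```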
