What use is the 'encoding' in the XML header?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me), citing the original address and author information. StackOverFlow original address: http://stackoverflow.com/questions/5165347/

Time: 2020-09-06 14:35:16  Source: igfitidea

Tags: xml, header, character-encoding

Asked by xtofl

Looking at the XML header

<?xml version="1.0" encoding="UTF-16" standalone="no"?>

Am I right to state that the encoding attribute is

  • coming too late (you can't read it properly unless you know the encoding...)
  • redundant, hence error-prone: it's all too easy to replace it with "Big5" yet save the file in UTF-8

Or is that attribute not about the content of the stream?

Am I mixing up things here?

Accepted answer by Joachim Sauer

As you mentioned, you'd have to know the encoding of the file to read the encoding attribute.

However, there is a heuristic that can easily get you close enough to the "real" encoding to allow you to read the encoding attribute. This works because the <?xml part by definition can only contain characters in the ASCII range (however they are encoded).

The XML standard even describes the exact process used to find out the encoding.
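
To make the heuristic concrete, here is a minimal sketch in Python. It is not the spec's exact procedure: the function name declared_encoding, the FAMILIES list, and the 512-byte window are assumptions made for illustration, and byte order marks are ignored. It guesses the encoding family from how "<?xml" is laid out in the first bytes, decodes just the declaration with that family, and then reads the label.

```python
import re

# Encoding "families" to try, widest code units first. Within a family the
# ASCII-only declaration decodes identically, so any member can read it.
FAMILIES = ["utf-32-be", "utf-32-le", "utf-16-be", "utf-16-le", "ascii"]

# Matches encoding="..." or encoding='...' with an XML-style encoding name.
ENCODING_RE = re.compile(r'encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')


def declared_encoding(data: bytes):
    """Return the encoding label from the XML declaration, or None."""
    for family in FAMILIES:
        if data.startswith("<?xml".encode(family)):
            break
    else:
        return None  # no declaration found; the caller falls back to a default
    # Bytes outside the ASCII range are irrelevant here, so replace them.
    head = data[:512].decode(family, errors="replace")
    declaration = head.split("?>", 1)[0]
    match = ENCODING_RE.search(declaration)
    return match.group(1) if match else None


doc = '<?xml version="1.0" encoding="KOI8-R"?><a>привет</a>'.encode("koi8-r")
print(declared_encoding(doc))
```

Only after this step does a parser know which concrete encoding to use for the rest of the document.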

And the encoding label isn't redundant either. For example, if you use the algorithm in the XML spec to find out that some ASCII-based (or ASCII-compatible) encoding is used, you still need to read the encoding to find out which one is actually used (valid candidates would be ASCII, UTF-8, any of the ISO-8859-* encodings, any of the Windows-* encodings, KOI8-R and many, many others). For the <?xml part itself it won't make a difference which one it is, but for the rest of the document, it can make a huge difference.
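
A small Python illustration of that point (the sample bytes and the choice of three encodings are made up for this example): the same byte string is decoded with three ASCII-compatible encodings. The declaration comes out the same every time, but the element content does not.

```python
# One byte string, no encoding label: 0xE9 is "é" only if this is ISO-8859-1.
raw = b'<?xml version="1.0"?><w>\xe9t\xe9</w>'

decoded = {name: raw.decode(name) for name in ("iso-8859-1", "cp1251", "koi8-r")}

for name, text in decoded.items():
    print(f"{name:11} {text}")
```

Without the label, a reader has no way to tell which of the three results was intended.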

Regarding mislabeled XML files: yes, it's easy to produce those. However, the XML spec clearly specifies that those files are malformed and as such are not correct XML. Incorrect encodings must be reported as an error (as long as they can be detected!). So it's the problem of whoever is producing the XML.
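
As a rough demonstration with Python's standard xml.etree.ElementTree (the sample documents are invented for this sketch): a file labeled UTF-8 but saved as ISO-8859-1 is rejected, while a file labeled ISO-8859-1 but saved as UTF-8 slips through, because every byte sequence is valid ISO-8859-1 and the mistake cannot be detected.

```python
import xml.etree.ElementTree as ET

utf8_label = '<?xml version="1.0" encoding="UTF-8"?><w>été</w>'
latin1_label = '<?xml version="1.0" encoding="ISO-8859-1"?><w>été</w>'

# Label and bytes agree: parses fine.
well_labeled = utf8_label.encode("utf-8")
print(ET.fromstring(well_labeled).text)

# Label says UTF-8, bytes are ISO-8859-1: 0xE9 is not valid UTF-8 here.
detectable = utf8_label.encode("iso-8859-1")
try:
    ET.fromstring(detectable)
    rejected = False
except ET.ParseError as err:
    rejected = True
    print("rejected:", err)

# Label says ISO-8859-1, bytes are UTF-8: parses, but the text is mojibake.
undetectable = latin1_label.encode("utf-8")
garbled = ET.fromstring(undetectable).text
print(repr(garbled))
```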

Answer by Michael Kay

You're quite right that it looks like an odd design. It only works because the XML declaration uses only ASCII characters, and nearly all encodings are supersets of ASCII. If you're prepared to accept something that isn't, for example EBCDIC, you can check whether the file starts with whatever the EBCDIC representation of "<?xml" is. Which means you're relying on the general level of redundancy in the header of the file, rather than purely the encoding attribute itself. Like many things in XML, it's pragmatic and works, but isn't particularly elegant.
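
The EBCDIC check can be sketched in a few lines of Python. This assumes code page 037 (Python's "cp037" codec) as the EBCDIC variant; the helper name looks_like_ebcdic_xml and the "IBM037" label in the sample are illustrative only.

```python
# "<?xml" as it appears in an EBCDIC (code page 037) file.
EBCDIC_PREFIX = "<?xml".encode("cp037")


def looks_like_ebcdic_xml(data: bytes) -> bool:
    """True if the data starts with the EBCDIC form of '<?xml'."""
    return data.startswith(EBCDIC_PREFIX)


sample = '<?xml version="1.0" encoding="IBM037"?><a/>'.encode("cp037")
print(EBCDIC_PREFIX.hex(" "), looks_like_ebcdic_xml(sample))
```

Once the family is known, the declaration can be decoded with it and the actual EBCDIC code page read from the encoding label.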

Answer by Delan Azabani

XML parsers are only required to support UTF-8 and UTF-16. The XML parser starts by trying the encodings based on the Byte Order Mark (BOM), if present (for UTF-16, UTF-32 and even UTF-8 with the dummy BOM). If none is found, then the parser will try UTF-32, UTF-16, UTF-8, ASCII and other ASCII-compatible single-byte encodings. Only then will it see the encoding attribute, and will restart parsing if necessary.
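
The BOM step might look like this in Python, using the constants from the standard codecs module. The split_bom helper is a made-up name for this sketch, not part of any parser. UTF-32 has to be checked before UTF-16, because the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian one.

```python
import codecs

# Order matters: BOM_UTF32_LE (FF FE 00 00) starts with BOM_UTF16_LE (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def split_bom(data: bytes):
    """Return (encoding, data without BOM), or (None, data) if there is no BOM."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data


# Python's "utf-16" codec writes a BOM in the platform's native byte order.
name, rest = split_bom('<?xml version="1.0"?><a/>'.encode("utf-16"))
print(name, rest.decode(name))
```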

Answer by Zsub

I think in principle you might have a point that the encoding statement is 'late' in the file; however, the whole first line only uses basic characters. AFAIK, those are the same in almost all encodings, so whatever you decode it as, it'll read <?xml ... ?> anyway.

Whatever comes after that, however, could matter. For example, text in a CDATA section could be encoded in a Cyrillic encoding.
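
A quick way to check this in Python (the list of encodings and the sample strings are arbitrary picks for illustration): encode the declaration in several ASCII-compatible encodings and compare the bytes, then do the same for a Cyrillic CDATA section.

```python
declaration = '<?xml version="1.0" encoding="KOI8-R"?>'
body = "<a><![CDATA[привет]]></a>"

ascii_compatible = ["utf-8", "iso-8859-1", "cp1251", "koi8-r", "big5"]

# The declaration is pure ASCII, so every encoding yields the same bytes.
declaration_bytes = {declaration.encode(name) for name in ascii_compatible}
print(len(declaration_bytes))

# The Cyrillic body is a different byte sequence in each encoding that has it.
body_bytes = {name: body.encode(name) for name in ("utf-8", "cp1251", "koi8-r")}
for name, data in body_bytes.items():
    print(f"{name:7} {data!r}")
```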