Why are "control" characters illegal in XML 1.0?

Notice: this page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/404107/

Time: 2020-09-06 12:16:07  Source: igfitidea

Tags: xml, unicode, history

Asked by Trochee

There are a variety of characters that are not legally encodable in XML 1.0, e.g. U+0007 ('bell') and U+001B ('escape'). Most of the interesting ones are non-whitespace 'control' characters.

It's clear from (e.g.) this question and others that it's the XML spec that's the issue -- but can anyone illuminate me as to why the XML spec forbids these characters?

It seems like it could have been required that they be encoded in escapes, e.g. as &#x0007; and &#x001B; respectively, but perhaps there's a practical reason that the characters were forbidden rather than required to be escaped?
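
To make the complaint concrete, here is a minimal sketch using the expat-based parser that ships with CPython's standard library (an XML 1.0 parser). The helper name `parses` is made up for this illustration. It shows that BEL and ESC are rejected both as raw characters and as numeric character references, while a tab is accepted.

```python
import xml.etree.ElementTree as ET

def parses(xml_text: str) -> bool:
    """Return True if the standard-library XML 1.0 parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(parses("<a>tab:&#x9;</a>"))  # True: tab is allowed whitespace
print(parses("<a>\x07</a>"))       # False: raw BEL is not a legal XML 1.0 character
print(parses("<a>&#x7;</a>"))      # False: the reference form is rejected too
print(parses("<a>&#x1B;</a>"))     # False: same for ESC
```

Other XML 1.0 parsers should behave the same way, since the restriction comes from the spec rather than from this particular library.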

Answerers have suggested that there is some motivation towards avoiding transmission control characters, but Unicode includes many other control-like characters (consider U+200C "zero width non joiner"). I recognize there may be no good reason for this behavior, but I would still like to understand it better.

It's particularly frustrating because when those character values appear in other data formats, I end up "double-escaping" new XML documents that need to encode them.

Accepted answer by annakata

My understanding is that this range is barred on the grounds that a markup language should not have any need to support transmission and flow control characters, and that including them would create a problem for any editors and parsers in binary conversion.

I'm struggling to find anything ex cathedra on this from Tim Bray et al though.

edit: some discussion of control chars and a vague admission it wasn't exactly over-engineered:

At 09:27 AM 17/06/00 -0500, Mark Volkmann wrote:

I've never seen a discussion of the reason why most ASCII control characters, such as a form feed, are not allowed in XML documents. Can anyone tell me the reason behind that decision or point me to a spec. that explains that?

I'm not sure we'd do it the same way if we were doing it again. I don't see that they do any real harm. Clearly, if you're optimizing for a highly interoperable content markup language (and XML is) it's legitimate to be suspicious of things like vertical-tab and backspace and so on... but then how can it be consistent to leave in \n and DEL and so on? -Tim

Answer by bobince

It seems like it could have been required that they be encoded in escapes, e.g. as &#x0007; and &#x001B;

You can do exactly that in XML 1.1, for all but \0.

Answer by bobince

That was a long time ago, but my best recollection was that they have no graphical representation and also no agreed-upon semantics. Picking a couple at random we see U+0006 "Acknowledge" or U+0016 "Synchronous idle"... what do those mean? Unicode doesn't say. Even back when everyone claimed to support ASCII, there was no interoperability around this junk. XML is supposed to be about interoperability.

The experience has been that people who want to use these things really want to jam binary data into their XML elements (and the next thing they want is to include U+0000 NULL), which has been an explicit non-goal of XML since day 1. If you want to represent the numbers 0x6 or 0x16, there are lots of good ways to do that which don't muddy the notion of "character".
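
One of those "good ways" is to carry the bytes as Base64 text instead of as characters. The sketch below is only an illustration: the element name `blob` and the `encoding` attribute are made up for this example and are not part of any standard.

```python
import base64
import xml.etree.ElementTree as ET

# Bytes that would be illegal as XML 1.0 characters (ACK, SYN, NUL, ESC).
payload = bytes([0x06, 0x16, 0x00, 0x1B])

# Store the bytes as Base64 text; "blob" and "encoding" are hypothetical names.
elem = ET.Element("blob", encoding="base64")
elem.text = base64.b64encode(payload).decode("ascii")
xml_text = ET.tostring(elem, encoding="unicode")

# Round trip: parse the document again and recover the original bytes.
decoded = base64.b64decode(ET.fromstring(xml_text).text)
print(xml_text)
print(decoded == payload)
```

The document stays well-formed XML 1.0 because only Base64's ASCII alphabet ever appears in the element content.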

Answer by Jirka Hanika

It's probably time to resummarize, also with a view to XML 1.1.

What control character code points are there in Unicode?

  • U+0000 to U+001F, inherited from ASCII.
  • U+007F, inherited from ASCII.
  • U+0080 to U+009F, inherited from Latin-1.
  • various special-purpose ranges, standardized explicitly for Unicode, and mostly useful in non-markup contexts. They are discussed here block by block, including reasons why and how to use them or not use them in XML, and what to do if you run into them anyway.
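
The inherited ranges listed above can be checked against the Unicode character database. This is a small sketch using Python's `unicodedata` module; the helper names `category` and `ranges` are made up for the example. The inherited controls have general category Cc, while the Unicode-specific control-like characters fall into other categories such as Cf (format).

```python
import unicodedata

def category(cp: int) -> str:
    """General category of a code point, e.g. 'Cc' (control) or 'Cf' (format)."""
    return unicodedata.category(chr(cp))

def ranges(code_points):
    """Collapse a sorted list of code points into inclusive (start, end) ranges."""
    out = []
    for cp in code_points:
        if out and cp == out[-1][1] + 1:
            out[-1] = (out[-1][0], cp)
        else:
            out.append((cp, cp))
    return out

# Every code point whose general category is Cc.
cc = [cp for cp in range(0x110000) if category(cp) == "Cc"]
print([(hex(a), hex(b)) for a, b in ranges(cc)])  # U+0000..U+001F and U+007F..U+009F

print(category(0x200C))  # 'Cf': zero width non-joiner is a format character
print(category(0x2028))  # 'Zl': line separator
```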

How does XML look at those control characters?

This is a different classification.

  • Tab and newline (regardless of the platform dependency of what's a newline) are good. Everybody uses them. Everybody knows what they are supposed to stand for. Allowed in almost all known forms, often even for pretty printing of the markup itself.
  • U+0000 is evil. Null character? String terminator? Binary noise? Antithesis to both interoperability and markup. Forbidden in all forms.
  • Anything else? Scarcely used, problematic interoperability, but there are ways to tolerate them even without knowing much about what they are supposed to "control".

Let's now switch our attention to this last category only, control codes proper. That is, the following summary does NOT apply to tabs and newlines: U+0009, U+000A, U+000D, U+0085, U+2028.

XML 1.0 allows all the above ranges of control characters, except U+0000 to U+001F, both as text (directly included characters) and as numeric character references. Allowing U+007F to U+009F was apparently by omission, and this inconsistency was corrected in XML 1.1, but the other way round. They even gave a detailed rationale inside the standard:

Finally, there is considerable demand to define a standard representation of arbitrary Unicode characters in XML documents. Therefore, XML 1.1 allows the use of character references to the control characters #x1 through #x1F, most of which are forbidden in XML 1.0. For reasons of robustness, however, these characters still cannot be used directly in documents. In order to improve the robustness of character encoding detection, the additional control characters #x7F through #x9F, which were freely allowed in XML 1.0 documents, now must also appear only as character references. (Whitespace characters are of course exempt.) The minor sacrifice of backward compatibility is considered not significant. Due to potential problems with APIs, #x0 is still forbidden both directly and as a character reference.
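
The rules in that rationale can be written down as a small classifier. This is a sketch, not a validator: the range tables are transcribed from the `Char` productions of XML 1.0 and XML 1.1 and from the `RestrictedChar` production of XML 1.1, and the function names and the three status labels are made up for this illustration.

```python
# Inclusive code point ranges from the XML specifications.
XML10_CHAR = [(0x9, 0xA), (0xD, 0xD), (0x20, 0xD7FF),
              (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]
XML11_CHAR = [(0x1, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]
XML11_RESTRICTED = [(0x1, 0x8), (0xB, 0xC), (0xE, 0x1F),
                    (0x7F, 0x84), (0x86, 0x9F)]

def in_ranges(cp, table):
    return any(lo <= cp <= hi for lo, hi in table)

def status(cp, version="1.0"):
    """Classify a code point as 'literal', 'reference-only' or 'forbidden'.

    'literal' means it may appear directly (and also as a character reference);
    'reference-only' means it may appear only as a reference such as &#x7;.
    """
    if version == "1.0":
        return "literal" if in_ranges(cp, XML10_CHAR) else "forbidden"
    if not in_ranges(cp, XML11_CHAR):
        return "forbidden"
    return "reference-only" if in_ranges(cp, XML11_RESTRICTED) else "literal"

for cp in (0x0, 0x7, 0x9, 0x1B, 0x7F, 0x85):
    print(f"U+{cp:04X}: 1.0={status(cp, '1.0')}, 1.1={status(cp, '1.1')}")
```

Note that U+0085 stays usable as a literal in XML 1.1 because it is treated as a line end there, which is why it is carved out of the restricted range.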

Why do Unicode and XML allow free use of markup-like control characters, apart from the few "inherited" ranges? People should be using markup for those.

Unicode is also used in non-markup contexts, and it is a still evolving character set. It would be too difficult to implement a conforming XML processor if the set of non-control characters was a moving target.

OK, what's wrong with the inherited ranges then, compared to the Unicode-specific control characters?

Lack of standardization. The Unicode consortium didn't really get to choose which numbers are assigned to those "characters", or what their typical visual presentation or meaning is. Full backward compatibility with ASCII (at the encoded UTF-8 level) and with Latin-1 (at the code point assignment level) forced raw inclusion of these code points, regardless of the various specialized and overloaded meanings often attached to them in various text processing contexts.

Wait, are you saying that XML isn't meant to be fully backward compatible with ASCII, unlike UTF-8?

Yeah. That's correct. You need a document element. You can't even put in a raw < or &. So why would you ever need to put in raw control characters?
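
The point about raw < and & is easy to see with the standard library. This is a minimal sketch; the sample string is made up. Plain ASCII text is not automatically valid XML content and has to be escaped first.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

text = "AT&T < 5"  # ordinary ASCII, but not valid as raw XML content

# Dropping the text into an element unescaped is not well-formed.
try:
    ET.fromstring("<a>" + text + "</a>")
    raw_ok = True
except ET.ParseError:
    raw_ok = False

# Escaping replaces & and < (and >) with predefined entity references.
escaped = escape(text)
round_trip = ET.fromstring("<a>" + escaped + "</a>").text

print(raw_ok)              # False
print(escaped)             # AT&amp;T &lt; 5
print(round_trip == text)  # True
```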

Answer by foxxtrot

XML was designed specially around Unicode (specifically UTF-8 and UTF-16) and ISO/IEC 10646, both of which (I'm not quite positive about ISO 10646) contain the transmission/flow control characters which were left over from ASCII and the days of character-based terminals. While those characters still have uses, they don't belong in a format like XML.

As for these new encodings that use those codes for something else, well, it seems that the XML spec may need to adapt.

Answer by MSalters

Why are you double-escaping them? This seems like a good place for &bell; and &escape;. (Undefined, handled by callback from the parser to your code)
