Disclaimer: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/25612166/

Time: 2020-08-29 02:37:52  Source: igfitidea

What characters must be escaped in HTML 5?

html escaping html-escape-characters

Asked by ezequiel-garzon

HTML 4 states pretty clearly which characters should be escaped:

Four character entity references deserve special mention since they are frequently used to escape special characters:

  • "&lt;" represents the < sign.
  • "&gt;" represents the > sign.
  • "&amp;" represents the & sign.
  • "&quot;" represents the " mark.

Authors wishing to put the "<" character in text should use "&lt;" (ASCII decimal 60) to avoid possible confusion with the beginning of a tag (start tag open delimiter). Similarly, authors should use "&gt;" (ASCII decimal 62) in text instead of ">" to avoid problems with older user agents that incorrectly perceive this as the end of a tag (tag close delimiter) when it appears in quoted attribute values.

Authors should use "&amp;" (ASCII decimal 38) instead of "&" to avoid confusion with the beginning of a character reference (entity reference open delimiter). Authors should also use "&amp;" in attribute values since character references are allowed within CDATA attribute values.

Some authors use the character entity reference "&quot;" to encode instances of the double quote mark (") since that character may be used to delimit attribute values.
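For illustration only (this is not part of the HTML 4 text): Python's standard html.escape function happens to emit exactly these four entity references, plus &#x27; for the single quote when quote=True. A minimal sketch, assuming Python 3 and a made-up sample string:

```python
import html

# html.escape replaces & first, then < and >. With quote=True (the default)
# it also replaces " with &quot; and ' with &#x27;.
sample = '1 < 2 & "x" > y'
escaped = html.escape(sample)
print(escaped)

# With quote=False, quotation marks are left alone (fine for text content,
# not for attribute values).
print(html.escape('say "hi"', quote=False))
```

Escaping & first matters: doing it last would re-escape the ampersands introduced by the other replacements.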

I'm surprised I can't find anything like this in HTML 5. With the help of grep the only non-XML mention I could find comes as an aside regarding the deprecated XMP element:

Use pre and code instead, and escape "<" and "&" characters as "&lt;" and "&amp;" respectively.

Could someone point to the official source on this matter?

Accepted answer by Ry-

The specification defines the syntax for normal elements as:

Normal elements can have text, character references, other elements, and comments, but the text must not contain the character U+003C LESS-THAN SIGN (<) or an ambiguous ampersand. Some normal elements also have yet more restrictions on what content they are allowed to hold, beyond the restrictions imposed by the content model and those described in this paragraph. Those restrictions are described below.

So you have to escape <, or & when it is followed by anything that could begin a character reference. The rule on ampersands is the only such rule for quoted attributes, as the matching quotation mark is the only thing that will terminate one. (Obviously, if you don't want to terminate the attribute value there, escape the quotation mark.)

These rules don't apply to <script> and <style>; you should avoid putting dynamic content in those. (If you have to include JSON in a <script>, replace < with \x3c, the U+2028 character with \u2028, and U+2029 with \u2029 after JSON serialization.)
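The JSON replacements mentioned above could be sketched as follows. This is only an illustration in Python; the function name json_for_script is hypothetical, and ensure_ascii=False is passed so that the U+2028/U+2029 replacements are actually exercised (Python's default ensure_ascii=True would already escape them).

```python
import json

def json_for_script(value):
    """Serialize value to JSON, then apply the three replacements so the
    result can be inlined in a <script> element (hypothetical helper)."""
    text = json.dumps(value, ensure_ascii=False)
    return (
        text.replace("<", "\\x3c")          # no "</script>" or "<!--" can appear
        .replace("\u2028", "\\u2028")       # LINE SEPARATOR
        .replace("\u2029", "\\u2029")       # PARAGRAPH SEPARATOR
    )

print(json_for_script({"html": "</script>"}))
print(json_for_script("a\u2028b"))
```

Note that \x3c is a JavaScript string escape, not a JSON one, so the output is meant to be read as a JavaScript literal; writing \u003c instead would keep the result valid JSON as well.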

Answer by user123444555621

From http://www.w3.org/html/wg/drafts/html/master/single-page.html#serializing-html-fragments

Escaping a string (for the purposes of the algorithm* above) consists of running the following steps:

  1. Replace any occurrence of the "&" character by the string "&amp;".
  2. Replace any occurrences of the U+00A0 NO-BREAK SPACE character by the string "&nbsp;".
  3. If the algorithm was invoked in the attribute mode, replace any occurrences of the """ character by the string "&quot;".
  4. If the algorithm was not invoked in the attribute mode, replace any occurrences of the "<" character by the string "&lt;", and any occurrences of the ">" character by the string "&gt;".

* Algorithm is the built-in serialization algorithm, as called e.g. by the innerHTML getter.
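The four quoted steps translate almost line for line into code. A minimal sketch in Python, assuming a hypothetical function name escape_string and an attribute_mode flag standing in for the spec's "attribute mode"; it is not the browser's actual implementation:

```python
def escape_string(s: str, attribute_mode: bool = False) -> str:
    """Follow the four quoted steps in order (illustrative sketch)."""
    # Step 1: ampersands first, so later replacements are not re-escaped.
    s = s.replace("&", "&amp;")
    # Step 2: U+00A0 NO-BREAK SPACE.
    s = s.replace("\u00a0", "&nbsp;")
    if attribute_mode:
        # Step 3: only the double quote needs escaping inside an attribute.
        s = s.replace('"', "&quot;")
    else:
        # Step 4: < and > are escaped only outside attributes.
        s = s.replace("<", "&lt;").replace(">", "&gt;")
    return s

print(escape_string("a < b & c"))
print(escape_string('say "hi" <now>', attribute_mode=True))
```

In attribute mode, < and > pass through unchanged, which matches the quoted steps.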

Strictly speaking, this is not exactly an answer to your question, since it deals with serialization rather than parsing. But on the other hand, the serialized output is designed to be safely parsable. So, by implication, when writing markup:

  1. The & character should be replaced by &amp;
  2. Non-breaking spaces should be escaped as &nbsp; (surprise!...)
  3. Within attributes, " should be escaped as &quot;
  4. Outside of attributes, < should be escaped as &lt; and > should be escaped as &gt;

I'm intentionally writing "should", not "must", since parsers may be able to correct violations of the above.
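To illustrate that tolerance, here is a small sketch using Python's standard html.parser module. That module is not a spec-compliant HTML5 parser, so treat this only as an example of a lenient parser; the Collector class and parse helper are hypothetical names.

```python
from html.parser import HTMLParser

class Collector(HTMLParser):
    """Collect attribute pairs and text content (illustrative helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.attrs = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)

    def handle_data(self, data):
        self.text.append(data)

def parse(markup):
    collector = Collector()
    collector.feed(markup)
    collector.close()
    return collector.attrs, "".join(collector.text)

# Unescaped ampersands followed by a space are recovered as literal "&".
attrs, text = parse('<p title="a & b">x & y</p>')
print(attrs, text)

# The properly escaped markup yields the same result.
print(parse('<p title="a &amp; b">x &amp; y</p>'))
```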

Answer by Sylvain Leroux

Adding my voice to insist that things are not that easy -- strictly speaking:

Case 1 : HTML serialization

(the most common)

If you serialize your HTML5 as HTML, "the text must not contain the character U+003C LESS-THAN SIGN (<) or an ambiguous ampersand."

An ambiguous ampersand is an "ampersand followed by one or more alphanumeric ASCII characters, followed by a U+003B SEMICOLON character (;)", where those characters do not match any named character reference.

Furthermore, "the parsing of certain named character references in attributes happens even with the closing semicolon being omitted."

So, in that case, editable && copy (notice the spaces around &&) is a valid construction in HTML5 serialized as HTML, as none of the ampersands is followed by a letter.

As a counter example: editable&&copy is not safe (even if this might work), as the last sequence &copy might be interpreted as the entity reference for ©.
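The difference between the two spellings can be observed with Python's standard html.unescape, which is documented to follow the HTML 5 rules for character references, including names with the semicolon omitted. This is only a sketch of how the text is decoded; it does not model the extra attribute-specific rules.

```python
import html

# Ampersands followed by a space or another "&" cannot start a reference.
print(html.unescape("editable && copy"))

# Here "&copy" is decoded as the copyright sign even without a semicolon.
print(html.unescape("editable&&copy"))

# Escaping both ampersands removes the ambiguity.
print(html.unescape("editable&amp;&amp;copy"))
```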

Case 2 : XML serialization

(the less common)

Here the classic XML rules apply. For example, each and every ampersand either in the text or in attributes should be escaped as &amp;.

In that case, && (with or without spaces) is invalid XML. You should write &amp;&amp;.
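A quick way to see the XML behaviour is to feed both spellings to an XML parser. The sketch below uses Python's standard xml.etree.ElementTree and xml.sax.saxutils.escape; the helper name is_well_formed and the <p> wrapper element are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def is_well_formed(document):
    """Return True if the string parses as XML (illustrative helper)."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<p>a && b</p>"))           # bare ampersands are rejected
print(is_well_formed("<p>a &amp;&amp; b</p>"))   # escaped ampersands parse
print(escape("a && b"))                          # produces the escaped form
```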

Tricky, isn't it?
