What are the HTML and XML special characters?

Question

Asked by Ian Boyd

What are the special reserved character entities in HTML and in XML?

The information that I have says:

HTML:

& (replace with &amp;)
< (replace with &lt;)
> (replace with &gt;)
" (replace with &quot;)
' (replace with &apos;)

XML:

< (replace with &lt;)
> (replace with &gt;)
& (replace with &amp;)
' (replace with &apos;)
" (replace with &quot;)

But I cannot find documentation on either of these.

The W3C does mention, in Extensible Markup Language (XML) 1.0 (Fifth Edition), certain predefined entity references. But it says that these entities are predefined (in the same way that &copy; is predefined); not that they must be escaped:

4.6 Predefined Entities
[Definition: Entity and character references may both be used to escape the left angle bracket, ampersand, and other delimiters. A set of general entities (amp, lt, gt, apos, quot) is specified for this purpose. Numeric character references may also be used; they are expanded immediately when recognized and must be treated as character data, so the numeric character references "&#60;" and "&#38;" may be used to escape < and & when they occur in character data.]
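
As a rough check of the passage above, the following Python sketch parses each of the five predefined entities and the matching numeric character reference, and compares the results. It assumes the standard-library xml.etree.ElementTree parser; the wrapper element name r and the helper parsed_text are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The five general entities named in section 4.6 and the characters they stand for.
PREDEFINED = {"amp": "&", "lt": "<", "gt": ">", "apos": "'", "quot": '"'}


def parsed_text(fragment: str) -> str:
    """Wrap a fragment in a hypothetical <r> element and return its parsed text."""
    return ET.fromstring(f"<r>{fragment}</r>").text


for name, char in PREDEFINED.items():
    named = parsed_text(f"&{name};")           # e.g. &lt;
    numeric = parsed_text(f"&#{ord(char)};")   # e.g. &#60;
    # Both forms should come back as the same literal character.
    print(name, repr(named), named == numeric == char)
```

Both spellings reach the application as plain character data, which is why either can be used for escaping.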

What characters must be escaped into entity references in HTML? What characters must be escaped into entity references in XML?

Update:

From Extensible Markup Language (XML) 1.0 (Fifth Edition):

2.4 Character Data and Markup
The ampersand character (&) and the left angle bracket (<) must not appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they must be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively.
The right angle bracket (>) may be represented using the string "&gt;", and must, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.
To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as "&apos;", and the double-quote character (") as "&quot;".

I read the former as saying that the following must be escaped:

< (as &lt;)
& (as &amp;)

and that the following may be escaped, but must be when it appears as ]]>:

> (as &gt;)

And that ' and " don't have to be escaped at all, unless you want to have quotes inside quoted attributes.
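
One way to try this reading out is with Python's standard library, as in the sketch below. escape and quoteattr from xml.sax.saxutils are real functions; the note element, its title attribute, and the build_element helper are hypothetical names chosen for the example. escape replaces &, < and > in text, and quoteattr additionally chooses a quote style for the attribute value, escaping quotes only when it has to.

```python
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET


def build_element(text: str, attr: str) -> str:
    """Serialize a hypothetical <note title=...> element with escaped content."""
    # quoteattr returns the value already wrapped in quotes.
    return f"<note title={quoteattr(attr)}>{escape(text)}</note>"


text = 'if a < b && c then "quote" ]]> end'
attr = "it's a \"mixed\" <value>"

xml = build_element(text, attr)
parsed = ET.fromstring(xml)

print(xml)
# Round trip: the parser should hand back exactly what went in.
print(parsed.text == text, parsed.get("title") == attr)
```

Note that the quotes in the element text are left alone, while the attribute value, which contains both kinds of quote, gets its double quotes escaped.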

From HTML 4.01 Specification, HTML Document Representation:

5.3.2 Character entity references
Authors wishing to put the "<" character in text should use "&lt;" (ASCII decimal 60) to avoid possible confusion with the beginning of a tag (start tag open delimiter).
Similarly, authors should use "&gt;" (ASCII decimal 62) in text instead of ">" to avoid problems with older user agents that incorrectly perceive this as the end of a tag (tag close delimiter) when it appears in quoted attribute values.
Authors should use "&amp;" (ASCII decimal 38) instead of "&" to avoid confusion with the beginning of a character reference (entity reference open delimiter). Authors should also use "&amp;" in attribute values since character references are allowed within CDATA attribute values.
Some authors use the character entity reference "&quot;" to encode instances of the double quote mark (") since that character may be used to delimit attribute values.

HTML is much more wishy-washy on the rules, but it sounds like I should:

< should be replaced with &lt;
> should be replaced with &gt;
& should be replaced with &amp;
" should be replaced with &quot;

And if " can be an entity reference, I should also replace ' with &apos;.

Update Two

From HTML5 - A vocabulary and associated APIs for HTML and XHTML:

8.3 Serializing HTML fragments
Escaping a string (for the purposes of the algorithm above) consists of running the following steps:
Replace any occurrence of the "&" character by the string "&amp;".
Replace any occurrences of the U+00A0 NO-BREAK SPACE character by the string "&nbsp;".
If the algorithm was invoked in the attribute mode, replace any occurrences of the """ character by the string "&quot;".
If the algorithm was not invoked in the attribute mode, replace any occurrences of the "<" character by the string "&lt;", and any occurrences of the ">" character by the string "&gt;".

Which I read as HTML:

& by &amp; always
U+00A0 (no-break space) by &nbsp; always
" by &quot; if it's inside an attribute
< by &lt; if it's not in an attribute (i.e. attributes can contain <)
> by &gt; if it's not in an attribute (i.e. attributes can contain >)
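
The four steps quoted from section 8.3 translate almost line for line into code. The Python sketch below follows that quoted text only; the function name escape_html and the attribute_mode parameter are illustrative choices, not names taken from the specification. The ampersand has to be replaced first, otherwise the ampersands introduced by the later replacements would themselves be escaped.

```python
def escape_html(s: str, attribute_mode: bool = False) -> str:
    """Escape a string following the steps quoted from section 8.3 above."""
    # Step 1: ampersands first, so later replacements are not double-escaped.
    s = s.replace("&", "&amp;")
    # Step 2: U+00A0 NO-BREAK SPACE always becomes &nbsp;.
    s = s.replace("\u00a0", "&nbsp;")
    if attribute_mode:
        # Step 3: only the double quote is escaped inside attribute values.
        s = s.replace('"', "&quot;")
    else:
        # Step 4: angle brackets are escaped only outside attribute values.
        s = s.replace("<", "&lt;").replace(">", "&gt;")
    return s


sample = 'a < b & "c"'
print(escape_html(sample))                       # text mode
print(escape_html(sample, attribute_mode=True))  # attribute mode
```

In text mode the quotes survive untouched; in attribute mode the angle brackets do.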

Answer 1

Accepted answer by johnluetke

First, you're comparing an HTML 4.01 specification with an HTML 5 one. HTML5 ties in more closely with XML than HTML 4.01 ever did (that's why we have XHTML), so this answer will stick to HTML 5 and XML.

Your quoted references are all consistent on the following points:

< should always be represented with &lt; when not indicating a processing instruction
> should always be represented with &gt; when not indicating a processing instruction
& should always be represented with &amp;
except when within <![CDATA[ ]]> (which only applies to XML)

I agree 100% with this. You never want the parser to mistake literals for instructions, so it's a solid idea to always encode any of these non-space (see below) characters. Good parsers know that anything contained within <![CDATA[ ]]> is not an instruction, so the encoding is not necessary there.
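
The CDATA exception is easy to see with a parser. The sketch below, which assumes Python's standard-library xml.etree.ElementTree and a made-up wrapper element r, feeds the same text to the parser three ways: inside a CDATA section, entity-escaped, and raw.

```python
import xml.etree.ElementTree as ET


def element_text(xml: str) -> str:
    """Parse a one-element document and return its text content."""
    return ET.fromstring(xml).text


raw = "if a < b && c > d"

# Inside CDATA the special characters are taken literally.
in_cdata = element_text(f"<r><![CDATA[{raw}]]></r>")
# Outside CDATA the same text has to be escaped.
escaped = element_text("<r>if a &lt; b &amp;&amp; c &gt; d</r>")
print(in_cdata == raw, escaped == raw)

# Unescaped and outside CDATA, the parser rejects the document.
try:
    element_text(f"<r>{raw}</r>")
except ET.ParseError as err:
    print("not well-formed:", err)
```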

In practice, I never encode ' or " unless

it appears within the value of an attribute (XML or HTML)
it appears within the text of XML tags. (<tag>&quot;Yoinks!&quot;, he said.</tag>)

Both specifications also agree with this.
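
For HTML output, this text-versus-attribute distinction is roughly what the quote parameter of Python's html.escape controls. The sketch below is one possible arrangement rather than a recommendation from either specification; the wrapper names escape_text and escape_attr are invented for the example.

```python
import html


def escape_text(s: str) -> str:
    """Escape for element content: &, < and > only; quotes are left alone."""
    return html.escape(s, quote=False)


def escape_attr(s: str) -> str:
    """Escape for a quoted attribute value: also escapes " and '."""
    return html.escape(s, quote=True)


line = '"Yoinks!", he said & left <quickly>'
print(escape_text(line))
print(escape_attr(line))
# html.unescape reverses either form.
print(html.unescape(escape_attr(line)) == line)
```

html.escape writes the apostrophe as the numeric reference &#x27; rather than as a named entity.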

So, the only point of contention is &nbsp; (the no-break space). The only mention of it in either specification is when serialization is attempted. When not serializing, you should always use a literal space. Unless you are writing your own parser, I don't see the need to be doing any kind of serialization, so this is beside the point.
