What are invalid characters in XML
Notice: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): Stack Overflow.
Original URL: http://stackoverflow.com/questions/730133/
Asked by RailsSon
I am working with some XML that holds strings like:
<node>This is a string</node>
Some of the strings that I am passing to the nodes will have characters like &, #, $, etc.:
<node>This is a string & so is this</node>
This is not valid due to &.
I cannot wrap these strings in CDATA as they need to be as they are. I tried looking for a list of characters that cannot be put in XML nodes without being in a CDATA.
Can someone point me in the direction of one or provide me with a list of illegal characters?
Accepted answer by Welbog
The only illegal characters are &, < and > (as well as " or ' in attributes).
They're escaped using XML entities; in this case you want &amp; for &.
Really, though, you should use a tool or library that writes XML for you and abstracts this kind of thing away for you so you don't have to worry about it.
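As an illustration of letting a library do the escaping, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The thread itself is not about Python, and the helper name build_node and the label attribute are hypothetical, chosen only for this example.

```python
import xml.etree.ElementTree as ET

def build_node(text: str, label: str) -> str:
    """Serialize <node label="...">text</node>; the library does the escaping."""
    node = ET.Element("node", {"label": label})  # "label" is a made-up attribute
    node.text = text
    return ET.tostring(node, encoding="unicode")

xml_text = build_node("This is a string & so is this", 'a "quoted" <label>')
print(xml_text)

# Parsing the result gives back the original, unescaped strings.
parsed = ET.fromstring(xml_text)
print(parsed.text)
print(parsed.get("label"))
```

The strings are passed in exactly as they are, and the serializer decides which characters need an entity in element content and in attribute values.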
Answer by potame
OK, let's separate the question of the characters that:
- aren't valid at all in any XML document.
- need to be escaped.
The answer provided by @dolmen in "What are invalid characters in XML" is still valid but needs to be updated with the XML 1.1 specification.
1. Invalid characters
The characters described here are all the characters that are allowed to be inserted in an XML document.
1.1. In XML 1.0
- Reference: see XML recommendation 1.0, §2.2 Characters
The global list of allowed characters is:
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
Basically, control characters (other than tab, line feed, and carriage return) and characters outside the Unicode ranges above are not allowed.
This also means that a character reference to such a character, for example &#x3;, is forbidden.
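The Char production above translates directly into a range check. The following is a Python sketch, not part of the original answer; the function names is_xml10_char and first_invalid_xml10_char are hypothetical.

```python
def is_xml10_char(ch: str) -> bool:
    """Return True if ch (a single character) matches the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # up to the start of the surrogate block
        or 0xE000 <= cp <= 0xFFFD      # excludes FFFE and FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )

def first_invalid_xml10_char(text: str) -> int:
    """Return the index of the first disallowed character, or -1 if there is none."""
    for index, ch in enumerate(text):
        if not is_xml10_char(ch):
            return index
    return -1
```

A check like this could run before serialization to report where a disallowed character sits, instead of silently dropping it.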
1.2. In XML 1.1
- Reference: see XML recommendation 1.1, §2.2 Characters, and 1.3 Rationale and list of changes for XML 1.1
The global list of allowed characters is:
[2] Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
[2a] RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]
This revision of the XML recommendation extends the set of allowed characters so that control characters are allowed, and it takes into account a newer revision of the Unicode standard. These characters are still not allowed: NUL (x00), xFFFE, xFFFF...
However, the use of control characters and undefined Unicode char is discouraged.
Note also that not all parsers take this into account, so XML documents containing control characters may be rejected.
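The two XML 1.1 productions above can be sketched the same way in Python. The function names are hypothetical. The sketch only classifies code points; it does not model how a given parser treats restricted characters.

```python
def is_xml11_char(ch: str) -> bool:
    """Return True if ch matches the XML 1.1 Char production [2]."""
    cp = ord(ch)
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

def is_xml11_restricted_char(ch: str) -> bool:
    """Return True if ch matches the XML 1.1 RestrictedChar production [2a]."""
    cp = ord(ch)
    return (
        0x1 <= cp <= 0x8
        or 0xB <= cp <= 0xC
        or 0xE <= cp <= 0x1F
        or 0x7F <= cp <= 0x84
        or 0x86 <= cp <= 0x9F
    )
```

Tab (x9), line feed (xA), carriage return (xD), and NEL (x85) fall in the gaps of RestrictedChar, which is why they are Char but not restricted.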
2. Characters that need to be escaped (to obtain a well-formed document):
- The < must be escaped with a &lt; entity, since it is assumed to be the beginning of a tag.
- The & must be escaped with a &amp; entity, since it is assumed to be the beginning of an entity reference.
- The > should be escaped with a &gt; entity. It is not mandatory -- it depends on the context -- but it is strongly advised to escape it.
- The ' should be escaped with an &apos; entity -- mandatory in attributes defined within single quotes, but it is strongly advised to always escape it.
- The " should be escaped with a &quot; entity -- mandatory in attributes defined within double quotes, but it is strongly advised to always escape it.
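The five escapes listed above can be applied with Python's standard xml.sax.saxutils.escape, as a rough sketch. By default escape() handles &, < and >; the QUOTE_ENTITIES mapping and the escape_text helper are names made up for this example to cover the two quote characters as well.

```python
from xml.sax.saxutils import escape

# escape() replaces &, < and > on its own; this extra mapping adds the quotes.
QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}

def escape_text(text: str) -> str:
    """Escape all five characters that have predefined XML entities."""
    return escape(text, QUOTE_ENTITIES)

print(escape_text('Tom & "Jerry" <3'))
# prints: Tom &amp; &quot;Jerry&quot; &lt;3
```

The ampersand is replaced first, so the entities introduced for the other characters are not escaped a second time.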
Answer by dolmen
The list of valid characters is in the XML specification:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
Answer by mathifonseca
This is C# code that removes XML-invalid characters from a string and returns a new, valid string.
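using System.Text.RegularExpressions;

public static string CleanInvalidXmlChars(string text)
{
    // Valid chars from the XML 1.0 spec:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    // any Unicode character, excluding the surrogate blocks, FFFE, and FFFF.
    // .NET strings are UTF-16 and \u takes exactly four hex digits, so the
    // supplementary range cannot be written as \u10000-\u10FFFF. Well-formed
    // surrogate pairs are kept instead, and only unpaired surrogates are removed.
    string re = @"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\uD800-\uDFFF]"
              + @"|[\uD800-\uDBFF](?![\uDC00-\uDFFF])"
              + @"|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]";
    return Regex.Replace(text, re, "");
}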
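For comparison, the same clean-up can be sketched in Python, where strings are sequences of code points and the supplementary range can be written directly in the character class. This is an illustration rather than part of the original answer; clean_invalid_xml_chars is a hypothetical name.

```python
import re

# Complement of the XML 1.0 Char production. \U takes eight hex digits in
# Python's re syntax, so [#x10000-#x10FFFF] needs no surrogate handling here.
_INVALID_XML10 = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def clean_invalid_xml_chars(text: str) -> str:
    """Return text with every character outside the XML 1.0 Char set removed."""
    return _INVALID_XML10.sub("", text)

print(repr(clean_invalid_xml_chars("a\x00b\x0bc")))
# prints: 'abc'
```

Removing characters silently changes the data, so whether to strip, replace, or reject is a design choice that depends on the application.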
Answer by cgp
The characters with predeclared entities are:
& < > " '
See "What are the special characters in XML?" for more information.
Answer by bvdb
In addition to potame's answer, here is what applies if you do want to escape using a CDATA block.
If you put your text in a CDATA block then you don't need to use escaping. In that case you can use all characters matched by the XML 1.0 Char production quoted in the answers above:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
Note: On top of that, you're not allowed to use the ]]> character sequence, because it would match the end of the CDATA block.
If there are still invalid characters (e.g. control characters), then probably it's better to use some kind of encoding (e.g. base64).
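Both points can be sketched in Python. A common workaround for ]]> (an addition here, not something the answer states) is to split the sequence across two adjacent CDATA sections. The base64 fallback uses the standard base64 module. The helper names to_cdata and to_base64_text are hypothetical.

```python
import base64
import xml.etree.ElementTree as ET

def to_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting each ']]>' across two sections."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

def to_base64_text(data: bytes) -> str:
    """Encode arbitrary bytes as ASCII text that is always legal in XML."""
    return base64.b64encode(data).decode("ascii")

payload = "a ]]> b & <c>"
document = "<node>" + to_cdata(payload) + "</node>"
round_tripped = ET.fromstring(document).text  # adjacent sections are joined

encoded = to_base64_text(b"\x00\x01")  # bytes that XML cannot hold literally
```

Note that to_cdata does not filter characters outside the Char production; a CDATA section cannot contain them either, which is where the base64 route comes in.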
Answer by Alex Vazhev
Another way to remove incorrect XML chars in C# is to use XmlConvert.IsXmlChar (available since .NET Framework 4.0):
using System.Linq; // needed for Where and ToArray

public static string RemoveInvalidXmlChars(string content)
{
    return new string(content.Where(ch => System.Xml.XmlConvert.IsXmlChar(ch)).ToArray());
}
or you may check that all characters are XML-valid:
public static bool CheckValidXmlChars(string content)
{
    return content.All(ch => System.Xml.XmlConvert.IsXmlChar(ch));
}
For example, the vertical tab symbol (\v) is valid UTF-8 but not valid XML 1.0, and many libraries (including libxml2) miss it and silently output invalid XML.
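The same effect can be reproduced with Python's standard xml.etree.ElementTree, shown here as a sketch rather than a statement about every version: the serializer writes the vertical tab unchanged, and the parser then refuses the document it just produced.

```python
import xml.etree.ElementTree as ET

node = ET.Element("node")
node.text = "before\x0bafter"  # \x0b is the vertical tab (\v)

# The serializer only escapes markup characters, so this succeeds.
serialized = ET.tostring(node, encoding="unicode")

# The parser enforces the Char production, so the same document is rejected.
try:
    ET.fromstring(serialized)
    round_trips = True
except ET.ParseError:
    round_trips = False

print(round_trips)
```

This is why validating or cleaning the input before serialization is worth doing even when a library writes the XML.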
Answer by tiands
Another easy way to escape potentially unwanted XML / XHTML chars in C# is:
WebUtility.HtmlEncode(stringWithStrangeChars)
Answer by Kalpesh Popat
"XmlWriter and lower ASCII characters" worked for me:
// No commas inside the character class: a comma there is matched literally
// and would strip commas from the input.
string code = Regex.Replace(item.Code, @"[\u0000-\u0008\u000B\u000C\u000E-\u001F]", "");
Answer by rghome
In summary, valid characters in the text are:
- tab, line-feed and carriage-return.
- all non-control characters are valid except & and <.
- > is not valid if following ]].
Sections 2.2 and 2.4 of the XML specification provide the answer in detail:
Characters
Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646.
Character data
The ampersand character (&) and the left angle bracket (<) must not appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they must be escaped using either numeric character references or the strings " &amp; " and " &lt; " respectively. The right angle bracket (>) may be represented using the string " &gt; ", and must, for compatibility, be escaped using either " &gt; " or a character reference when it appears in the string " ]]> " in content, when that string is not marking the end of a CDATA section.
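The rule for the ampersand can be checked against a real parser. This Python sketch uses the standard xml.etree.ElementTree on the string from the question; the parses helper is a hypothetical name. It covers only the ampersand case, not the ]]> rule.

```python
import xml.etree.ElementTree as ET

def parses(document: str) -> bool:
    """Return True if the parser accepts the document as well-formed."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

raw = "<node>This is a string & so is this</node>"
named = "<node>This is a string &amp; so is this</node>"     # entity reference
numeric = "<node>This is a string &#38; so is this</node>"   # character reference

print(parses(raw), parses(named), parses(numeric))
# prints: False True True
```

Both escaped forms parse back to the same text, with the literal & restored.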


