Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/871963/

Time: 2020-09-06 12:30:12  Source: igfitidea

Special characters in XML files - processing with the DOM API

xml dom special-characters

Asked by user42155

I have a file in XML format (it consists only of the root start and end tags and the children of the root). The text of the child elements contains the ampersand symbol &. A bare & is not allowed in XML if the document is to be well-formed, and when I tried to process the file using the DOM API in Java and an XML parser, I got parsing errors. Therefore, I replaced & with &amp;, and then I processed the file successfully: I had to extract the values of the text elements into separate plain text files.

When I opened these newly created text files, I expected to see &amp;, but there was & instead. Why is this? I stored the text in files without any extension (my original XML file did not have an .xml extension either), and I see just & in the text of the new file no matter how I open it: as a txt file or as an xml file (these are some of the options in my XML editor). What exactly happens? Does Java (?) convert &amp; to & automatically? Or is there some default encoding? Well, &amp; stands for &, so I suppose there is some "invisible" automatic conversion, but I am confused about when and how it happens. Here are examples of my original file and of the extracted file I get after processing the original file with Java:

This is my "negative.review" file in XML format:

<review>
<review_text>
I will not wear it as it is too big &amp; looks funny on me. 
</review_text>
</review>

This is my extracted file "negative_1":

I will not wear it as it is too big & looks funny on me. 

For me it is important to keep the original data as it is (without any conversions or replacements), so I thought I would have to post-process the extracted file "negative_1", converting &amp; back to &. As you can see, it seems I don't have to do this. But I don't understand why :(.

Thank you in advance!

Answered by Tomalak

The reason is simple: the XML file really contains an "&" character.

It is just represented differently (i.e. it is "escaped"), because a real "&" on its own breaks XML files, as you've seen. Read the relevant section in the XML 1.0 spec: "2.4 Character Data and Markup". It's just a few lines, but it explains the issue quite well.

XML is a representation of data (!). Don't think of it as a text file. Example:

You want to store the string "17 < 20" in an XML file. Written literally, you can't, since the "<" is reserved as the opening tag bracket. So this would be invalid (not well-formed):

<xml>17 < 20</xml>

Solution: you escape the special/reserved character, purely to keep the file well-formed:

<xml>17 &lt; 20</xml>

For all practical purposes the above snippet contains the following data (in JSON representation this time):

{
  "xml": "17 < 20"
}
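
The two snippets above can be checked against a real parser. The following is only a minimal sketch, assuming the JDK's built-in DOM parser; the class name EscapeCheck and the helper rootText are made up for illustration and are not part of the original answer.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, used only for this illustration.
class EscapeCheck {
    // Returns the text of the root element, or empty if the input is not well-formed XML.
    static Optional<String> rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return Optional.of(doc.getDocumentElement().getTextContent());
        } catch (SAXException e) {
            // The parser rejected the input, for example because of a bare "<" or "&".
            return Optional.empty();
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Unescaped "<": rejected (the parser may also log an error to stderr).
        System.out.println(rootText("<xml>17 < 20</xml>"));    // Optional.empty
        // Escaped "<": accepted, and the DOM holds the real character.
        System.out.println(rootText("<xml>17 &lt; 20</xml>")); // Optional[17 < 20]
    }
}
```

The escaped form only exists in the file; once parsed, the DOM holds the plain string "17 < 20".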

This is why you see the real "&" in your post-processing. It had been escaped in just the same way, but its meaning stayed the same all the time.

The above example also explains why the "&" must be treated specially: it is itself part of the XML escaping mechanism. It marks the start of an escape sequence, like in "&lt;". Therefore it must be escaped itself (with "&amp;", like you've done).
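
Because "&" only introduces an escape sequence, the same character can be spelled in more than one way in the file and still mean the same data. Here is a small sketch of that idea using a shortened version of the review from the question; the class AmpersandForms and the method reviewText are hypothetical names, and the numeric form &#38; is added here for comparison (it does not appear in the question).

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, used only for this illustration.
class AmpersandForms {
    // Parses a review document and returns the text of its first <review_text> element.
    static String reviewText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName("review_text").item(0).getTextContent().trim();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse review", e);
        }
    }

    public static void main(String[] args) {
        // Named entity reference, as in the question's file.
        String named = "<review><review_text>too big &amp; looks funny</review_text></review>";
        // Numeric character reference for the same character (assumption: not in the original file).
        String numeric = "<review><review_text>too big &#38; looks funny</review_text></review>";

        System.out.println(reviewText(named));                             // too big & looks funny
        System.out.println(reviewText(named).equals(reviewText(numeric))); // true
    }
}
```

Whatever is written to the plain text file afterwards is this already-decoded string, which is why no extra conversion step is needed.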

Answered by Alex Martelli

Any XML parser will implicitly translate entities such as &amp;, &lt;, and &gt; into the corresponding characters as part of the process of parsing the file.
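
The translation also runs in the other direction when a document is written back out as XML. The following is a rough sketch, assuming the JDK's default identity Transformer is used as the serializer; the class RoundTrip and its methods are invented for this example, and the exact output formatting may differ between XML implementations.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, used only for this illustration.
class RoundTrip {
    // Parses the XML string into a DOM tree; entities are decoded during this step.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    // Serializes the DOM tree back to XML text; special characters are escaped again here.
    static String serialize(Document doc) {
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("could not serialize document", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<review_text>too big &amp; looks funny</review_text>");
        // In memory the text is the real character.
        System.out.println(doc.getDocumentElement().getTextContent()); // too big & looks funny
        // Written out as XML, the ampersand is escaped once more.
        System.out.println(serialize(doc));
    }
}
```

So the escaping belongs to the XML file format only: it is removed on parsing and reapplied on XML output, while plain text output keeps the raw character.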
