Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/3838316/

Date: 2020-08-14 05:40:40

XML parsing issue with '&' in element text

Tags: java, xml, parsing

Asked by Chris Knight

I have the following code:

import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse(new InputSource(new StringReader(inputXml)));

And the parse step is throwing:

SAXParseException: The entity name must immediately follow 
                   the '&' in the entity reference

due to the following '&' in my inputXml:

<Line1>Day & Night</Line1>

I'm not in control of the inbound XML. How can I safely/correctly parse this?

Accepted answer by Andrzej Doyle

Quite simply, the input "XML" is not valid XML. The entity should be encoded, i.e.:

<Line1>Day &amp; Night</Line1>
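
To illustrate the point, here is a minimal sketch showing that the encoded form parses cleanly and that the parser hands back the decoded text, while the raw form is rejected. The class name EncodedAmpersandDemo and the helper rootText are made up for this example, and wrapping the parser's checked exceptions in unchecked ones is just a convenience for the demo.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of the original question or answers.
class EncodedAmpersandDemo {

    // Parses the XML string and returns the text content of its root element.
    // Input that is not well-formed is reported as an IllegalArgumentException.
    static String rootText(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document document = builder.parse(new InputSource(new StringReader(xml)));
            return document.getDocumentElement().getTextContent();
        } catch (SAXException e) {
            throw new IllegalArgumentException("Not well-formed XML: " + e.getMessage(), e);
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The encoded ampersand is decoded by the parser.
        System.out.println(rootText("<Line1>Day &amp; Night</Line1>")); // prints "Day & Night"

        // The raw ampersand makes the document ill-formed, so parsing fails.
        try {
            rootText("<Line1>Day & Night</Line1>");
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```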

Basically, there's no "proper" way to fix this other than telling the XML supplier that they're giving you garbage and getting them to fix it. If you're in some horrible situation where you've just got to deal with it, then the approach you take will likely depend on what range of values you're expected to receive.

If there are no entities in the document at all, a regex replace of & with &amp; before processing would do the trick. But if they're sending some entities correctly, you'd need to exclude these from the matching. And on the rare chance that they actually wanted to send the entity code (i.e. sent &amp; but meant &amp;amp;) you're going to be completely out of luck.
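
One way to exclude correctly sent entities from the match is a negative lookahead. The sketch below is only an illustration of that idea, not a general repair tool: the class name AmpersandEscaper and the method escapeBareAmpersands are hypothetical, and the pattern uses a simplified notion of an entity name (ASCII letters and digits only), which is an assumption rather than the full XML name grammar.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class AmpersandEscaper {

    // Matches an '&' that does NOT begin something shaped like a reference:
    // a named one (&name;), a decimal one (&#38;) or a hex one (&#x26;).
    // Assumption: names are ASCII letters/digits only, which is narrower than XML allows.
    private static final Pattern BARE_AMPERSAND =
            Pattern.compile("&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    // Replaces every bare '&' with "&amp;" and leaves existing references untouched.
    static String escapeBareAmpersands(String xml) {
        return BARE_AMPERSAND.matcher(xml).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        System.out.println(escapeBareAmpersands("<Line1>Day & Night</Line1>"));
        System.out.println(escapeBareAmpersands("<Line1>Day &amp; Night &#38; AT&T</Line1>"));
    }
}
```

This still guesses at the sender's intent. Text such as "a&b;c" looks like an entity and is left alone, a name the parser does not know (for example &nbsp; without a DTD) passes the regex but still fails to parse, and ampersands inside CDATA sections or comments would be escaped even though they were legal.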

But hey - it's the supplier's fault anyway, and if your attempt to fix up invalid input isn't exactly what they wanted, there's a simple thing they can do to address that. :-)

Answer by Flynn1179

Your input XML isn't valid XML; unfortunately you can't realistically use an XML parser to parse this.

You'll need to pre-process the text before passing it to an XML parser. You can do a string replace, replacing '& ' with '&amp; ', but this isn't going to catch every occurrence of & in the input; you may be able to come up with something that does.

Answer by Denis Tulskiy

Is inputXML a string? Then use this:

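// Escape every '&' that is followed by whitespace; the lookahead leaves the whitespace in place.
inputXML = inputXML.replaceAll("&(?=\\s)", "&amp;");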

Answer by Ivan Drizhiruk

I ran the input through the Tidy framework before XML parsing:

final StringWriter errorMessages = new StringWriter();
final String res = new TidyChecker().doCheck(html, errorMessages);
...
DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = db.parse(new InputSource(new StringReader(addRoot(html))));  
...

And then everything parsed OK.
