Java: How to parse invalid (bad / not well-formed) XML?

Question

Asked by jvhashe

Currently, I'm working on a feature that involves parsing XML that we receive from another product. I decided to run some tests against some actual customer data, and it looks like the other product is allowing input from users that should be considered invalid. Anyway, I still have to try to figure out a way to parse it. We're using javax.xml.parsers.DocumentBuilder and I'm getting an error on input that looks like the following.

<xml>
  ...
  <description>Example:Description:<THIS-IS-PART-OF-DESCRIPTION></description>
  ...
</xml>

As you can tell, the description has what appears to be an invalid tag inside of it (<THIS-IS-PART-OF-DESCRIPTION>). Now, this description tag is known to be a leaf tag and shouldn't have any nested tags inside of it. Regardless, this is still an issue and yields an exception on DocumentBuilder.parse(...)

I know this is invalid XML, but it's predictably invalid. Any ideas on a way to parse such input?

Answer 1

Answered by kjhughes

That "XML" is worse than invalid: it's not well-formed; see Well Formed vs Valid XML.

An informal assessment of the predictability of the transgressions does not help. That textual data is not XML. No conformant XML tools or libraries can help you process it.

Options, most desirable first:

1. Have the provider fix the problem on their end. Demand well-formed XML. (Technically the phrase "well-formed XML" is redundant but may be useful for emphasis.)
2. Use a tolerant markup parser to clean up the problem ahead of parsing as XML:
- Standalone: xmlstarlet has robust recovering and repair capabilities (credit: RomanPerekhrest)
```
xmlstarlet fo -o -R -H -D bad.xml 2>/dev/null
```
- Standalone and C/C++: HTML Tidy works with XML too. Taggle is a port of TagSoup to C++.
- Python: Beautiful Soup is Python-based. See notes in the "Differences between parsers" section. See also answers to this question for more suggestions for dealing with not-well-formed markup in Python. See also this answer for how to use codecs.EncodedFile() to clean up illegal characters.
- Java: TagSoup and JSoup focus on HTML. FilterInputStream can be used for preprocessing cleanup.
- .NET:
  - XmlReaderSettings.CheckCharacters can be disabled to get past illegal XML character problems.
  - @jdweng notes that XmlReaderSettings.ConformanceLevel can be set to ConformanceLevel.Fragment so that XmlReader can read XML Well-Formed Parsed Entities lacking a root element.
  - @jdweng also reports that XmlReader.ReadToFollowing() can sometimes be used to work around XML syntactical issues, but note the rule-breaking warning in #3 below.
  - Microsoft.Language.Xml.XMLParser is said to be "error-tolerant".
- PHP: See DOMDocument::$recover and libxml_use_internal_errors(true). See nice example here.
- Ruby: Nokogiri supports "Gentle Well-Formedness".
- R: See htmlTreeParse() for fault-tolerant markup parsing in R.
- Perl: See XML::Liberal, a "super liberal XML parser that parses broken XML."
3. Process the data as text manually using a text editor or programmatically using character/string functions. Doing this programmatically can range from tricky to impossible, as what appears to be predictable often is not: rule breaking is rarely bound by rules.
- For invalid character errors, use regex to remove/replace invalid characters:
  - PHP: preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $s);
  - Ruby: string.tr("^\u{0009}\u{000a}\u{000d}\u{0020}-\u{D7FF}\u{E000}-\u{FFFD}", ' ')
  - JavaScript: inputStr.replace(/[^\x09\x0A\x0D\x20-\xFF\x85\xA0-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD]/gm, '')
- For ampersands, use regex to replace matches with &amp;: (credit: blhsin, demo)
```
&(?!(?:#\d+|#x[0-9a-f]+|\w+);)
```
Note that the above regular expressions won't take comments or CDATA sections into account.
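
Since the question is about Java, here is a minimal sketch of the two text-level cleanups above using java.util.regex. The class and method names (XmlTextCleaner, escapeBareAmpersands, stripInvalidXmlChars) are made up for illustration, and the case-insensitive flag and the supplementary-plane range are assumptions added here, not part of the patterns quoted above. Like those patterns, it does not take comments or CDATA sections into account.

```java
import java.util.regex.Pattern;

final class XmlTextCleaner {
    // An '&' that does not begin a decimal, hex, or named entity reference.
    // CASE_INSENSITIVE is an assumption so that "&#X26;" and "&#x2F;" also count.
    private static final Pattern BARE_AMPERSAND =
            Pattern.compile("&(?!(?:#\\d+|#x[0-9a-f]+|\\w+);)", Pattern.CASE_INSENSITIVE);

    // Anything outside the XML 1.0 Char production (tab, LF, CR, and the
    // allowed Unicode ranges). Java regex matches by code point, so lone
    // surrogates are removed while valid surrogate pairs are kept.
    private static final Pattern INVALID_XML_CHARS = Pattern.compile(
            "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    static String escapeBareAmpersands(String text) {
        return BARE_AMPERSAND.matcher(text).replaceAll("&amp;");
    }

    static String stripInvalidXmlChars(String text) {
        return INVALID_XML_CHARS.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Existing entity references are left alone; only the bare '&' is escaped.
        System.out.println(escapeBareAmpersands("<a>Tom & Jerry &lt; &#38; &#x26;</a>"));
        // The control character U+0001 is not allowed in XML 1.0 and is dropped.
        System.out.println(stripInvalidXmlChars("a\u0001b"));
    }
}
```

Run the cleanup on the raw text first, then hand the result to DocumentBuilder as usual.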

Answer 2

Answered by Jim Garrison

A standard XML parser will NEVER accept invalid XML, by design.

Your only option is to pre-process the input to remove the "predictably invalid" content, or wrap it in CDATA, prior to parsing it.
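
As a rough sketch of the CDATA approach, assuming description really is a leaf element whose closing tag never appears inside its own text: the class name DescriptionCdataWrapper and its methods are hypothetical, and the root element is called items here purely as a placeholder. Note that wrapping in CDATA also turns any entity references already present in the description (such as &lt;) into literal text, so this only fits input where that is acceptable.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class DescriptionCdataWrapper {
    // Body of each <description> element (non-greedy, may span lines).
    private static final Pattern DESCRIPTION =
            Pattern.compile("<description>(.*?)</description>", Pattern.DOTALL);

    // Rewrites every description body as a CDATA section.
    static String wrapDescriptions(String rawXml) {
        Matcher m = DESCRIPTION.matcher(rawXml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // "]]>" may not appear inside CDATA, so split it across two sections.
            String body = m.group(1).replace("]]>", "]]]]><![CDATA[>");
            String wrapped = "<description><![CDATA[" + body + "]]></description>";
            // quoteReplacement keeps '$' and '\' in the body from being interpreted.
            m.appendReplacement(out, Matcher.quoteReplacement(wrapped));
        }
        m.appendTail(out);
        return out.toString();
    }

    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String raw = "<items><description>Example:Description:"
                + "<THIS-IS-PART-OF-DESCRIPTION></description></items>";
        Document doc = parse(wrapDescriptions(raw));
        // The bogus tag is now plain text inside the description element.
        System.out.println(doc.getElementsByTagName("description").item(0).getTextContent());
    }
}
```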

Answer 3

Answered by Benj

IMO these cases should be solved by using JSoup.

Below is not really an answer to this specific case, but I found this on the web (thanks to inuyasha82 on Coderwall). This bit of code inspired me on another similar problem while dealing with malformed XML, so I share it here.

Please do not edit what is below, as it is as it on the original website.

To be valid, an XML document requires a unique root element. For example, a valid XML document is:

<root>
     <element>...</element>
     <element>...</element>
</root>

But if you have a document like:

<element>...</element>
<element>...</element>
<element>...</element>
<element>...</element>

This will be considered malformed XML, so many XML parsers just throw an exception complaining about the missing root element.

In this example there is a solution for that problem that successfully parses the malformed XML above.

Basically what we will do is to add programmatically a root element.

So first of all you have to open the resource that contains your "malformed" XML (e.g. a file):

File file = new File(pathtofile);

Then open a FileInputStream:

FileInputStream fis = new FileInputStream(file);

If we try to parse this stream with any XML library at that point we will raise the malformed document Exception.

Now we create a list of InputStream objects with three elements:

- A ByteArrayInputStream that contains the string "<root>"
- Our FileInputStream
- A ByteArrayInputStream with the string "</root>"

So the code is:

List<InputStream> streams =
    Arrays.asList(
        new ByteArrayInputStream("<root>".getBytes()),
        fis,
        new ByteArrayInputStream("</root>".getBytes()));

Now using a SequenceInputStream, we create a container for the List created above:

InputStream cntr =
    new SequenceInputStream(Collections.enumeration(streams));

Now we can use any XML parser library on cntr, and it will be parsed without any problem (checked with the StAX library).
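
Put together, the steps above might look like the following self-contained sketch. The class name RootWrapper and the method parseFragments are hypothetical, the input is an in-memory stream rather than a file so the example runs on its own, and DocumentBuilder is used in place of StAX. The explicit UTF-8 charset is also an assumption. One caveat: this only works if the rootless input does not start with an XML declaration (<?xml ...?>), because a declaration is only allowed at the very start of a document.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class RootWrapper {
    // Surrounds a stream of sibling elements with a synthetic <root> element
    // and parses the concatenated stream as one document.
    static Document parseFragments(InputStream fragments) throws Exception {
        List<InputStream> streams = Arrays.asList(
                new ByteArrayInputStream("<root>".getBytes(StandardCharsets.UTF_8)),
                fragments,
                new ByteArrayInputStream("</root>".getBytes(StandardCharsets.UTF_8)));
        // Closing the SequenceInputStream also closes the wrapped streams.
        try (InputStream wrapped = new SequenceInputStream(Collections.enumeration(streams))) {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(wrapped);
        }
    }

    public static void main(String[] args) throws Exception {
        String rootless = "<element>a</element>\n<element>b</element>\n";
        Document doc = parseFragments(
                new ByteArrayInputStream(rootless.getBytes(StandardCharsets.UTF_8)));
        // Both sibling elements are now children of the synthetic root.
        System.out.println(doc.getElementsByTagName("element").getLength());
    }
}
```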

Answer 4

Answered by imhotap

The accepted answer is good advice, and contains very useful links.

I'd like to add that this, and many other cases of not-well-formed and/or DTD-invalid XML, can be repaired using SGML, the ISO-standardized superset of HTML and XML. In your case, what works is to declare the bogus THIS-IS-PART-OF-DESCRIPTION element as an SGML empty element and then use e.g. the osx program (part of the OpenSP/OpenJade SGML package) to convert it to XML. For example, if you supply the following to osx

<!DOCTYPE xml [
  <!ELEMENT xml - - ANY>
  <!ELEMENT description - - ANY>
  <!ELEMENT THIS-IS-PART-OF-DESCRIPTION -  - EMPTY>
]>
<xml>
  <description>blah blah
    <THIS-IS-PART-OF-DESCRIPTION>
  </description>
</xml>

it will output well-formed XML for further processing with the XML tools of your choice.

Note, however, that your example snippet has another problem in that element names starting with the letters xml or XML or Xml etc. are reserved in XML, and won't be accepted by conforming XML parsers.
