Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/44765194/

How to parse invalid (bad / not well-formed) XML?

java xml xml-parsing xml-validation

Asked by jvhashe

Currently, I'm working on a feature that involves parsing XML that we receive from another product. I decided to run some tests against some actual customer data, and it looks like the other product is allowing input from users that should be considered invalid. Anyway, I still have to try to figure out a way to parse it. We're using javax.xml.parsers.DocumentBuilder and I'm getting an error on input that looks like the following.

<xml>
  ...
  <description>Example:Description:<THIS-IS-PART-OF-DESCRIPTION></description>
  ...
</xml>

As you can tell, the description has what appears to be an invalid tag inside of it (<THIS-IS-PART-OF-DESCRIPTION>). Now, this description tag is known to be a leaf tag and shouldn't have any nested tags inside of it. Regardless, this is still an issue and yields an exception on DocumentBuilder.parse(...).

I know this is invalid XML, but it's predictably invalid. Any ideas on a way to parse such input?

Answered by kjhughes

That "XML" is worse than invalid: it's not well-formed; see Well Formed vs Valid XML.

An informal assessment of the predictability of the transgressions does not help. That textual data is not XML. No conformant XML tools or libraries can help you process it.

Options, most desirable first:

  1. Have the provider fix the problem on their end. Demand well-formed XML. (Technically the phrase "well-formed XML" is redundant but may be useful for emphasis.)
  2. Use a tolerant markup parser to clean up the problem ahead of parsing as XML.
  3. Process the data as text, manually using a text editor or programmatically using character/string functions. Doing this programmatically can range from tricky to impossible, as what appears to be predictable often is not: rule breaking is rarely bound by rules.

    • For invalid character errors, use regex to remove/replace invalid characters:
      • PHP: preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $s);
      • Ruby: string.tr("^\u{0009}\u{000a}\u{000d}\u{0020}-\u{D7FF}\u{E000}-\u{FFFD}", ' ')
      • JavaScript: inputStr.replace(/[^\x09\x0A\x0D\x20-\xFF\x85\xA0-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD]/gm, '')
    • For ampersands, use regex to replace matches with &amp; (credit: blhsin, demo):

      &(?!(?:#\d+|#x[0-9a-f]+|\w+);)

    Note that the above regular expressions won't take comments or CDATA sections into account.
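
Since the question is about Java, here is a rough Java counterpart of the two regex clean-ups above. It is only a sketch: the class and method names (XmlTextCleanup, replaceInvalidChars, escapeBareAmpersands) are made up for illustration, the character class follows the XML 1.0 Char production (so, unlike the expressions above, it also keeps supplementary-plane characters), and the ampersand pattern additionally accepts uppercase hex digits. Like the originals, it does not take comments or CDATA sections into account.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative and not part of any library.
final class XmlTextCleanup {

    // Matches every code point outside the XML 1.0 Char production:
    // tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
    private static final Pattern INVALID_XML_CHARS = Pattern.compile(
            "[^\\x{9}\\x{A}\\x{D}\\x{20}-\\x{D7FF}\\x{E000}-\\x{FFFD}\\x{10000}-\\x{10FFFF}]");

    // Matches an '&' that does not start a character or entity reference.
    private static final Pattern BARE_AMPERSAND = Pattern.compile(
            "&(?!(?:#\\d+|#x[0-9a-fA-F]+|\\w+);)");

    private XmlTextCleanup() {
    }

    // Replaces each character that XML 1.0 forbids with a single space.
    static String replaceInvalidChars(String s) {
        return INVALID_XML_CHARS.matcher(s).replaceAll(" ");
    }

    // Escapes stray ampersands and leaves existing references untouched.
    static String escapeBareAmpersands(String s) {
        return BARE_AMPERSAND.matcher(s).replaceAll("&amp;");
    }
}
```

Both methods work on the raw text before it is handed to the XML parser, for example XmlTextCleanup.escapeBareAmpersands(XmlTextCleanup.replaceInvalidChars(rawText)).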

Answered by Jim Garrison

A standard XML parser will NEVER accept invalid XML, by design.

Your only option is to pre-process the input to remove the "predictably invalid" content, or wrap it in CDATA, prior to parsing it.
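
As a rough illustration of the CDATA approach for the input in the question, the sketch below wraps the text of every description element in a CDATA section before handing the result to DocumentBuilder. It assumes that description is a leaf element written exactly as <description> with no attributes, and that its text never contains the literal string </description>. The class and method names are invented for this example. Be aware that entity references inside the description (such as &amp;) become literal text once they sit inside CDATA.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical pre-processor; assumes <description> is an attribute-free leaf element.
final class DescriptionCdataWrapper {

    // DOTALL lets the description text span several lines.
    private static final Pattern DESCRIPTION =
            Pattern.compile("<description>(.*?)</description>", Pattern.DOTALL);

    private DescriptionCdataWrapper() {
    }

    // Wraps the raw text of each description element in a CDATA section.
    static String wrapDescriptions(String raw) {
        Matcher m = DESCRIPTION.matcher(raw);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // "]]>" would end the CDATA section early, so split it across two sections.
            String text = m.group(1).replace("]]>", "]]]]><![CDATA[>");
            m.appendReplacement(out, Matcher.quoteReplacement(
                    "<description><![CDATA[" + text + "]]></description>"));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Pre-processes the text, then parses it with the standard DOM parser.
    static Document parse(String raw) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wrapDescriptions(raw))));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is still not well-formed XML", e);
        }
    }
}
```

After parsing, getTextContent() on the description element returns the original text, including the bogus tag as plain characters.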

Answered by Benj

IMO these cases should be solved by using JSoup.

Below is not really an answer for this specific case, but I found it on the web (thanks to inuyasha82 on Coderwall). This bit of code inspired me on another, similar problem while dealing with malformed XML, so I share it here.

Please do not edit what is below, as it appears here exactly as on the original website.

To be well-formed, an XML document requires a single root element. So, for example, a well-formed XML document is:

<root>
     <element>...</element>
     <element>...</element>
</root>

But if you have a document like:

<element>...</element>
<element>...</element>
<element>...</element>
<element>...</element>

This will be considered malformed XML, so many XML parsers just throw an exception complaining about the missing root element.

This example shows how to solve that problem and successfully parse the malformed XML above.

Basically, what we will do is programmatically add a root element.

So, first of all, you have to open the resource that contains your "malformed" XML (for example, a file):

File file = new File(pathtofile);

Then open a FileInputStream:

FileInputStream fis = new FileInputStream(file);

If we try to parse this stream with any XML library at this point, it will throw a malformed-document exception.

Now we create a list of InputStream objects with three elements:

  1. A ByteArrayInputStream that contains the string "<root>"
  2. Our FileInputStream
  3. A ByteArrayInputStream that contains the string "</root>"

So the code is:

List<InputStream> streams =
    Arrays.asList(
        new ByteArrayInputStream("<root>".getBytes()),
        fis,
        new ByteArrayInputStream("</root>".getBytes()));

Now using a SequenceInputStream, we create a container for the List created above:

InputStream cntr =
    new SequenceInputStream(Collections.enumeration(streams));

Now we can use any XML parser library on cntr, and it will be parsed without any problem. (Checked with the StAX library.)
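
Put together, the steps above might look like the following self-contained sketch. The class and method names are hypothetical, and it parses with DocumentBuilder (the parser used in the question) rather than StAX. It assumes the fragment is UTF-8 compatible and starts with neither an XML declaration nor a byte order mark, since either one would end up after the synthetic <root> tag and break the parse.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper that gives a rootless XML fragment a synthetic root element.
final class RootWrapper {

    private RootWrapper() {
    }

    // Returns a stream that yields "<root>", then the fragment, then "</root>".
    static InputStream wrapInRoot(InputStream fragment) {
        List<InputStream> streams = Arrays.asList(
                new ByteArrayInputStream("<root>".getBytes(StandardCharsets.UTF_8)),
                fragment,
                new ByteArrayInputStream("</root>".getBytes(StandardCharsets.UTF_8)));
        return new SequenceInputStream(Collections.enumeration(streams));
    }

    // Parses the wrapped stream; closing it also closes the fragment stream.
    static Document parseFragment(InputStream fragment) {
        try (InputStream wrapped = wrapInRoot(fragment)) {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(wrapped);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException(
                    "Fragment could not be parsed even with a synthetic root", e);
        }
    }
}
```

With the file from the earlier steps, the call would be RootWrapper.parseFragment(new FileInputStream(file)).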

Answered by imhotap

The accepted answer is good advice, and contains very useful links.

I'd like to add that this, and many other cases of not-well-formed and/or DTD-invalid XML, can be repaired using SGML, the ISO-standardized superset of HTML and XML. In your case, what works is to declare the bogus THIS-IS-PART-OF-DESCRIPTION element as an SGML empty element and then use, e.g., the osx program (part of the OpenSP/OpenJade SGML package) to convert it to XML. For example, if you supply the following to osx:

<!DOCTYPE xml [
  <!ELEMENT xml - - ANY>
  <!ELEMENT description - - ANY>
  <!ELEMENT THIS-IS-PART-OF-DESCRIPTION -  - EMPTY>
]>
<xml>
  <description>blah blah
    <THIS-IS-PART-OF-DESCRIPTION>
  </description>
</xml>

it will output well-formed XML for further processing with the XML tools of your choice.

Note, however, that your example snippet has another problem in that element names starting with the letters xml, XML, Xml, etc. are reserved in XML, and won't be accepted by conforming XML parsers.