How to tell Java SAX Parser to ignore invalid character references?

Note: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/2997255/


Tags: java, xml, error-handling, sax

Asked by Epaga

When trying to parse incorrect XML with a character reference such as &#x1, Java's SAX Parser dies a horrible death with a fatal error such as

    org.xml.sax.SAXParseException: Character reference "&#x1"
                                   is an invalid XML character.

Is there any way around this? Will I have to clean up the XML file before I hand it off to the SAX Parser? If so, is there an elegant way of going about this?

Accepted answer by wowest

Use XML 1.1! skaffman is completely right, but you can just stick <?xml version="1.1"?> at the top of your files and you'll be in good shape. If you're dealing with streams, write a wrapper that rewrites or adds that XML declaration.
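
A rough sketch of such a wrapper, assuming the document fits in a String and has no byte order mark. XML 1.1 allows character references to the control characters U+0001 through U+001F, which is why the version switch helps. The class and method names (Xml11Declaration, forceXml11, textContent) are made up for illustration, and it is worth checking that your parser actually supports XML 1.1; the one bundled with the JDK is expected to.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class Xml11Declaration {
    // An XML declaration at the very start of the document.
    private static final Pattern XML_DECL = Pattern.compile("^<\\?xml\\s[^?]*\\?>");
    // The version pseudo-attribute inside that declaration.
    private static final Pattern VERSION =
            Pattern.compile("version\\s*=\\s*(\"[^\"]*\"|'[^']*')");

    // Hypothetical helper: returns the document with its declaration
    // rewritten to version 1.1, or with a 1.1 declaration prepended.
    static String forceXml11(String xml) {
        Matcher decl = XML_DECL.matcher(xml);
        if (!decl.find()) {
            return "<?xml version=\"1.1\"?>" + xml;
        }
        String fixed = VERSION.matcher(decl.group()).replaceFirst("version=\"1.1\"");
        return fixed + xml.substring(decl.end());
    }

    // Parses the document with SAX and collects all character data.
    static String textContent(String xml) throws Exception {
        StringBuilder text = new StringBuilder();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void characters(char[] ch, int start, int length) {
                        text.append(ch, start, length);
                    }
                });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String broken = "<data>a&#x1;b</data>"; // fatal error as XML 1.0
        String text = textContent(forceXml11(broken));
        // Print code points so the control character is visible.
        text.chars().forEach(c -> System.out.printf("U+%04X%n", c));
    }
}
```

Note that U+0000 stays illegal even in XML 1.1, so a reference such as &#x0; still needs to be cleaned out.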

Answer by skaffman

You're going to have to clean up your XML, I'm afraid. Such characters are invalid according to the XML spec, and no amount of persuasion is going to convince the parser otherwise.

Valid XML characters for XML 1.0:

  • U+0009
  • U+000A
  • U+000D
  • U+0020 to U+D7FF
  • U+E000 to U+FFFD
  • U+10000 to U+10FFFF

In order to clean up, you'll have to pass the data through a lower-level processor, one that treats it as a Unicode character stream and removes the characters that are invalid.
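
One way such a cleanup pass might look, as a sketch only. It works on a String rather than a true stream to keep it short, and it also drops numeric character references (like the &#x1 from the question) that point at characters outside the XML 1.0 ranges listed above, since a raw-character filter alone would leave those in place. Xml10Cleaner and its methods are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Xml10Cleaner {
    // Numeric character references: decimal (&#1;) or hexadecimal (&#x1;).
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(x[0-9a-fA-F]+|[0-9]+);");

    // True if the code point is in one of the XML 1.0 character ranges.
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Removes raw characters that XML 1.0 does not allow.
    static String stripInvalidChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints()
                .filter(Xml10Cleaner::isValidXml10)
                .forEach(out::appendCodePoint);
        return out.toString();
    }

    // Removes numeric character references to disallowed characters.
    static String stripInvalidCharRefs(String text) {
        Matcher m = CHAR_REF.matcher(text);
        StringBuilder out = new StringBuilder(text.length());
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start());
            if (isValidXml10(parseCodePoint(m.group(1)))) {
                out.append(m.group()); // keep legal references untouched
            }
            last = m.end();
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    private static int parseCodePoint(String digits) {
        try {
            return digits.startsWith("x")
                    ? Integer.parseInt(digits.substring(1), 16)
                    : Integer.parseInt(digits);
        } catch (NumberFormatException e) {
            return -1; // too large for an int, so not a valid character
        }
    }

    static String clean(String text) {
        return stripInvalidCharRefs(stripInvalidChars(text));
    }

    public static void main(String[] args) {
        System.out.println(clean("<data>a&#x1;b&#x41;</data>"));
    }
}
```

This sketch is not XML-aware: it would also remove text that merely looks like a reference inside a CDATA section or a comment. For large inputs the same checks could be moved into a java.io.FilterReader, at the cost of handling references that straddle buffer boundaries.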

Answer by ZZ Coder

This is invalid XML so no parser should parse it without error.

But you do encounter such hand-crafted invalid XML in the real world. My solution is to manually insert CDATA markers into the data. For example,

  <data><![CDATA[ garbage with &invalid characters ]]></data>

Of course, you will get the data back as is and you have to deal with the invalid characters yourself.
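
If the markers are added by code rather than by hand, a helper along these lines might do. It is a sketch with made-up names (CdataWrapper, wrapInCdata), and it assumes you already know which piece of text needs wrapping. Inside CDATA a reference such as &#x1; is no longer interpreted, so it comes back as literal text, but a raw control character is still illegal there in XML 1.0.

```java
class CdataWrapper {
    // Hypothetical helper: wraps raw text in a CDATA section.
    static String wrapInCdata(String raw) {
        // "]]>" would end the section early, so split it across two sections.
        return "<![CDATA[" + raw.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    public static void main(String[] args) {
        String raw = " garbage with &invalid characters ";
        System.out.println("<data>" + wrapInCdata(raw) + "</data>");
    }
}
```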
