Notice: this page is adapted from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original: http://stackoverflow.com/questions/5815134/

Date: 2020-10-30 12:54:52  Source: igfitidea

Invalid XML Character During Unmarshall

Tags: java, xml-serialization, jaxb, unmarshalling

Asked by oliverwood

I am marshalling objects to an XML file using the "UTF-8" encoding. The file is generated successfully, but when I try to unmarshal it back, I get an error:

An invalid XML character (Unicode: 0x{2}) was found in the value of attribute "{1}" and element is "0"

The character is 0x1A (\u001a), which is valid in UTF-8 but illegal in XML. The JAXB Marshaller allows this character to be written into the XML file, but the Unmarshaller cannot parse it back. I tried other encodings (UTF-16, ASCII, etc.) but still get the error.

The common solution is to remove or replace this invalid character before XML parsing. But if we need this character back, how can we get the original character after unmarshalling?

While looking for a solution, I want to replace the invalid characters with a substitute character (for example, a dot ".") before unmarshalling.

I have created this class:

public class InvalidXMLCharacterFilterReader extends FilterReader {

    public static final char substitute = '.'; 

    public InvalidXMLCharacterFilterReader(Reader in) {
        super(in);
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {

        int read = super.read(cbuf, off, len);

        if (read == -1)
            return -1;

        for (int readPos = off; readPos < off + read; readPos++) {
            if(!isValid(cbuf[readPos])) {
                cbuf[readPos] = substitute;
            }
        }

        return readPos - off + 1; 
    }

    public boolean isValid(char c) {
        if((c == 0x9)
                || (c == 0xA) 
                || (c == 0xD) 
                || ((c >= 0x20) && (c <= 0xD7FF)) 
                || ((c >= 0xE000) && (c <= 0xFFFD)) 
                || ((c >= 0x10000) && (c <= 0x10FFFF)))
        {
            return true;
        } else
            return false;
    }
 }

Then this is how I read and unmarshall the file:

FileReader fileReader = new FileReader(this.getFile());
Reader reader = new InvalidXMLCharacterFilterReader(fileReader);
Object o = (Object)um.unmarshal(reader);

Somehow the reader does not replace the invalid characters with the character I want. This results in invalid XML data that can't be unmarshalled. Is there something wrong with my InvalidXMLCharacterFilterReader class?
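
One likely culprit in the class above is the return statement of read(char[], int, int): readPos is declared in the for loop and is out of scope at the return, so the class as posted should not compile, and even with the variable hoisted the expression would report one character more than was actually read. The following is a minimal sketch of a corrected filter, not a definitive implementation. The names ReplacingXmlCharFilterReader and filter are hypothetical and exist only for illustration. It also assumes that surrogate code units can be passed through unchanged (a char can never hold a value of 0x10000 or above, so that branch of isValid never matches), and it overrides the single-character read() as well, since FilterReader would otherwise bypass the substitution there.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Illustrative sketch; class and helper names are assumptions, not part of the original post.
class ReplacingXmlCharFilterReader extends FilterReader {

    static final char SUBSTITUTE = '.';

    ReplacingXmlCharFilterReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        // Pass through end-of-stream and valid characters, substitute everything else.
        return (c == -1 || isValid((char) c)) ? c : SUBSTITUTE;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int read = super.read(cbuf, off, len);
        if (read == -1) {
            return -1;
        }
        for (int pos = off; pos < off + read; pos++) {
            if (!isValid(cbuf[pos])) {
                cbuf[pos] = SUBSTITUTE;
            }
        }
        // Substitution is one-for-one, so the count from the underlying reader is returned as-is.
        return read;
    }

    // XML 1.0 character check on a single UTF-16 code unit.
    // Assumption: surrogates are accepted without checking that they form a proper pair.
    static boolean isValid(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || Character.isSurrogate(c)
                || (c >= 0xE000 && c <= 0xFFFD);
    }

    // Hypothetical helper that runs a string through the filter, for demonstration only.
    static String filter(String input) {
        StringBuilder out = new StringBuilder();
        try (Reader r = new ReplacingXmlCharFilterReader(new StringReader(input))) {
            char[] buf = new char[16];
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(filter("a" + (char) 0x1A + "b"));
    }
}
```

With this in place, the unmarshalling code could wrap its FileReader in the corrected reader in the same way as before. Note that the substitution is lossy: the original U+001A cannot be recovered after unmarshalling.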

Answered by JMelnik

I think the main problem is escaping illegal characters during marshalling. Something similar was mentioned here; you could try it out.

It suggests changing the encoding to Unicode: marshaller.setProperty("jaxb.encoding", "Unicode");

Answered by Joachim Sauer

The Unicode character U+001A is illegal in XML 1.0:

The encoding used to represent it does not matter in this case; it's simply not allowed in XML content.
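
To illustrate that point, here is a small sketch that serializes the same XML 1.0 document in UTF-8 and in UTF-16 and feeds both to the JDK's standard DocumentBuilder. The class name EncodingIndependenceDemo and the helper parses are hypothetical, and plain JAXP parsing is used here as a stand-in for the JAXB unmarshaller from the question. The expectation is that a document containing a raw U+001A is rejected in both encodings, while an otherwise identical document without it is accepted.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative sketch; class and method names are assumptions.
class EncodingIndependenceDemo {

    // Encodes an XML 1.0 document with the given text content and reports whether it parses.
    static boolean parses(String text, Charset charset) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>"
                + "<v>" + text + "</v>";
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(xml.getBytes(charset)));
            return true;
        } catch (SAXException e) {
            // The parser reported the document as not well-formed.
            return false;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String bad = "a" + (char) 0x1A + "b";
        System.out.println("UTF-8,  clean text:  " + parses("ab", StandardCharsets.UTF_8));
        System.out.println("UTF-16, clean text:  " + parses("ab", StandardCharsets.UTF_16));
        System.out.println("UTF-8,  with U+001A: " + parses(bad, StandardCharsets.UTF_8));
        System.out.println("UTF-16, with U+001A: " + parses(bad, StandardCharsets.UTF_16));
    }
}
```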

XML 1.1 allows some of the restricted characters (including U+001A) to be included, but they must be present as numeric character references (&#x1a;).
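
The difference between the two XML versions can be sketched with the JDK's standard DocumentBuilder. This assumes the parser bundled with the JDK honours the version="1.1" declaration, and the names Xml11CharRefDemo and parseRootText are hypothetical. The sketch parses the same element content, containing the reference &#x1a;, once as XML 1.0 and once as XML 1.1. It only shows the parsing side; how to make a JAXB Marshaller emit XML 1.1 with such references is not covered here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative sketch; class and method names are assumptions.
class Xml11CharRefDemo {

    // Returns the text content of the root element, or null if the parser rejects the document.
    static String parseRootText(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (SAXException e) {
            return null;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String body = "<v>a&#x1a;b</v>";
        String asXml10 = parseRootText("<?xml version=\"1.0\"?>" + body);
        String asXml11 = parseRootText("<?xml version=\"1.1\"?>" + body);
        System.out.println("XML 1.0 accepted: " + (asXml10 != null));
        System.out.println("XML 1.1 accepted: " + (asXml11 != null));
        if (asXml11 != null) {
            // Print the code point of the middle character recovered from the reference.
            System.out.println("Recovered code point: 0x"
                    + Integer.toHexString(asXml11.charAt(1)));
        }
    }
}
```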

Wikipedia has a nice summary of the situation.