Source: Stack Overflow, original question at http://stackoverflow.com/questions/443305/. This content is provided under the CC BY-SA 4.0 license. You are free to use and share it under the same license, but you must attribute it to the original authors and link to the original question.

Time: 2020-08-11 14:42:06  Source: igfitidea

Producing valid XML with Java and UTF-8 encoding

Tags: java, xml, encoding, utf-8

Asked by Mike Tunnicliffe

I am using JAXP to generate and parse an XML document from which some fields are loaded from a database.

Code to serialize the XML:

DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.newDocument();
Element root = doc.createElement("test");
root.setAttribute("version", text);
doc.appendChild(root);

DOMSource domSource = new DOMSource(doc);
TransformerFactory tFactory = TransformerFactory.newInstance();

FileWriter out = new FileWriter("test.xml");
Transformer transformer = tFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.transform(domSource, new StreamResult(out)); 

Code to parse the XML:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse("test.xml");

And I encounter the following exception:

[Fatal Error] test.xml:1:4: Invalid byte 1 of 1-byte UTF-8 sequence.
Exception in thread "main" org.xml.sax.SAXParseException: Invalid byte 1 of 1-byte UTF-8 sequence.
    at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
    at com.test.Test.xml(Test.java:27)
    at com.test.Test.main(Test.java:55)

The String text includes u-umlaut and o-umlaut (ü and ö, character codes 0xFC and 0xF6). These are the characters that cause the error. When I escape the String myself, replacing ü and ö with numeric character references (such as &#252; and &#246;), the problem goes away. Other entities are automatically encoded when I write out the XML.
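
The parser's complaint can be reproduced without any XML at all. The following is a minimal sketch, not part of the original question: the class name StrictUtf8Check and the helper isValidUtf8 are made up for illustration. It encodes the two characters with a single-byte charset (ISO-8859-1 is assumed here as a stand-in for the platform default) and then asks a strict UTF-8 decoder whether the resulting bytes are well formed.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class StrictUtf8Check {

    // Returns true only if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            // A decoder obtained from newDecoder() reports malformed input
            // by throwing, rather than silently replacing it.
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "\u00FC\u00F6"; // u-umlaut and o-umlaut

        // Single-byte encoding: one byte per character (0xFC, 0xF6).
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        // UTF-8 encoding: two bytes per character.
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        System.out.println("ISO-8859-1 bytes valid as UTF-8: " + isValidUtf8(latin1));
        System.out.println("UTF-8 bytes valid as UTF-8:      " + isValidUtf8(utf8));
    }
}
```

A file that declares UTF-8 but contains the single-byte form is therefore rejected by any parser that decodes strictly.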

How do I get my output to be written / read properly without substituting these characters myself?

(I've read the following questions already:

How to encode characters from Oracle to XML?

Repairing wrong encoding in XML files)

Accepted answer by kdgregory

Use a FileOutputStream rather than a FileWriter.

The latter applies its own encoding, which is almost certainly not UTF-8 (depending on your platform, it's probably Windows-1252 or ISO-8859-1).
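
As a rough sketch of that change (the class name Utf8XmlRoundTrip, the method roundTrip, and the use of a temporary file are assumptions made for this example, not part of the question's code), the serializer below receives a FileOutputStream, so the Transformer itself turns characters into bytes using the encoding set through OutputKeys.ENCODING. The file is then parsed back to check that the attribute survives.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical class, for illustration only.
class Utf8XmlRoundTrip {

    // Writes <test version="..."/> as UTF-8 and reads the attribute back.
    static String roundTrip(String text) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.newDocument();
        Element root = doc.createElement("test");
        root.setAttribute("version", text);
        doc.appendChild(root);

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");

        // A temporary file stands in for "test.xml" from the question.
        File file = File.createTempFile("test", ".xml");
        try {
            // Byte stream, not a Writer: the Transformer does the encoding.
            try (OutputStream out = new FileOutputStream(file)) {
                transformer.transform(new DOMSource(doc), new StreamResult(out));
            }
            Document parsed = builder.parse(file);
            return parsed.getDocumentElement().getAttribute("version");
        } finally {
            file.delete();
        }
    }

    public static void main(String[] args) throws Exception {
        String text = "\u00FC\u00F6"; // u-umlaut and o-umlaut
        System.out.println("Round trip preserved the text: " + text.equals(roundTrip(text)));
    }
}
```

If a Writer is unavoidable, wrapping the stream in an OutputStreamWriter with an explicit UTF-8 charset should give the same bytes, since the charset then no longer depends on the platform default.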

Edit (now that I have some time):

An XML document without a prologue is permitted to be encoded as UTF-8 or UTF-16. With a prologue, it is allowed to specify its encoding (the prologue can contain only US-ASCII characters, so the prologue is always readable).

A Reader deals with characters; it will decode the byte stream of the underlying InputStream. As a result, when you pass a Reader to the parser, you are telling it that you've already handled the encoding, so the parser will ignore the prologue. When you pass an InputStream (which reads bytes), it does not make this assumption, and will look to the prologue to define the encoding -- or default to UTF-8/UTF-16 if it's not there.
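
The difference can be seen with a small in-memory document. This is a sketch under stated assumptions: the class ReaderVersusStream and its methods are invented for the example, and the document is deliberately encoded as ISO-8859-1 with a matching declaration. Parsed from an InputStream, the declaration is honoured. Parsed from a Reader, the result depends entirely on the charset the Reader was given.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical class, for illustration only.
class ReaderVersusStream {

    // A document whose bytes and declared encoding are both ISO-8859-1.
    static byte[] latin1Document() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<test version=\"\u00FC\"/>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Byte input: the parser reads the declaration and decodes accordingly.
    static String parseFromStream(byte[] bytes) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(bytes));
        return doc.getDocumentElement().getAttribute("version");
    }

    // Character input: decoding has already happened in the Reader,
    // using whatever charset the caller chose.
    static String parseFromReader(byte[] bytes, Charset charset) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset);
        Document doc = builder.parse(new InputSource(reader));
        return doc.getDocumentElement().getAttribute("version");
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = latin1Document();
        String expected = "\u00FC";
        System.out.println("InputStream:             "
                + expected.equals(parseFromStream(bytes)));
        System.out.println("Reader (ISO-8859-1):     "
                + expected.equals(parseFromReader(bytes, StandardCharsets.ISO_8859_1)));
        System.out.println("Reader (UTF-8, wrong):   "
                + expected.equals(parseFromReader(bytes, StandardCharsets.UTF_8)));
    }
}
```

Note that the mismatched Reader does not necessarily fail loudly: InputStreamReader substitutes a replacement character for bytes it cannot decode, so the damage may only show up later as corrupted text.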

I've never tried reading a file that is encoded in UTF-16. I suspect that the parser will look for a Byte Order Mark (BOM) as the first 2 bytes of the file.
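
That suspicion can be tried out in memory. The sketch below is an experiment, not a statement about every parser: the class Utf16BomParse and its methods are hypothetical, and it relies on Java's "UTF-16" charset writing a big-endian byte order mark (0xFE 0xFF) when encoding. The document has no prologue, so the parser has only the leading bytes to go on.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical class, for illustration only.
class Utf16BomParse {

    // Encodes a prologue-free document as UTF-16; Java's encoder for this
    // charset prepends a big-endian byte order mark.
    static byte[] utf16Document(String text) {
        String xml = "<test version=\"" + text + "\"/>";
        return xml.getBytes(StandardCharsets.UTF_16);
    }

    // Parses from raw bytes so the parser must detect the encoding itself.
    static String parse(byte[] bytes) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(bytes));
        return doc.getDocumentElement().getAttribute("version");
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = utf16Document("\u00FC\u00F6");
        System.out.printf("First two bytes: %02X %02X%n", bytes[0] & 0xFF, bytes[1] & 0xFF);
        System.out.println("Parsed text matches: " + "\u00FC\u00F6".equals(parse(bytes)));
    }
}
```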

Answer by James Anderson

Well, for sure 0xFC and 0xF6 are not valid UTF-8 characters. These should have been finessed to the two-byte sequences 0xC3 0xBC and 0xC3 0xB6.
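
The UTF-8 byte sequences for these two characters are easy to check directly. A minimal sketch, with the class name Utf8Bytes and the helper hex invented for the example:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical class, for illustration only.
class Utf8Bytes {

    // Returns the UTF-8 encoding of s as an uppercase hex string.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            // Mask to an unsigned value before formatting.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("u-umlaut (U+00FC): " + hex("\u00FC")); // C3BC
        System.out.println("o-umlaut (U+00F6): " + hex("\u00F6")); // C3B6
    }
}
```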

Most likely the problem is that the original source of the characters is declared as UTF-8 when it is not.
