Note: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/3482494/

Time: 2020-08-14 01:11:09  Source: igfitidea

How to let the SAX parser determine the encoding from the XML declaration?

java xml encoding sax xml-parsing

Asked by Allan

I'm trying to parse XML files from different sources (over which I have little control). Most of them are encoded in UTF-8 and don't cause any problems using the following snippet:

SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
FeedHandler handler = new FeedHandler();
InputSource is = new InputSource(getInputStream());
parser.parse(is, handler);

Since SAX defaults to UTF-8, this is fine. However, some of the documents declare:

<?xml version="1.0" encoding="ISO-8859-1"?>

Even though ISO-8859-1 is declared, SAX still defaults to UTF-8. Only if I add:

is.setEncoding("ISO-8859-1");

will SAX use the correct encoding.

How can I let SAX automatically detect the correct encoding from the XML declaration without me specifically setting it? I need this because I don't know beforehand what the encoding of the file will be.

Thanks in advance, Allan

Accepted answer by Jarekczek

Use InputStream as the argument to InputSource when you want SAX to autodetect the encoding.

If you want to set a specific encoding, use a Reader with a specified encoding or the setEncoding method.
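
As a rough sketch of these two explicit-encoding options, the following assumes a document with no XML declaration that is known out of band to be ISO-8859-1. The class name, method names, and sample document are hypothetical and only serve the illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original question or answers.
final class ExplicitEncodingDemo {

    /** Option 1: keep the byte stream and name the encoding on the InputSource. */
    static String parseWithSetEncoding(byte[] xmlBytes, String encoding) throws Exception {
        InputSource source = new InputSource(new ByteArrayInputStream(xmlBytes));
        source.setEncoding(encoding);
        return collectText(source);
    }

    /** Option 2: decode the bytes yourself and hand the parser a Reader. */
    static String parseWithReader(byte[] xmlBytes, Charset charset) throws Exception {
        Reader reader = new InputStreamReader(new ByteArrayInputStream(xmlBytes), charset);
        return collectText(new InputSource(reader));
    }

    /** Runs a SAX parse and returns the concatenated character data. */
    private static String collectText(InputSource source) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // No XML declaration here, so the parser has nothing to autodetect from.
        byte[] latin1 = "<title>caf\u00e9</title>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(parseWithSetEncoding(latin1, "ISO-8859-1"));
        System.out.println(parseWithReader(latin1, StandardCharsets.ISO_8859_1));
    }
}
```

Either option takes the decision away from the parser, so it only makes sense when the encoding is genuinely known in advance.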

Why? Because encoding autodetection algorithms require the raw bytes, not data that has already been converted to characters.
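
A minimal sketch of the autodetecting case, assuming an in-memory ISO-8859-1 document stands in for the real feed (the class name, method name, and sample document are hypothetical). The parser is given only a byte stream and no call to setEncoding, so it has to read the declaration itself.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original question or answers.
final class SaxAutodetectDemo {

    /** Parses raw XML bytes and returns all character data found in the document. */
    static String parseText(byte[] xmlBytes) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        // Byte stream only, no setEncoding(): the parser inspects the
        // raw bytes and the XML declaration to choose the decoder.
        InputSource source = new InputSource(new ByteArrayInputStream(xmlBytes));
        parser.parse(source, handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><title>caf\u00e9</title>";
        // Encode the sample as ISO-8859-1, matching what the declaration claims.
        byte[] latin1 = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(parseText(latin1).equals("caf\u00e9"));
    }
}
```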

The question in the subject is: How to let the SAX parser determine the encoding from the XML declaration? I found Allan's answer to the question misleading, so I provided an alternative one, based on Jörn Horstmann's comment and my later experience.

Answer by Allan

I found the answer myself.

The SAX parser uses InputSource internally. From the InputSource docs:

The SAX parser will use the InputSource object to determine how to read XML input. If there is a character stream available, the parser will read that stream directly, disregarding any text encoding declaration found in that stream. If there is no character stream, but there is a byte stream, the parser will use that byte stream, using the encoding specified in the InputSource or else (if no encoding is specified) autodetecting the character encoding using an algorithm such as the one in the XML specification. If neither a character stream nor a byte stream is available, the parser will attempt to open a URI connection to the resource identified by the system identifier.

So basically you need to pass a character stream to the parser for it to pick up the correct encoding. See the solution below:

SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
FeedHandler handler = new FeedHandler();
Reader isr = new InputStreamReader(getInputStream());
InputSource is = new InputSource();
is.setCharacterStream(isr);
parser.parse(is, handler);
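
One caveat, going by the InputSource documentation quoted above: once a character stream is supplied, the parser disregards the encoding declaration, and `new InputStreamReader(getInputStream())` decodes with the platform default charset. The sketch below is only an illustration under stated assumptions: the class, method names, and sample document are hypothetical, and UTF-8 stands in for whatever the platform default happens to be.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original question or answers.
final class ReaderIgnoresDeclarationDemo {

    /** Byte stream: the parser decodes the bytes itself, using the declaration. */
    static String parseFromBytes(byte[] xmlBytes) throws Exception {
        return collectText(new InputSource(new ByteArrayInputStream(xmlBytes)));
    }

    /** Character stream: the bytes are decoded before the parser ever sees them. */
    static String parseFromReader(byte[] xmlBytes, Charset readerCharset) throws Exception {
        Reader reader = new InputStreamReader(new ByteArrayInputStream(xmlBytes), readerCharset);
        return collectText(new InputSource(reader));
    }

    /** Runs a SAX parse and returns the concatenated character data. */
    private static String collectText(InputSource source) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><title>caf\u00e9</title>";
        byte[] latin1 = xml.getBytes(StandardCharsets.ISO_8859_1);
        // Expected to match: the declaration is honoured for a byte stream.
        System.out.println(parseFromBytes(latin1).equals("caf\u00e9"));
        // Expected not to match: the Reader has already mis-decoded the bytes as UTF-8.
        System.out.println(parseFromReader(latin1, StandardCharsets.UTF_8).equals("caf\u00e9"));
    }
}
```

If the Reader's charset happens to match the document, the result is correct, which may be why the snippet above appeared to work; the declaration itself plays no part in that outcome.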