Java: What is an XML BOM and how do I detect it?
Note: this page reproduces a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Original source: http://stackoverflow.com/questions/1772321/
Asked by djangofan
What exactly is the BOM in an ANSI XML document, and should it be removed? Should an XML document be in UTF-8 instead? Can anyone tell me a Java method that will detect the BOM? The BOM consists of the bytes EF BB BF.
Accepted answer by jitter
For an ANSI XML file it should actually be removed. If you want to use UTF-8 you don't really need it. It is only needed for UTF-16 and UTF-32.
The Byte-Order-Mark (or BOM) is a special marker added at the very beginning of a Unicode file encoded in UTF-8, UTF-16 or UTF-32. It is used to indicate whether the file uses the big-endian or little-endian byte order. The BOM is mandatory for UTF-16 and UTF-32, but it is optional for UTF-8.
(Source: https://www.opentag.com/xfaq_enc.htm#enc_bom)
Regarding the question of how to detect this in Java:
Check the following answer to this question: "Java: How to determine the correct charset encoding of a stream". If you want to determine the BOM yourself (at your own risk), see for example this code: "Java Tip: How to read a file and automatically specify the correct encoding".
Basically, just read in the first few bytes yourself and then determine whether you may have found a BOM.
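The "read the first few bytes" approach could be sketched roughly as follows. This is a minimal illustration and not the code from either linked article; the class name `BomSniffer` and the method `sniff` are hypothetical. It compares the start of a buffer against the well-known encoded forms of U+FEFF and returns only a hint, since a file without a BOM can happen to begin with the same bytes.

```java
final class BomSniffer {

    /**
     * Returns the encoding name suggested by a leading BOM, or null if no
     * known BOM byte sequence is found at the start of the buffer.
     */
    static String sniff(byte[] head) {
        // The 4-byte UTF-32 marks are tested first, because the UTF-32LE mark
        // (ff fe 00 00) also begins with the UTF-16LE mark (ff fe).
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    // Compares unsigned byte values against the expected prefix.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', '?', 'x', 'm', 'l'};
        System.out.println(sniff(sample)); // prints UTF-8
    }
}
```

The result is best treated as a guess: `ff fe 00 00` could also be a UTF-16LE BOM followed by a NUL character, which is one reason the answer says you "may" have found a BOM.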
Answer by McDowell
The byte order mark is likely to be one of these byte sequences:
UTF-8 BOM: ef bb bf
UTF-16BE BOM: fe ff
UTF-16LE BOM: ff fe
UTF-32BE BOM: 00 00 fe ff
UTF-32LE BOM: ff fe 00 00
These are the variously encoded forms of the Unicode code point U+FEFF. This can be expressed as a Java char literal using '\uFEFF' (Java char values are implicitly UTF-16). Since U+FEFF isn't in the character repertoire of most non-Unicode encodings, it is not possible for this BOM code point to be encoded by them. (More on encoding the BOM using Java here.)
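To see that the byte sequences are simply U+FEFF run through each encoding, a small sketch like the one below can help. The class and method names (`BomBytes`, `encodedBom`, `canEncodeBom`) are made up for illustration. It also assumes the running JDK provides the UTF-32BE and UTF-32LE charsets by name, which common JDK builds do but which is not guaranteed everywhere.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomBytes {

    // Formats bytes as lowercase hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    // Encodes the single code point U+FEFF with the given charset.
    static String encodedBom(Charset charset) {
        return hex("\uFEFF".getBytes(charset));
    }

    // Reports whether the charset has any representation for U+FEFF.
    static boolean canEncodeBom(Charset charset) {
        return charset.newEncoder().canEncode('\uFEFF');
    }

    public static void main(String[] args) {
        String[] names = {"UTF-8", "UTF-16BE", "UTF-16LE", "UTF-32BE", "UTF-32LE"};
        for (String name : names) {
            System.out.println(name + ": " + encodedBom(Charset.forName(name)));
        }
        // A single-byte encoding such as ISO-8859-1 has no slot for U+FEFF.
        System.out.println("ISO-8859-1 can encode it: "
                + canEncodeBom(StandardCharsets.ISO_8859_1));
    }
}
```

The explicit-endian charsets are used on purpose: Java's plain "UTF-16" encoder writes its own BOM in front of the data, which would show the mark twice.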
When it comes to BOMs and XML, they are optional (see also the Unicode BOM FAQ). Detection of encoding in XML is relatively straightforward if the encoding is specified in the declaration. Always make sure that the XML declaration (<?xml version="1.0" encoding="UTF-8"?>) matches the encoding used to write the document. If you are strict about this, parsers should be able to interpret your documents correctly. (XML spec on encoding detection.)
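One way to keep the declaration and the bytes in step is to derive both from the same `Charset` object. The following is a hedged sketch with hypothetical names (`XmlDeclarationDemo`, `write`, `rootText`). It builds the document by string concatenation purely for illustration and reads it back with the JDK's default DOM parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

final class XmlDeclarationDemo {

    /**
     * Serializes the body with a declaration naming the same charset
     * that is used to turn the text into bytes.
     */
    static byte[] write(String body, Charset charset) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>" + body;
        return xml.getBytes(charset);
    }

    /** Parses the bytes and returns the text content of the root element. */
    static String rootText(byte[] xmlBytes) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes))
                .getDocumentElement()
                .getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String body = "<name>caf\u00e9</name>";
        // The parser reads the declaration and decodes the bytes accordingly,
        // so both encodings round-trip to the same text.
        System.out.println(rootText(write(body, StandardCharsets.ISO_8859_1)));
        System.out.println(rootText(write(body, StandardCharsets.UTF_8)));
    }
}
```

Because the parser is handed raw bytes rather than a `Reader`, it is the declaration (plus any BOM) that decides how the bytes are decoded.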
I advocate encoding as Unicode wherever possible (see also the 10 Commandments of Unicode). That said, XML allows the representation of any Unicode character via escape entities (e.g. 'A' could be represented by &#65;), so it isn't necessarily a requirement to avoid data loss.
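As a rough illustration of the escaping idea, the sketch below rewrites element text so that everything outside ASCII becomes a decimal character reference such as `&#233;`. The names `CharRefEscaper` and `toAsciiXmlText` are hypothetical. It only handles `&` and `<` among the markup characters and does not filter control characters that XML 1.0 forbids, so treat it as a demonstration rather than a complete serializer.

```java
final class CharRefEscaper {

    /**
     * Escapes element text so the result contains only ASCII characters.
     * Non-ASCII code points become decimal numeric character references.
     */
    static String toAsciiXmlText(String text) {
        StringBuilder out = new StringBuilder();
        // Iterate by code point so supplementary characters yield one reference.
        text.codePoints().forEach(cp -> {
            if (cp == '&') {
                out.append("&amp;");
            } else if (cp == '<') {
                out.append("&lt;");
            } else if (cp < 0x80) {
                out.appendCodePoint(cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toAsciiXmlText("caf\u00e9 & bar")); // prints caf&#233; &amp; bar
    }
}
```

With text escaped this way, a document can be stored in a narrow encoding such as US-ASCII without losing characters, which is the point the answer makes about data loss.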
Answer by bill seacham
Do not insert a BOM in a UTF-8 file: if two such files are merged, you end up with a BOM in the middle, which might break an application or cause an XML parser to throw an exception.
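The merge problem can be reproduced in a few lines. This sketch uses made-up names (`BomMergeDemo`, `withBom`, `naiveMerge`) and in-memory byte arrays in place of real files. It relies on the fact that Java's UTF-8 decoder keeps a BOM as the character U+FEFF instead of dropping it.

```java
import java.nio.charset.StandardCharsets;

final class BomMergeDemo {

    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Joins two byte arrays end to end, the way a plain file concatenation would.
    static byte[] concat(byte[] a, byte[] b) {
        byte[] out = new byte[a.length + b.length];
        System.arraycopy(a, 0, out, 0, a.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }

    // Simulates the content of a UTF-8 file that was saved with a BOM.
    static byte[] withBom(String text) {
        return concat(UTF8_BOM, text.getBytes(StandardCharsets.UTF_8));
    }

    // Concatenates the two "files" and decodes the result as UTF-8.
    static String naiveMerge(byte[] first, byte[] second) {
        return new String(concat(first, second), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String merged = naiveMerge(withBom("<a/>"), withBom("<b/>"));
        // Searching from index 1 skips the leading BOM and finds the stray one.
        System.out.println(merged.indexOf('\uFEFF', 1)); // prints 5
    }
}
```

The second U+FEFF sits between the two fragments, where consumers that only expect a BOM at the very start will see it as unexpected content.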
Answer by Robert Fleming
OP:
Can anyone tell me a Java method that will detect the BOM?
org.apache.commons.io.input.BOMInputStream
Javadocs:
This class detects these bytes and, if required, can automatically skip them and return the subsequent byte as the first byte in the stream.
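If adding Commons IO is not an option, the detect-and-skip behaviour described in the Javadoc can be approximated with the JDK alone. The sketch below is not how BOMInputStream is implemented; it is a minimal stand-in with hypothetical names (`BomSkipper`, `skipUtf8Bom`), it only handles the UTF-8 BOM, and it uses `java.io.PushbackInputStream` to put back bytes that turn out not to be a BOM.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

final class BomSkipper {

    /**
     * Wraps the stream so that a leading UTF-8 BOM (ef bb bf), if present,
     * is consumed and the next read returns the byte that follows it.
     */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until
        // three bytes are available or the stream ends.
        while (n < head.length) {
            int r = pushback.read(head, n, head.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean bom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!bom && n > 0) {
            // Not a BOM: return the bytes so the caller sees the original data.
            pushback.unread(head, 0, n);
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        InputStream in = skipUtf8Bom(new ByteArrayInputStream(data));
        System.out.println((char) in.read()); // prints <
    }
}
```

The returned stream can then be handed to an XML parser or wrapped in an `InputStreamReader` as usual.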