Java: How do I encode/decode UTF-16LE byte arrays with a BOM?

Question

Asked by Jared Oberhaus

I need to encode/decode UTF-16 byte arrays to and from java.lang.String. The byte arrays are given to me with a Byte Order Mark (BOM), and I need to produce encoded byte arrays with a BOM.

Also, because I'm dealing with a Microsoft client/server, I'd like to emit the encoding in little endian (along with the LE BOM) to avoid any misunderstandings. I do realize that with the BOM it should work big endian, but I don't want to swim upstream in the Windows world.

As an example, here is a method which encodes a java.lang.String as UTF-16 in little endian with a BOM:

public static byte[] encodeString(String message) {

    byte[] tmp = null;
    try {
        tmp = message.getBytes("UTF-16LE");
    } catch (UnsupportedEncodingException e) {
        // should not be possible: every Java platform must support UTF-16LE
        AssertionError ae =
            new AssertionError("Could not encode UTF-16LE");
        ae.initCause(e);
        throw ae;
    }

    // use brute force method to add the little-endian BOM (FF FE)
    byte[] utf16lemessage = new byte[2 + tmp.length];
    utf16lemessage[0] = (byte) 0xFF;
    utf16lemessage[1] = (byte) 0xFE;
    System.arraycopy(tmp, 0,
                     utf16lemessage, 2,
                     tmp.length);
    return utf16lemessage;
}

What is the best way to do this in Java? Ideally I'd like to avoid copying the entire byte array into a new byte array that has two extra bytes allocated at the beginning.

The same goes for decoding such a string, but that's much more straightforward by using the java.lang.String constructor:

public String(byte[] bytes,
              int offset,
              int length,
              String charsetName)

Answer 1

Accepted answer by McDowell

The "UTF-16" charset name will always encode with a BOM and will decode data using either big- or little-endian byte order, but "UnicodeBig" and "UnicodeLittle" are useful for encoding in a specific byte order. Use UTF-16LE or UTF-16BE for no BOM; see this post for how to use "\uFEFF" to handle BOMs manually. See here for the canonical naming of charset string names or (preferably) the Charset class. Also take note that only a limited subset of encodings is absolutely required to be supported.
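
As a rough illustration of the manual "\uFEFF" approach mentioned above, the sketch below prepends U+FEFF to the string before encoding with UTF-16LE, so the encoder itself emits the FF FE bytes. The class name BomDemo and the helper encodeWithBom are hypothetical names chosen for this example, not part of the original answer. For comparison, it also prints what the plain "UTF-16" charset produces.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class BomDemo {
    // Hypothetical helper: encodes as UTF-16LE with a leading BOM by
    // prepending U+FEFF, which UTF-16LE encodes as the bytes FF FE.
    static byte[] encodeWithBom(String message) {
        return ("\uFEFF" + message).getBytes(StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        // Little endian with BOM: FF FE, then 'A' as 41 00.
        System.out.println(Arrays.toString(encodeWithBom("A"))); // [-1, -2, 65, 0]

        // Java's "UTF-16" charset writes a big-endian BOM (FE FF) and big-endian data.
        byte[] be = "A".getBytes(StandardCharsets.UTF_16);
        System.out.println(Arrays.toString(be)); // [-2, -1, 0, 65]
    }
}
```

Note that the string concatenation still creates a temporary copy of the characters, so this is shorter than the arraycopy version in the question but not necessarily cheaper.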

Answer 2

Answered by Yishai

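    // Pre-size the stream: 2 bytes for the BOM plus 2 bytes per UTF-16 code unit.
    ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream(string.length() * 2 + 2);
    byteArrayOutputStream.write(0xFF); // little-endian BOM is FF FE
    byteArrayOutputStream.write(0xFE);
    byte[] encoded = string.getBytes(StandardCharsets.UTF_16LE);
    byteArrayOutputStream.write(encoded, 0, encoded.length); // this overload declares no IOException
    return byteArrayOutputStream.toByteArray();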

EDIT: Rereading your question, I see you would rather avoid the double array allocation altogether. Unfortunately the API doesn't give you that, as far as I know. (There was a method, but it is deprecated, and you can't specify encoding with it).

I wrote the above before I saw your comment. I think the answer suggesting the nio classes is on the right track. I was looking at that, but I'm not familiar enough with the API to know offhand how you get that done.

Answer 3

Answered by Daniel Martin

First off, for decoding you can use the character set "UTF-16"; that automatically detects an initial BOM. For encoding UTF-16BE, you can also use the "UTF-16" character set - that'll write a proper BOM and then output big endian stuff.
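
A minimal sketch of the decoding side, assuming the input always starts with a BOM: the "UTF-16" charset reads the BOM, picks the byte order from it, and drops it from the result. The class and method names (BomDecodeDemo, decode, decodeLeSkippingBom) are made up for this example. The second helper shows the alternative of skipping the two BOM bytes by hand and decoding the rest as UTF-16LE.

```java
import java.nio.charset.StandardCharsets;

class BomDecodeDemo {
    // Lets the "UTF-16" charset detect the byte order from the leading BOM.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    // Assumes the first two bytes are the LE BOM (FF FE) and skips them.
    // Decoding them with UTF-16LE instead would leave U+FEFF in the string.
    static String decodeLeSkippingBom(byte[] bytes) {
        return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        byte[] le = { (byte) 0xFF, (byte) 0xFE, 'H', 0, 'i', 0 };
        byte[] be = { (byte) 0xFE, (byte) 0xFF, 0, 'H', 0, 'i' };
        System.out.println(decode(le));              // Hi
        System.out.println(decode(be));              // Hi
        System.out.println(decodeLeSkippingBom(le)); // Hi
    }
}
```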

For encoding to little endian with a BOM, I don't think your current code is too bad, even with the double allocation (unless your strings are truly monstrous). If they are, you might want to work with a java.nio ByteBuffer rather than a byte array, and use the java.nio.charset.CharsetEncoder class (which you can get from Charset.forName("UTF-16LE").newEncoder()).
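
One way the CharsetEncoder suggestion could look is sketched below: allocate a single ByteBuffer sized for the BOM plus two bytes per char, write the BOM, and let the encoder fill in the rest. The names EncoderBomDemo and encodeWithBom are hypothetical, and wrapping the checked exception in an IllegalArgumentException is just one possible choice.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncoderBomDemo {
    static ByteBuffer encodeWithBom(String message) {
        CharsetEncoder encoder = StandardCharsets.UTF_16LE.newEncoder();
        // UTF-16 needs exactly 2 bytes per Java char, plus 2 bytes for the BOM.
        ByteBuffer out = ByteBuffer.allocate(2 + 2 * message.length());
        out.put((byte) 0xFF).put((byte) 0xFE); // little-endian BOM
        try {
            // Encode directly into the buffer, right after the BOM.
            CoderResult result = encoder.encode(CharBuffer.wrap(message), out, true);
            if (!result.isUnderflow()) {
                result.throwException(); // e.g. an unpaired surrogate in the input
            }
            result = encoder.flush(out);
            if (!result.isUnderflow()) {
                result.throwException();
            }
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Input is not valid UTF-16", e);
        }
        out.flip(); // make the written bytes readable
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer encoded = encodeWithBom("Hi");
        byte[] bytes = new byte[encoded.remaining()];
        encoded.get(bytes);
        System.out.println(Arrays.toString(bytes)); // [-1, -2, 72, 0, 105, 0]
    }
}
```

If the caller can consume the ByteBuffer directly (for example by writing it to a channel), no second byte array is needed at all; the copy in main is only there to print the result.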

Answer 4

Answered by Yishai

This is how you do it in nio:

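    // Prepend U+FEFF so the encoder emits the LE BOM (FF FE) itself;
    // calling put(0, ...) and put(1, ...) would overwrite the first character instead.
    ByteBuffer buffer = Charset.forName("UTF-16LE").encode("\uFEFF" + message);
    // The backing array may be larger than the encoded data, so copy only the valid bytes.
    byte[] bytes = new byte[buffer.remaining()];
    buffer.get(bytes);
    return bytes;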

It is certainly supposed to be faster. I don't know how many arrays it makes under the covers, but my understanding is that the point of the API is to minimize that.

Answer 5

Answered by hopia

This is an old question, but I still couldn't find an acceptable answer for my situation. Basically, Java doesn't have a built-in encoder for UTF-16LE with a BOM, so you have to roll your own implementation.

Here's what I ended up with:

private byte[] encodeUTF16LEWithBOM(final String s) {
    ByteBuffer content = Charset.forName("UTF-16LE").encode(s);
    byte[] bom = { (byte) 0xff, (byte) 0xfe };
    // Size by remaining(), not capacity(): the encoder's buffer may be larger than the encoded data.
    return ByteBuffer.allocate(content.remaining() + bom.length).put(bom).put(content).array();
}
