Java ByteBuffer, CharBuffer, String and Charset

Question

Asked by mins

I'm trying to sort out characters, their representation as byte sequences according to character sets, and how to convert from one character set to another in Java. I'm having some difficulties.

For instance,

ByteBuffer bybf = ByteBuffer.wrap("Olé".getBytes());

My understanding is that:

Strings are always stored as a UTF-16 byte sequence in Java (2 bytes per character, big endian)
The getBytes() result is this same UTF-16 byte sequence
wrap() maintains this sequence
bybf is therefore a UTF-16 big endian representation of the string Olé

Thus in this code:

Charset utf16 = Charset.forName("UTF-16");  
CharBuffer chbf = utf16.decode(bybf);  
System.out.println(chbf);

decode() should

Interpret bybf as a UTF-16 string representation
"convert" it to the original string Olé.

Actually no byte should be altered, since everything is stored as UTF-16 and the UTF-16 Charset should be a kind of "neutral operator". However, the result is printed as:

??

How can that be?

Additional question: to convert correctly, it seems Charset.decode(ByteBuffer bb) requires bb to be a UTF-16 big endian byte sequence image of a string. Is that correct?

Edit: Based on the answers provided, I did some testing to print a ByteBuffer's content and the chars obtained by decoding it. The bytes [encoded with "Olé".getBytes(charsetName)] are printed on the first line of each group; the other line(s) are the strings obtained by decoding the bytes back [with Charset#decode(ByteBuffer)] using various Charsets.

I also confirmed that the default encoding for storing a String into a byte[] on a Windows 7 computer is windows-1252 (unless strings contain chars requiring UTF-8).

Default VM encoding: windows-1252
Sample string: "Olé"

  getBytes() no CS provided : 79 108 233  <-- default (windows-1252), 1 byte per char
     Decoded as windows-1252: Olé         <-- using the same CS as getBytes()
           Decoded as UTF-16: ??          <-- using another CS (indeed doesn't work)

  getBytes with windows-1252: 79 108 233  <-- same as getBytes()
     Decoded as windows-1252: Olé

         getBytes with UTF-8: 79 108 195 169  <-- 'é' uses 2 bytes in UTF-8
            Decoded as UTF-8: Olé

        getBytes with UTF-16: 254 255 0 79 0 108 0 233  <-- each char uses 2 bytes in UTF-16
           Decoded as UTF-16: Olé                           (254 255 is the byte order mark)
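
A table like the one above could be reproduced with something along these lines. This is only a sketch: the class name EncodingTable and the helper unsigned are made up for illustration, and ISO-8859-1 stands in for windows-1252 because it is guaranteed to be available through StandardCharsets (both encode 'é' as the single byte 233).

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingTable {
    // Render bytes as unsigned decimal values, matching the notation of the table.
    static String unsigned(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(b & 0xFF);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "Olé", written with an escape so the source file's own encoding cannot interfere.
        String sample = "Ol\u00e9";
        Charset[] charsets = {
            StandardCharsets.ISO_8859_1, // assumption: stand-in for windows-1252
            StandardCharsets.UTF_8,
            StandardCharsets.UTF_16,     // writes a byte order mark, then big endian
            StandardCharsets.UTF_16BE,   // big endian, no byte order mark
            StandardCharsets.UTF_16LE    // little endian, no byte order mark
        };
        for (Charset cs : charsets) {
            byte[] encoded = sample.getBytes(cs);
            // Decode with the same charset that produced the bytes.
            String decoded = cs.decode(ByteBuffer.wrap(encoded)).toString();
            System.out.println(cs.name() + ": " + unsigned(encoded) + " -> " + decoded);
        }
    }
}
```

The UTF-16BE and UTF-16LE rows show that the leading 254 255 is not part of the text itself: those two charsets fix the byte order up front and emit no marker.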

Answer 1

Accepted answer by BevynQ

You are mostly correct.

The native character representation in Java is UTF-16. However, when converting characters to bytes you either specify the charset you are using, or the system uses its default, which has usually been UTF-8 whenever I checked. This will yield interesting results if you mix and match charsets.

E.g., on my system the following

System.out.println(Charset.defaultCharset().name());
ByteBuffer bybf = ByteBuffer.wrap("Olé".getBytes());
Charset utf16 = Charset.forName("UTF-16");
CharBuffer chbf = utf16.decode(bybf);
System.out.println(chbf);
bybf = ByteBuffer.wrap("Olé".getBytes(utf16));
chbf = utf16.decode(bybf);
System.out.println(chbf);

produces

UTF-8
佬?
Olé

So this part is only correct if UTF-16 is the default charset:
"getBytes() result is this same UTF-16 byte sequence."

So either always specify the charset you are using, which is safest because you will always know what is going on, or always use the default.
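
As a rough illustration of that advice, the sketch below names the charset on both the encoding and the decoding side, and contrasts it with the mismatch from the question. The class and method names (ExplicitCharset, roundTrip, mismatched) are hypothetical, chosen only for this example.

```java
import java.nio.charset.StandardCharsets;

public class ExplicitCharset {
    // Encode and decode with the same, explicitly named charset.
    static String roundTrip(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Encode with one charset and decode with another: the mismatch from the question.
    static String mismatched(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        String sample = "Ol\u00e9"; // "Olé"
        System.out.println(roundTrip(sample).equals(sample));  // the text survives
        System.out.println(mismatched(sample).equals(sample)); // the text is garbled
    }
}
```

Because neither method touches the platform default, the behavior should be the same on a machine defaulting to windows-1252 as on one defaulting to UTF-8.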

Answer 2

Answer by user207421

Strings are always stored as a UTF-16 byte sequence in Java (2 bytes per character, big endian)

Yes.

getBytes() result is this same UTF-16 byte sequence

No. It encodes the UTF-16 chars into the platform default charset, whatever that is. Relying on that default is best avoided; prefer getBytes(Charset).

wrap() maintains this sequence

wrap() maintains everything.

bybf is therefore a UTF-16 big endian representation of the string Olé

No. It wraps the platform's default encoding of the original string.

decode() should
Interpret bybf as a UTF-16 string representation

No, see above.

"convert" it to the original string Olé.

Not unless the platform's default encoding is "UTF-16".
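
On the additional question in the post: as far as I can tell, Charset.decode(ByteBuffer) does not require UTF-16 input. It interprets the bytes according to whichever Charset it is called on, so the bytes only need to match that charset. A minimal sketch, with the class name DecodeMatchesEncode and the helper decode invented for illustration:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeMatchesEncode {
    // Decode raw bytes with the charset that produced them.
    static String decode(byte[] bytes, Charset charset) {
        CharBuffer chars = charset.decode(ByteBuffer.wrap(bytes));
        return chars.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = {79, 108, (byte) 195, (byte) 169};  // "Olé" encoded as UTF-8
        byte[] utf16le = {79, 0, 108, 0, (byte) 233, 0};  // "Olé" encoded as UTF-16LE
        // Different byte sequences, same text, as long as the decoder matches the bytes.
        System.out.println(decode(utf8, StandardCharsets.UTF_8));
        System.out.println(decode(utf16le, StandardCharsets.UTF_16LE));
    }
}
```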

Answer 3

Answer by Wolf

I had nearly the same problem with data encoded in a double-byte charset. The answers above already cover the critical pitfalls you should keep an eye on.

Define a Charset for the source encoding.
Define a Charset for the target encoding only if it differs from your local system encoding.

The following code works:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public static String convertUTF16ToString(byte[] doc)
{
    final Charset doublebyte = StandardCharsets.UTF_16;
    // Don't need this because it is my local (system default).
    //final Charset ansiCharset = StandardCharsets.ISO_8859_1;

    // Decode the UTF-16 bytes into chars; a CharBuffer converts directly to a String.
    final CharBuffer decoded = doublebyte.decode(ByteBuffer.wrap(doc));
    return decoded.toString();
}

Replace the system default with your preferred encoding.

public static String convertUTF16ToUTF8(byte[] doc)
{
    final Charset doublebyte = StandardCharsets.UTF_16;
    final Charset utfCharset = StandardCharsets.UTF_8;
    final Charset ansiCharset = StandardCharsets.ISO_8859_1;
    // Target encoding; use utfCharset here instead as the UTF-8 alternative.
    final Charset target = ansiCharset;

    final CharBuffer decoded = doublebyte.decode(ByteBuffer.wrap(doc));
    final ByteBuffer encoded = target.encode(decoded);
    // Copy only the encoded bytes; array() can include unused trailing capacity.
    final byte[] result = new byte[encoded.remaining()];
    encoded.get(result);

    // Decode with the target charset rather than the platform default.
    return new String(result, target);
}
