Original question: http://stackoverflow.com/questions/24481238/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
ByteBuffer, CharBuffer, String and Charset
Asked by mins
I'm trying to sort out characters, their representation in byte sequences according to character sets, and how to convert from one character set to another in Java. I've some difficulties.
For instance,
ByteBuffer bybf = ByteBuffer.wrap("Olé".getBytes());
My understanding is that:
- Strings are always stored as a UTF-16 byte sequence in Java (2 bytes per character, big endian)
- getBytes() result is this same UTF-16 byte sequence
- wrap() maintains this sequence
- bybf is therefore a UTF-16 big endian representation of the string Olé
Thus in this code:
Charset utf16 = Charset.forName("UTF-16");
CharBuffer chbf = utf16.decode(bybf);
System.out.println(chbf);
decode() should
- Interpret bybf as a UTF-16 string representation
- "convert" it to the original string Olé.
Actually no byte should be altered, since everything is stored as UTF-16 and the UTF-16 Charset should be a kind of "neutral operator". However, the result is printed as:
??
How can that be?
Additional question: For converting correctly, it seems Charset.decode(ByteBuffer bb) requires bb to be a UTF-16 big endian byte sequence image of a string. Is that correct?
Edit: From the answers provided, I did some testing to print the content of a ByteBuffer and the chars obtained by decoding it. The bytes [encoded with "Olé".getBytes(charsetName)] are printed on the first line of each group; the other line(s) are the strings obtained by decoding the bytes back [with Charset#decode(ByteBuffer)] using various Charsets.
I also confirmed that the default encoding for storing a String into a byte[] on a Windows 7 computer is windows-1252 (unless strings contain chars requiring UTF-8).
Default VM encoding: windows-1252
Sample string: "Olé"

  getBytes() no CS provided : 79 108 233  <-- default (windows-1252), 1 byte per char
     Decoded as windows-1252: Olé         <-- using the same CS than getBytes()
           Decoded as UTF-16: ??          <-- using another CS (doesn't work indeed)

  getBytes with windows-1252: 79 108 233  <-- same than getBytes()
     Decoded as windows-1252: Olé

         getBytes with UTF-8: 79 108 195 169  <-- 'é' in UTF-8 use 2 bytes
            Decoded as UTF-8: Olé

        getBytes with UTF-16: 254 255 0 79 0 108 0 233  <-- each char uses 2 bytes with UTF-16
           Decoded as UTF-16: Olé                       (254-255 is an encoding tag)
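The byte listings above can be reproduced without depending on the platform default by always passing an explicit Charset. The following is a minimal sketch, not the asker's original test program: the class and method names (EncodingTable, unsigned, encodedBytes) are made up for illustration, and ISO-8859-1 stands in for windows-1252 on the assumption that the two agree on the three characters of this sample. The string is written with a \u escape so the result does not depend on the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingTable {
    // Render bytes as unsigned decimal values, like the listing in the question.
    static String unsigned(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(b & 0xFF); // mask to get 0..255 instead of -128..127
        }
        return sb.toString();
    }

    // Encode the text with an explicit charset and format the resulting bytes.
    static String encodedBytes(String text, Charset cs) {
        return unsigned(text.getBytes(cs));
    }

    public static void main(String[] args) {
        String sample = "Ol\u00e9"; // "Olé"
        Charset[] charsets = {
            StandardCharsets.ISO_8859_1, // stand-in for windows-1252 (assumption)
            StandardCharsets.UTF_8,
            StandardCharsets.UTF_16
        };
        for (Charset cs : charsets) {
            byte[] bytes = sample.getBytes(cs);
            // Decode with the same charset that produced the bytes.
            String decoded = new String(bytes, cs);
            System.out.println(cs.name() + ": " + unsigned(bytes)
                    + " (round trip ok: " + decoded.equals(sample) + ")");
        }
    }
}
```

The UTF-16 line starts with 254 255, the byte order mark that Java's "UTF-16" encoder writes before the big-endian code units.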
Accepted answer by BevynQ
You are mostly correct.
The native character representation in Java is UTF-16. However, when converting characters to bytes you either specify the charset you are using, or the system uses its default, which has usually been UTF-8 whenever I checked. This will yield interesting results if you mix and match.
E.g. on my system the following
System.out.println(Charset.defaultCharset().name());
ByteBuffer bybf = ByteBuffer.wrap("Olé".getBytes());
Charset utf16 = Charset.forName("UTF-16");
CharBuffer chbf = utf16.decode(bybf);
System.out.println(chbf);
bybf = ByteBuffer.wrap("Olé".getBytes(utf16));
chbf = utf16.decode(bybf);
System.out.println(chbf);
produces
UTF-8
佬?
Olé
So this part is only correct if UTF-16 is the default charset: "getBytes() result is this same UTF-16 byte sequence".
So either always specify the charset you are using, which is safest because you will always know what is going on, or always use the default.
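The mix-and-match effect described above can be made visible by encoding with one charset and decoding with another. This is only a sketch under stated assumptions: CharsetMismatch and roundTrip are illustrative names, and UTF-8 is passed explicitly to mimic a platform whose default charset is UTF-8, so the outcome does not depend on the machine running it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatch {
    // Encode with one charset, then decode the same bytes with another (or the same) charset.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        ByteBuffer bytes = ByteBuffer.wrap(text.getBytes(encodeWith));
        return decodeWith.decode(bytes).toString();
    }

    public static void main(String[] args) {
        String sample = "Ol\u00e9"; // "Olé"

        // Same charset on both sides: the original string comes back.
        String matched = roundTrip(sample, StandardCharsets.UTF_16, StandardCharsets.UTF_16);
        System.out.println("matched equals sample: " + matched.equals(sample));

        // UTF-8 bytes (4F 6C C3 A9) read as big-endian UTF-16 pair up into two unrelated chars.
        String mismatched = roundTrip(sample, StandardCharsets.UTF_8, StandardCharsets.UTF_16);
        for (char c : mismatched.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

The mismatched case prints the code points U+4F6C and U+C3A9: the four UTF-8 bytes are simply paired into two 16-bit units, which is why the first printed character in the output above is a CJK ideograph.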
Answer by user207421
Strings are always stored as a UTF-16 byte sequence in Java (2 bytes per character, big endian)
Yes.
getBytes() result is this same UTF-16 byte sequence
No. It encodes the UTF-16 chars into the platform default charset, whatever that is. Deprecated.
wrap() maintains this sequence
wrap() maintains everything.
bybf is therefore an UTF-16 big endian representation of the string Olé
No. It wraps the platform's default encoding of the original string.
decode() should
- Interpret bybf as an UTF-16 string representation
No, see above.
- "convert" it to the original string Olé.
Not unless the platform's default encoding is "UTF-16".
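The points above can be checked by comparing the string's 16-bit code units with what getBytes(Charset) returns. The sketch below is an illustration only; Utf16Variants and codeUnitsBigEndian are hypothetical names. It builds the big-endian byte image of the chars by hand and compares it with three explicit encodings.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf16Variants {
    // Build the big-endian byte image of the string's UTF-16 code units by hand.
    static byte[] codeUnitsBigEndian(String text) {
        byte[] out = new byte[text.length() * 2];
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            out[2 * i] = (byte) (c >> 8); // high byte first
            out[2 * i + 1] = (byte) c;    // then low byte
        }
        return out;
    }

    public static void main(String[] args) {
        String sample = "Ol\u00e9"; // "Olé"
        byte[] manual = codeUnitsBigEndian(sample);

        // UTF-16BE has no byte order mark, so it matches the code units exactly.
        System.out.println(Arrays.equals(manual, sample.getBytes(StandardCharsets.UTF_16BE))); // true
        // "UTF-16" prepends the FE FF byte order mark when encoding, so it differs.
        System.out.println(Arrays.equals(manual, sample.getBytes(StandardCharsets.UTF_16)));   // false
        // Any other charset produces an unrelated byte sequence.
        System.out.println(Arrays.equals(manual, sample.getBytes(StandardCharsets.UTF_8)));    // false
    }
}
```

So getBytes yields the "same UTF-16 byte sequence" only when UTF-16BE is requested explicitly; the no-argument getBytes() uses whatever the platform default happens to be.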
Answer by Wolf
I had nearly the same problem with data encoded in a double-byte charset. Answer 3 above already contains the critical pitfalls you should keep an eye on.
- Define a Charset for the source encoding.
- Define a Charset for the target encoding only if it is different from your local system encoding.
The following code works:
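import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public static String convertUTF16ToString(byte[] doc)
{
    // Source encoding of the incoming bytes.
    final Charset doublebyte = StandardCharsets.UTF_16;
    // Don't need this because it is my local (system default).
    //final Charset ansiCharset = StandardCharsets.ISO_8859_1;
    final CharBuffer decoded = doublebyte.decode(ByteBuffer.wrap(doc));
    // A CharBuffer converts directly to a String; no StringBuffer is needed.
    return decoded.toString();
}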
Replace the system default with your preferred encoding:
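public static String convertUTF16ToUTF8(byte[] doc)
{
    final Charset doublebyte = StandardCharsets.UTF_16;
    final Charset utfCharset = StandardCharsets.UTF_8;
    final Charset ansiCharset = StandardCharsets.ISO_8859_1;
    final CharBuffer decoded = doublebyte.decode(ByteBuffer.wrap(doc));
    final ByteBuffer encoded = ansiCharset.encode(decoded);
    // alternative: encode to UTF-8 instead
    //final ByteBuffer encoded = utfCharset.encode(decoded);
    // Copy only the encoded bytes; array() may expose unused trailing capacity.
    final byte[] result = new byte[encoded.remaining()];
    encoded.get(result);
    // Decode with the charset used for encoding, not the platform default.
    // (Use utfCharset here as well if you switch to the UTF-8 alternative.)
    return new String(result, ansiCharset);
}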
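When the goal is bytes in another charset rather than a String, the two steps above (decode with the source charset, encode with the target charset) can also be written with String alone. This is a hedged alternative sketch, not the answerer's code; Transcode and transcode are names invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Transcode {
    // Hypothetical helper: decode the bytes with the source charset,
    // then re-encode the resulting String with the target charset.
    static byte[] transcode(byte[] input, Charset source, Charset target) {
        return new String(input, source).getBytes(target);
    }

    public static void main(String[] args) {
        byte[] utf16 = "Ol\u00e9".getBytes(StandardCharsets.UTF_16); // BOM + big-endian units
        byte[] utf8 = transcode(utf16, StandardCharsets.UTF_16, StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(utf8)); // signed values: é becomes -61, -87
    }
}
```

Both charsets are named explicitly, so the result does not change with the platform default. Characters that the target charset cannot represent are replaced by that charset's default replacement, typically '?'.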