UTF-16 Character Encoding in Java
Source: http://stackoverflow.com/questions/20966802/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Asked by priyaranjan
I am trying to understand character encoding in Java. Characters in Java are stored in 16 bits using UTF-16 encoding. So when I convert a string containing 6 characters to bytes, I get 6 bytes, as shown below, but I expected 12. Is there a concept I am missing?
package learn.java;

public class CharacterTest {
    public static void main(String[] args) {
        String str = "Hadoop";
        byte[] bt = str.getBytes();
        System.out.println("the length of character array is " + bt.length);
    }
}
Output: the length of character array is 6
Following @Darshan's suggestion, I tried getting the bytes with UTF-16 encoding, but the result is not what I expected either.
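package learn.java;

import java.io.UnsupportedEncodingException;

public class CharacterTest {
    public static void main(String[] args) {
        String str = "Hadoop";
        try {
            byte[] bt = str.getBytes("UTF-16");
            System.out.println("the length of character array is " + bt.length);
        } catch (UnsupportedEncodingException e) {
            // Report the failure instead of silently swallowing it
            e.printStackTrace();
        }
    }
}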
Output: the length of character array is 14
Accepted answer by tucuxi
In the UTF-16 version, you get 14 bytes because a 2-byte byte-order mark (BOM) is inserted at the start to distinguish big-endian (the default) from little-endian. If you specify UTF-16LE, you get 12 bytes (little-endian, no byte-order mark added).
See http://www.unicode.org/faq/utf_bom.html#gen7
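As a minimal sketch of the point above, the following compares the encoded length of "Hadoop" under the three UTF-16 charset variants. The class name Utf16Lengths and the helper encodedLength are made up for this illustration; StandardCharsets is used so that no checked exception has to be handled.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf16Lengths {
    // Hypothetical helper: number of bytes produced when encoding s with cs
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String str = "Hadoop";
        // "UTF-16" writes a 2-byte byte-order mark before the text
        System.out.println("UTF-16:   " + encodedLength(str, StandardCharsets.UTF_16));
        // The LE and BE variants fix the byte order, so no mark is written
        System.out.println("UTF-16LE: " + encodedLength(str, StandardCharsets.UTF_16LE));
        System.out.println("UTF-16BE: " + encodedLength(str, StandardCharsets.UTF_16BE));
    }
}
```

This should print 14, 12 and 12, matching the lengths discussed in the question and the answer.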
EDIT: Use this program to look at the actual bytes generated by different encodings:
public class Test {
    public static void main(String args[]) throws Exception {
        // bytes in the first argument, encoded using second argument
        byte[] bs = args[0].getBytes(args[1]);
        System.err.println(bs.length + " bytes:");
        // print hex values of bytes and (if printable), the char itself
        char[] hex = "0123456789ABCDEF".toCharArray();
        for (int i = 0; i < bs.length; i++) {
            int b = (bs[i] < 0) ? bs[i] + 256 : bs[i];
            System.err.print(hex[b >> 4] + "" + hex[b & 0xf]
                + (!Character.isISOControl((char) b) ? "" + (char) b : ".")
                + ((i % 4 == 3) ? "\n" : " "));
        }
        System.err.println();
    }
}
For example, when running with UTF-8 as the JVM default encoding (under other default encodings, the characters shown for FE and FF would differ), the output is:
$ javac Test.java && java -cp . Test hello UTF-16
12 bytes:
FEþ FFÿ 00. 68h
00. 65e 00. 6Cl
00. 6Cl 00. 6Fo
And
$ javac Test.java && java -cp . Test hello UTF-16LE
10 bytes:
68h 00. 65e 00.
6Cl 00. 6Cl 00.
6Fo 00.
And
$ javac Test.java && java -cp . Test hello UTF-16BE
10 bytes:
00. 68h 00. 65e
00. 6Cl 00. 6Cl
00. 6Fo
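The dumps above show that plain "UTF-16" starts with the bytes FE FF. As a small sketch (the class BomRoundTrip and its two helpers are invented for this example), the following checks those two bytes and shows that decoding with the same charset returns the original string, so the mark does not end up in the text.

```java
import java.nio.charset.StandardCharsets;

public class BomRoundTrip {
    // Hypothetical helper: encode with "UTF-16" (big-endian, with byte-order mark)
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_16);
    }

    // Hypothetical helper: decode with "UTF-16"; the decoder reads the mark to pick the byte order
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        byte[] bs = encode("hello");
        // Mask with 0xFF because Java bytes are signed
        System.out.printf("first two bytes: %02X %02X%n", bs[0] & 0xFF, bs[1] & 0xFF);
        System.out.println("total bytes: " + bs.length);
        System.out.println("round trip equal: " + decode(bs).equals("hello"));
    }
}
```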
Answer by Evgeniy Dorofeev
String.getBytes() uses the platform's default encoding. Try this:
byte bt[] = str.getBytes("UTF-16");
Answer by Oleg Sklyar
I think this will help: "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" by Joel Spolsky
And this will help as well: "UTF-16 (16-bit Unicode Transformation Format) is a character encoding [...] The encoding is a variable-length encoding as code points are encoded with one or two 16-bit code units." (from Wikipedia)
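To illustrate the "one or two 16-bit code units" part of that quote, here is a small sketch comparing a character inside the Basic Multilingual Plane with one outside it. The class name SurrogatePairDemo and its helpers are made up for this example; U+1F600 is just one convenient supplementary code point.

```java
import java.nio.charset.StandardCharsets;

public class SurrogatePairDemo {
    // Number of 16-bit code units (what String.length() reports)
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Bytes in UTF-16BE, which adds no byte-order mark
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String bmp = "A";
        // U+1F600 lies outside the BMP, so UTF-16 needs a surrogate pair for it
        String supplementary = new String(Character.toChars(0x1F600));
        for (String s : new String[] { bmp, supplementary }) {
            System.out.println(codePoints(s) + " code point(s), "
                + codeUnits(s) + " code unit(s), "
                + utf16Bytes(s) + " byte(s)");
        }
    }
}
```

The first string takes one code unit (2 bytes) and the second takes two code units (4 bytes), although each is a single code point.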
Answer by Seelenvirtuose
According to the documentation of the String.getBytes() method, the string is encoded into a sequence of bytes using the platform's default charset.
I assume your platform's default charset is ISO-8859-1 (or a similar one-byte-per-character charset). Such charsets encode each character as a single byte.
If you want to specify the encoding, use the method String.getBytes(Charset) or String.getBytes(String).
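A short sketch of the difference between relying on the default charset and naming one explicitly. The class ExplicitCharsetDemo and the helper length are invented for this illustration, and the value printed for the default charset depends on the machine the code runs on.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {
    // Hypothetical helper: encoded length of s under an explicitly chosen charset
    static int length(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String str = "Hadoop";
        // Platform dependent: getBytes() without arguments uses this charset
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("default:    " + str.getBytes().length);
        // Explicit charsets give the same answer on every platform
        System.out.println("ISO-8859-1: " + length(str, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8:      " + length(str, StandardCharsets.UTF_8));
        System.out.println("UTF-16BE:   " + length(str, StandardCharsets.UTF_16BE));
        // A non-ASCII character (U+00E9) shows where the single-byte assumption stops holding
        String accented = "\u00e9";
        System.out.println("U+00E9 in ISO-8859-1: " + length(accented, StandardCharsets.ISO_8859_1));
        System.out.println("U+00E9 in UTF-8:      " + length(accented, StandardCharsets.UTF_8));
    }
}
```

For the ASCII-only string "Hadoop", ISO-8859-1 and UTF-8 both produce 6 bytes, so a length of 6 alone does not reveal which of the two is the default.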
About the 16-bit storage: this is how Java stores characters internally, and therefore strings as well. It is based on the original Unicode specification.
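The 16-bit size of a char can be checked through the public API, as in the sketch below. The class CharSizeDemo and the helper charBytes are made up for this example, and it only looks at the char type: how a particular JVM lays out a String in memory is an implementation detail that this code does not inspect.

```java
public class CharSizeDemo {
    // Hypothetical helper: bytes needed if every char of s takes one 16-bit unit
    static int charBytes(String s) {
        return s.length() * Character.BYTES;
    }

    public static void main(String[] args) {
        // Character.SIZE is the width of a char in bits, Character.BYTES in bytes
        System.out.println("bits per char:  " + Character.SIZE);
        System.out.println("bytes per char: " + Character.BYTES);
        // The 12 bytes the question expected for "Hadoop"
        System.out.println("chars in \"Hadoop\": " + "Hadoop".toCharArray().length);
        System.out.println("bytes as 16-bit chars: " + charBytes("Hadoop"));
    }
}
```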
Answer by Darshan Patel
For UTF-16 encoding, use str.getBytes("UTF-16"); but this gives a byte[] of length 14. See http://rosettacode.org/wiki/String_length for more details.