Original source: http://stackoverflow.com/questions/39955169/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Which encoding does Java use: UTF-8 or UTF-16?
Asked by Nitin Bhardwaj
I've already read the following posts:
- What is the Java's internal represention for String? Modified UTF-8? UTF-16?
- https://docs.oracle.com/javase/8/docs/api/java/lang/String.html
Now consider the code given below:
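import java.nio.charset.StandardCharsets;

// Class name is illustrative; the original snippet did not show one.
public class EncodingDemo {
    public static void main(String[] args) {
        printCharacterDetails("最");
    }

    public static void printCharacterDetails(String character) {
        // Code point of the first character, in hex.
        System.out.println("Unicode Value for " + character + "=" + Integer.toHexString(character.codePointAt(0)));
        // The no-argument getBytes() encodes with the platform's default charset.
        byte[] bytes = character.getBytes();
        System.out.println("The UTF-8 Character=" + character + " | Default: Number of Bytes=" + bytes.length);
        // Decodes the default-charset bytes as if they were UTF-16.
        String stringUTF16 = new String(bytes, StandardCharsets.UTF_16);
        System.out.println("The corresponding UTF-16 Character=" + stringUTF16 + " | UTF-16: Number of Bytes=" + stringUTF16.getBytes().length);
        System.out.println("----------------------------------------------------------------------------------------");
    }
}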
When I tried to debug the line character.getBytes() in the code above, the debugger took me into the getBytes() method of the String class and then into the static byte[] encode(char[] ca, int off, int len) method of the StringCoding class. The first line of the encode method (String csn = Charset.defaultCharset().name();) returned "UTF-8" as the default encoding during debugging. I expected it to be "UTF-16".
The output of the program is:
Unicode Value for 最=6700
The UTF-8 Character=最 | Default: Number of Bytes=3
The corresponding UTF-16 Character=? | UTF-16: Number of Bytes=6
When I converted it to UTF-16 explicitly in the program, it took 6 bytes to represent the character. Shouldn't it use 2 or 4 bytes for UTF-16? Why were 6 bytes used?
Where am I going wrong in my understanding? I use Ubuntu 14.04 and the locale command shows the following:
LANG=en_US.UTF-8
Does this mean that the JVM decides which encoding to use on the basis of the underlying OS, or does it use UTF-16 only? Please help me understand the concept.
Answered by RealSkeptic
Characters are graphical entities that are part of human culture. When a computer needs to handle text, it uses a representation of those characters in bytes. The exact representation used is called an encoding.
There are many encodings that can represent the same character - either through the Unicode character set, or through other character sets like the various ISO-8859 encodings, or the JIS X 0208.
Internally, Java uses UTF-16. This means that each character is represented by one or two 16-bit code units, that is, by two or four bytes. The character you were using, 最, has the code point U+6700, which is represented in UTF-16 as the byte 0x67 followed by the byte 0x00 (in big-endian order).
That's the internal encoding. You can't see it unless you dump your memory and look at the bytes in the dumped image.
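Although the raw bytes are not exposed, the 16-bit code units are visible through charAt(). The following is only a minimal sketch of that idea; the class name CodeUnitDemo and the helper codeUnitsHex are made up for illustration and are not part of any Java API.

```java
public class CodeUnitDemo {
    // Hypothetical helper: lists the UTF-16 code units (char values) of a string in hex.
    static String codeUnitsHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 最, written as an escape so the source file encoding does not matter.
        String s = "\u6700";
        System.out.println(s.length());      // prints 1: one code unit
        System.out.println(codeUnitsHex(s)); // prints 6700
    }
}
```

This shows the logical view of the string as a sequence of char values, not how the JVM actually lays the characters out in memory.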
But the method getBytes() does not return this internal representation. Its documentation says:
public byte[] getBytes()
Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.
The "platform's default charset" is what your locale variables say it is. That is, UTF-8. So it takes the UTF-16 internal representation, and converts that into a different representation: UTF-8.
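To see which charset the no-argument getBytes() picks up on a given machine, something like the sketch below may help. The class name DefaultCharsetDemo and the helper noArgMatchesDefault are illustrative names only; Charset.defaultCharset() and the getBytes overloads are the standard JDK calls.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DefaultCharsetDemo {
    // Hypothetical helper: true when getBytes() gives the same bytes as an
    // explicit encode with the platform's default charset.
    static boolean noArgMatchesDefault(String s) {
        return Arrays.equals(s.getBytes(), s.getBytes(Charset.defaultCharset()));
    }

    public static void main(String[] args) {
        String s = "\u6700"; // 最
        // The name printed here depends on the machine and JVM settings.
        System.out.println("Default charset: " + Charset.defaultCharset().name());
        System.out.println("getBytes() matches default: " + noArgMatchesDefault(s));
        // An explicit charset gives the same answer on every machine.
        System.out.println("UTF-8 byte count: " + s.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

On the asker's setup (LANG=en_US.UTF-8) the first line would be expected to report UTF-8, which matches what the debugger showed.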
Note that new String(bytes, StandardCharsets.UTF_16); does not "convert it to UTF-16 explicitly" as you assumed it does. This string constructor takes a sequence of bytes, which is supposed to be in the encoding that you have given in the second argument, and converts it to the UTF-16 representation of whatever characters those bytes represent in that encoding.
But you have given it a sequence of bytes encoded in UTF-8, and told it to interpret that as UTF-16. This is wrong, and you do not get the character - or the bytes - that you expect.
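A small sketch of where the 6 bytes most likely come from, assuming the default charset is UTF-8 as in the question. The class name MisdecodeDemo and the helper decodeUtf8BytesAsUtf16 are invented for this illustration.

```java
import java.nio.charset.StandardCharsets;

public class MisdecodeDemo {
    // Hypothetical helper: encodes as UTF-8, then decodes with the wrong
    // charset (UTF-16), which is what the question's code ends up doing.
    static String decodeUtf8BytesAsUtf16(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        String original = "\u6700"; // 最, three bytes in UTF-8: E6 9C 80
        String garbled = decodeUtf8BytesAsUtf16(original);

        // Show what characters the wrong decoding produced.
        System.out.println("chars in garbled string: " + garbled.length());
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
        // Encoding the garbled string as UTF-8 again gives the byte count from the question.
        System.out.println("UTF-8 bytes: " + garbled.getBytes(StandardCharsets.UTF_8).length);

        // Decoding with the charset that was used for encoding restores the text.
        String restored = new String(original.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
        System.out.println("round trip ok: " + restored.equals(original));
    }
}
```

The bytes E6 9C should decode to the single code unit U+E69C, and the leftover byte 80 is malformed on its own, so it should become the replacement character U+FFFD. Each of those two characters takes three bytes in UTF-8, hence six.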
You can't tell Java how to internally store strings. It always stores them as UTF-16. The constructor String(byte[],Charset) tells Java to create a UTF-16 string from an array of bytes that is supposed to be in the given character set. The method getBytes(Charset) tells Java to give you a sequence of bytes that represent the string in the given encoding (charset). And the method getBytes() without an argument does the same, but uses your platform's default character set for the conversion.
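As a rough illustration of those two directions, the sketch below encodes and decodes with an explicit charset each time. ExplicitCharsetDemo, roundTrip and byteCount are names chosen for this example only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {
    // Hypothetical helper: encode with a charset, then decode with the same one.
    static String roundTrip(String s, Charset cs) {
        byte[] encoded = s.getBytes(cs);
        return new String(encoded, cs);
    }

    // Hypothetical helper: number of bytes the string occupies in the given charset.
    static int byteCount(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String s = "\u6700"; // 最
        System.out.println(byteCount(s, StandardCharsets.UTF_8));    // prints 3
        System.out.println(byteCount(s, StandardCharsets.UTF_16BE)); // prints 2
        // The plain UTF_16 charset prepends a two-byte byte order mark when encoding.
        System.out.println(byteCount(s, StandardCharsets.UTF_16));   // prints 4
        System.out.println(roundTrip(s, StandardCharsets.UTF_16).equals(s)); // prints true
    }
}
```

The 2 and 4 here are the "2 or 4 bytes" the question expected; they only appear when the same charset is named on both the encoding and the decoding side.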
So you misunderstood what getBytes() gives you. It's not the internal representation. You can't get that directly. Only getBytes(StandardCharsets.UTF_16) will give you that, and only because you know that UTF-16 is the internal representation in Java. If a future version of Java decided to represent the characters in a different encoding, then getBytes(StandardCharsets.UTF_16) would not show you the internal representation.
Edit: in fact, Java 9 introduced just such a change in the internal representation of strings, where, by default, strings whose characters all fall in the ISO-8859-1 range are internally represented in ISO-8859-1, whereas strings with at least one character outside that range are internally represented in UTF-16 as before. So indeed, getBytes(StandardCharsets.UTF_16) no longer returns the internal representation.
Answered by Erwin Smout
As stated above, Java uses UTF-16 as the encoding for character data.
To which it may be added that the set of representable characters is limited to a proper subset of the entire Unicode character set. (I believe Java restricts its character set to the Unicode BMP, all of which fit in two bytes of UTF-16.)
So the encoding applied is indeed UTF-16, but the character set to which it is applied is a proper subset of the entire Unicode character set, and this guarantees that Java always uses two bytes per token in its internal String encodings.
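The BMP restriction is worth checking against a character outside the BMP. The sketch below suggests it does not hold: a Java String can contain a supplementary character, stored as a surrogate pair of two char values, so two bytes per character is not guaranteed. SupplementaryDemo and fromCodePoint are names made up for this example; U+1F600 is simply one convenient code point above U+FFFF.

```java
import java.nio.charset.StandardCharsets;

public class SupplementaryDemo {
    // Hypothetical helper: builds a string holding exactly one code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = fromCodePoint(0x1F600); // a code point outside the BMP
        System.out.println(s.length());                               // prints 2: two char code units
        System.out.println(s.codePointCount(0, s.length()));          // prints 1: one character
        System.out.println(Character.isHighSurrogate(s.charAt(0)));   // prints true
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length); // prints 4
    }
}
```

This is consistent with the "one or two" code units per character described in the first answer.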