What is Java's internal representation for String? Modified UTF-8? UTF-16?

Question

Asked by Johnny Lim

I searched for Java's internal representation of String and found two sources that both look reliable but contradict each other.

One is:

http://www.codeguru.com/cpp/misc/misc/multi-lingualsupport/article.php/c10451

and it says:

Java uses UTF-16 for the internal text representation and supports a non-standard modification of UTF-8 for string serialization.

The other is:

http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8

and it says:

Tcl also uses the same modified UTF-8[25] as Java for internal representation of Unicode data, but uses strict CESU-8 for external data.

Modified UTF-8? Or UTF-16? Which one is correct? And how many bytes does Java use for a char in memory?

Please let me know which one is correct and how many bytes it uses.

Answer 1

Accepted answer by Peter Lawrey

Java uses UTF-16 for the internal text representation

The representation of String, StringBuilder, etc. in Java is UTF-16.

https://docs.oracle.com/javase/8/docs/technotes/guides/intl/overview.html

How is text represented in the Java platform?
The Java programming language is based on the Unicode character set, and several libraries implement the Unicode standard. The primitive data type char in the Java programming language is an unsigned 16-bit integer that can represent a Unicode code point in the range U+0000 to U+FFFF, or the code units of UTF-16. The various types and classes in the Java platform that represent character sequences - char[], implementations of java.lang.CharSequence (such as the String class), and implementations of java.text.CharacterIterator - are UTF-16 sequences.

At the JVM level, if you are using -XX:+UseCompressedStrings (which is the default for some updates of Java 6), the actual in-memory representation can be 8-bit ISO-8859-1, but only for strings that do not need UTF-16 encoding.

http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html

and supports a non-standard modification of UTF-8 for string serialization.

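Serialized Strings are written in modified UTF-8 (the format used by DataOutput.writeUTF), which differs from standard UTF-8 in how it encodes U+0000 and supplementary characters.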
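
To see how the encoding used for serialized strings compares with standard UTF-8, here is a minimal sketch. It assumes DataOutputStream.writeUTF as a stand-in for the serialization path; the class name ModifiedUtf8Demo and its helper methods are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ModifiedUtf8Demo {
    // Encodes s with DataOutputStream.writeUTF and strips the 2-byte length prefix,
    // leaving only the modified UTF-8 payload.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] withLength = buf.toByteArray();
        return Arrays.copyOfRange(withLength, 2, withLength.length);
    }

    // Encodes s as standard UTF-8.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String nul = "\0";             // U+0000
        String emoji = "\uD83D\uDE00"; // U+1F600, a supplementary character

        // U+0000: one byte in standard UTF-8, two bytes (C0 80) in modified UTF-8.
        System.out.println(standardUtf8(nul).length + " vs " + modifiedUtf8(nul).length);
        // U+1F600: four bytes in standard UTF-8, six bytes in modified UTF-8
        // because each surrogate is encoded separately as three bytes.
        System.out.println(standardUtf8(emoji).length + " vs " + modifiedUtf8(emoji).length);
    }
}
```

For plain ASCII text the two encodings produce identical bytes, which is probably why the difference is easy to overlook.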

And how many bytes does Java use for a char in memory?

A char is always two bytes, if you ignore the need for padding in an Object.

Note: a code point (whose value can be greater than 65535) takes one or two chars, i.e. 2 or 4 bytes.
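
The difference between chars and code points can be checked with standard library calls. The following is a small sketch; the class CodePointDemo and the helper utf16Bytes are illustrative names, and the byte count covers only the UTF-16 code units, not object headers or padding.

```java
class CodePointDemo {
    // Bytes needed for the string's UTF-16 code units (2 bytes per char).
    static int utf16Bytes(String s) {
        return s.length() * Character.BYTES;
    }

    public static void main(String[] args) {
        String bmp = "A";                      // U+0041, fits in one char
        String supplementary = "\uD83D\uDE00"; // U+1F600, needs a surrogate pair

        // A char is 16 bits, i.e. 2 bytes.
        System.out.println(Character.SIZE + " bits, " + Character.BYTES + " bytes");

        // One code point, one char, 2 bytes.
        System.out.println(bmp.length() + " char, " + utf16Bytes(bmp) + " bytes");

        // One code point, two chars, 4 bytes.
        System.out.println(supplementary.codePointCount(0, supplementary.length())
                + " code point, " + supplementary.length() + " chars, "
                + utf16Bytes(supplementary) + " bytes");
    }
}
```

Note that String.length() counts chars (UTF-16 code units), so it reports 2 for a single supplementary character.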

Answer 2

Answered by Stephen C

Prior to Java 9, the standard in-memory representation for a Java String is UTF-16 code units held in a char[]. Modified UTF-8 is used in other contexts, e.g. in ".class" files and the object serialization format.

You can confirm this by looking at the source code of the java.lang.String class.

With Java 6 update 21 and later, there was a non-standard option (-XX:+UseCompressedStrings) to enable compressed strings. This feature was removed in Java 7.

For Java 9 and later, the behavior of String has been changed to use a compact representation by default. The java command documentation now says this:

-XX:-CompactStrings
Disables the Compact Strings feature. By default, this option is enabled. When this option is enabled, Java Strings containing only single-byte characters are internally represented and stored as single-byte-per-character Strings using ISO-8859-1 / Latin-1 encoding. This reduces, by 50%, the amount of space required for Strings containing only single-byte characters. For Java Strings containing at least one multibyte character: these are represented and stored as 2 bytes per character using UTF-16 encoding. Disabling the Compact Strings feature forces the use of UTF-16 encoding as the internal representation for all Java Strings.

Note that neither "compressed" nor "compact" strings use (or used) UTF-8 encoding.
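
The rule described in the quoted documentation can be modelled in a few lines. This is only a sketch of the criterion (every char fits in one byte), not the JDK's actual implementation, and the JVM's real in-memory layout cannot be observed through the public String API. The class CompactStringSketch and its methods are hypothetical names.

```java
class CompactStringSketch {
    // True if every char is in the Latin-1 range (U+0000 to U+00FF),
    // i.e. the string could be stored as one byte per character.
    static boolean fitsLatin1(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    // Estimated size of the character data alone (no object header or padding)
    // under the scheme the documentation describes: 1 byte per char for
    // Latin-1 strings, 2 bytes per char (UTF-16) otherwise.
    static int estimatedPayloadBytes(String s) {
        return fitsLatin1(s) ? s.length() : s.length() * 2;
    }

    public static void main(String[] args) {
        System.out.println(estimatedPayloadBytes("hello"));       // ASCII only
        System.out.println(estimatedPayloadBytes("h\u00E9llo"));  // U+00E9 is still Latin-1
        System.out.println(estimatedPayloadBytes("h\u20ACllo"));  // U+20AC forces UTF-16
    }
}
```

A single character outside Latin-1 switches the whole string to two bytes per char, which matches the "at least one multibyte character" wording in the quote.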

Answer 3

Answered by belgther

The size of a char is 2 bytes.

Therefore, I would say that Java uses UTF-16 for internal String representation.

Answer 4

Answered by Andreas Johansson

UTF-16.

From http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp:

How is text represented in the Java platform?
The Java programming language is based on the Unicode character set, and several libraries implement the Unicode standard. The primitive data type char in the Java programming language is an unsigned 16-bit integer that can represent a Unicode code point in the range U+0000 to U+FFFF, or the code units of UTF-16. The various types and classes in the Java platform that represent character sequences - char[], implementations of java.lang.CharSequence (such as the String class), and implementations of java.text.CharacterIterator - are UTF-16 sequences.

Answer 5

Answered by AlexR

Java stores strings internally as UTF-16 and uses 2 bytes for each char (UTF-16 code unit).

Answer 6

Answered by mohan.reddy8

Java is available in 18 international languages and follows the Unicode character set, which contains all the characters used in those languages. Java follows UTF-16, whose code units are 16 bits wide (65536 possible values), so the size of a char in Java is 2 bytes.
