What is the character encoding of String in Java?

Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/4453269/

Time: 2020-08-14 17:44:17  Source: igfitidea

What is the character encoding of String in Java?

Tags: java, string, character-encoding

Asked by

I am actually confused regarding the encoding of strings in Java. I have a couple of questions. Please help me if you know the answer to them:

1) What is the native encoding of Java strings in memory? When I write String a = "Hello", in which format will it be stored? Since Java is machine independent, I don't think the system will do the encoding.

2) I read on the net that "UTF-16" is the default encoding, but I got confused because when I write int a = 'c' I get the number of the character in the ASCII table. So are ASCII and UTF-16 the same?

3) Also, I'm not sure what the storage of a string in memory depends on: the OS, the language?

Accepted answer by David R Tribble

1) Strings are objects, which typically contain a char array and the string's length. The character array is usually implemented as a contiguous array of 16-bit words, each one containing a Unicode character in native byte order.

2) Assigning a character value to an integer converts the 16-bit Unicode character code into its integer equivalent. Thus 'c', which is U+0063, becomes 0x0063, or 99.

3) Since each String is an object, it contains information beyond its class members (e.g., a class descriptor word, a lock/semaphore word, etc.).
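
To make point 2 concrete, here is a small sketch of the char-to-int conversion. The class name CharToIntDemo and the helper codeOf are made up for illustration; the conversion itself is ordinary Java widening.

```java
class CharToIntDemo {
    // Hypothetical helper: widening a char to int yields its 16-bit code unit value.
    static int codeOf(char c) {
        return c;
    }

    public static void main(String[] args) {
        int a = 'c';
        System.out.println(a);                       // 99
        System.out.println(Integer.toHexString(a));  // 63, i.e. U+0063
        // A character outside ASCII: the euro sign, U+20AC.
        System.out.println(codeOf('\u20ac'));        // 8364
    }
}
```

For characters in the ASCII range the number matches the ASCII table only because Unicode assigns those characters the same values.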

ADDENDUM
The object contents depend on the JVM implementation (which determines the inherent overhead associated with each object), and how the class is actually coded (i.e., some libraries may be more efficient than others).

EXAMPLE
A typical implementation will allocate an overhead of two words per object instance (for the class descriptor/pointer, and a semaphore/lock control word); a String object also contains an int length and a char[] array reference. The actual character contents of the string are stored in a second object, the char[] array, which in turn is allocated two words, plus an array length word, plus as many 16-bit char elements as needed for the string (plus any extra chars that were left hanging around when the string was created).

ADDENDUM 2
The claim that one char represents one Unicode character holds only in most cases. It would imply UCS-2 encoding, and it was true before 2005. But Unicode has since grown, and Strings have to be encoded using UTF-16, where, alas, a single Unicode character may use two chars in a Java String.
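
A quick way to see this is to compare length() with codePointCount() for a character outside the Basic Multilingual Plane. This is only a sketch; the class name SurrogateDemo and the choice of U+1D11E (MUSICAL SYMBOL G CLEF) are picked for illustration.

```java
class SurrogateDemo {
    // U+1D11E lies above U+FFFF, so UTF-16 needs a surrogate pair for it.
    static final String CLEF = new String(Character.toChars(0x1D11E));

    public static void main(String[] args) {
        System.out.println(CLEF.length());                              // 2 chars
        System.out.println(CLEF.codePointCount(0, CLEF.length()));      // 1 Unicode character
        System.out.println(Character.isHighSurrogate(CLEF.charAt(0)));  // true
        System.out.println(Character.isLowSurrogate(CLEF.charAt(1)));   // true
        System.out.println(Integer.toHexString(CLEF.codePointAt(0)));   // 1d11e
    }
}
```

So code that walks a String char by char can split such a character in half; iterating by code point (for example with codePoints()) avoids that.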

Take a look at the actual source code for Apache's implementation, e.g. at:
http://www.docjar.com/html/api/java/lang/String.java.html

Answer by Laurence Gonsalves

  1. Java stores strings as UTF-16 internally.

  2. "default encoding" isn't quite right. Java stores strings as UTF-16 internally, but the encoding used externally, the "system default encoding", varies from platform to platform, and can even be altered by things like environment variables on some platforms.

    ASCII is a subset of Latin 1, which is a subset of Unicode. UTF-16 is a way of encoding Unicode. So if you perform your int i = 'x' test for any character that falls in the ASCII range, you'll get the ASCII value. UTF-16 can represent a lot more characters than ASCII, however.

  3. From the java.lang.Character docs:

    The Java 2 platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes.

    So it's defined as part of the Java 2 platform that UTF-16 is used for these classes.
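
The difference between the internal representation and an external encoding shows up as soon as a String is turned into bytes. The sketch below assumes a sample string with one non-ASCII letter; the class name EncodingDemo and the helper encodedLength are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Hypothetical helper: number of bytes the string occupies in a given external encoding.
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String s = "h\u00e9llo"; // "héllo": 5 chars, one of them outside ASCII

        System.out.println(s.length());                                    // 5
        System.out.println(encodedLength(s, StandardCharsets.ISO_8859_1)); // 5
        System.out.println(encodedLength(s, StandardCharsets.UTF_8));      // 6
        System.out.println(encodedLength(s, StandardCharsets.UTF_16BE));   // 10

        // The platform default varies from system to system, so no output is claimed here.
        System.out.println(Charset.defaultCharset());
    }
}
```

Passing an explicit Charset to getBytes and to the String constructor keeps the result independent of the platform default.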

Answer by Ralph

While this doesn't answer your question, it is worth noting that in Java bytecode (the class file), string constants are stored in a modified form of UTF-8. http://java.sun.com/docs/books/jvms/second_edition/html/ClassFile.doc.html
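
The class file format uses a modified UTF-8 rather than standard UTF-8, and DataOutputStream.writeUTF is documented to write modified UTF-8 as well, so it can serve as a rough stand-in to see the differences. This is a sketch, not a class file parser; the class name ModifiedUtf8Demo and the helper modifiedUtf8 are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ModifiedUtf8Demo {
    // Hypothetical helper: encode with writeUTF and drop its 2-byte length prefix.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    public static void main(String[] args) {
        // NUL is written as two bytes (0xC0 0x80) instead of a single zero byte.
        System.out.println(modifiedUtf8("\0").length);                     // 2
        System.out.println("\0".getBytes(StandardCharsets.UTF_8).length);  // 1

        // A supplementary character is written as two 3-byte surrogates, not one 4-byte sequence.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(modifiedUtf8(clef).length);                     // 6
        System.out.println(clef.getBytes(StandardCharsets.UTF_8).length);  // 4
    }
}
```

For plain ASCII text other than NUL the two encodings produce identical bytes.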

Answer by LaGrandMere

Edit: thanks to LoadMaster for helping me correct my answer :)

1) All internal String processing is done in UTF-16.

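2) ASCII is a subset of Unicode: the first 128 code points have the same values as in ASCII, so the UTF-16 code unit for an ASCII character has the same numeric value, although it occupies 16 bits rather than one byte.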

3) Internally, Java uses UTF-16. For the rest, it depends on where you are, yes.
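
On point 2, it may help to look at the actual bytes: the numeric value of an ASCII character is the same in UTF-16, but the encoded form is not byte-for-byte identical. A small sketch, with the class name AsciiVsUtf16Demo made up for illustration:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AsciiVsUtf16Demo {
    public static void main(String[] args) {
        byte[] ascii = "c".getBytes(StandardCharsets.US_ASCII);
        byte[] utf16 = "c".getBytes(StandardCharsets.UTF_16BE); // big-endian, no byte order mark

        System.out.println(Arrays.toString(ascii)); // [99]
        System.out.println(Arrays.toString(utf16)); // [0, 99]
        System.out.println((int) "c".charAt(0));    // 99
    }
}
```

The char value is 99 either way; only the number of bytes used to write it out differs.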
