Java - What are characters, code points and surrogates? What difference is there between them?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/23979676/

Time: 2020-08-14 09:30:31  Source: igfitidea

Tags: java, character-encoding, character

Asked by Alium Britt

I'm trying to find an explanation of the terms "character", "code point" and "surrogate", and while these terms aren't limited to Java, if there are any language-specific differences I'd like the explanation as it relates to Java.

I've found some information about the differences between characters and code points, characters being what is displayed for human users, and code points being a value encoding that specific character, but I have no idea about surrogates. What are surrogates, and how are they different from characters and code points? Do I have the right definitions for characters and code points?

In another thread about stepping through a string as an array of characters, the specific comment that prompted this question was "Note that this technique gives you characters, not code points, meaning you may get surrogates." I didn't really understand, and rather than create a long series of comments on a 5-year-old question I thought it would be best to ask for clarification in a new question.

Accepted answer by Cephalopod

To represent text in computers, you have to solve two things: first, you have to map symbols to numbers; then, you have to represent a sequence of those numbers with bytes.

A code point is a number that identifies a symbol. Two well-known standards for assigning numbers to symbols are ASCII and Unicode. ASCII defines 128 symbols. Unicode currently defines 109384 symbols, which is far more than 2^16 (65536).

Furthermore, ASCII specifies that number sequences are represented one byte per number, while Unicode specifies several possibilities, such as UTF-8, UTF-16, and UTF-32.
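As a rough illustration of how the same code point takes up a different number of bytes depending on the encoding, here is a minimal sketch. The class name `EncodingSizes` and its helper methods are made up for this example; the UTF-32 size is computed as 4 bytes per code point rather than produced by an actual encoder.

```java
import java.nio.charset.StandardCharsets;

class EncodingSizes {
    // Bytes needed to store the text in UTF-8 (1 to 4 bytes per code point).
    static int utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes needed in UTF-16, big-endian, without a byte-order mark
    // (2 or 4 bytes per code point).
    static int utf16Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // UTF-32 always uses 4 bytes per code point, so this is computed
    // from the code point count instead of calling an encoder.
    static int utf32Bytes(String text) {
        return 4 * text.codePointCount(0, text.length());
    }

    public static void main(String[] args) {
        // U+0041, U+00E9, U+20AC, and U+1F600 (written as a surrogate pair).
        String[] samples = { "A", "\u00E9", "\u20AC", "\uD83D\uDE00" };
        for (String s : samples) {
            System.out.printf("U+%04X: UTF-8=%d, UTF-16=%d, UTF-32=%d bytes%n",
                    s.codePointAt(0), utf8Bytes(s), utf16Bytes(s), utf32Bytes(s));
        }
    }
}
```

Only the last sample lies above U+FFFF, and it is the only one that needs four bytes in UTF-16.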

When you use an encoding that uses fewer bits per character than are needed to represent all possible values (such as UTF-16, which uses 16 bits), you need some workaround.

Thus, surrogates are 16-bit values that, used in pairs, represent symbols that do not fit into a single two-byte value.

Java uses UTF-16 internally to represent text.

In particular, a char (character) is an unsigned two-byte value that contains a UTF-16 code unit.
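A small sketch of what this means in practice: `String.length()` counts 16-bit char values, not symbols. The class name `CharBasics` and its two helpers are invented for illustration; the sample string is U+1D11E (MUSICAL SYMBOL G CLEF) written as its surrogate pair.

```java
class CharBasics {
    // Number of 16-bit char values (UTF-16 code units) in the text.
    static int charCount(String text) {
        return text.length();
    }

    // Number of Unicode code points in the text.
    static int codePointCount(String text) {
        return text.codePointCount(0, text.length());
    }

    public static void main(String[] args) {
        // A char is 16 bits wide and unsigned: its range is 0 to 65535.
        System.out.println("bits per char: " + Character.SIZE);
        System.out.println("min char value: " + (int) Character.MIN_VALUE);
        System.out.println("max char value: " + (int) Character.MAX_VALUE);

        // One symbol above U+FFFF occupies two char values.
        String clef = "\uD834\uDD1E";
        System.out.println("chars: " + charCount(clef));            // prints "chars: 2"
        System.out.println("code points: " + codePointCount(clef)); // prints "code points: 1"
    }
}
```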

If you want to learn more about Java and Unicode, I can recommend this newsletter: Part 1, Part 2

Answer by Johan Sjöberg

To begin with, Unicode is a standard that tries to define and map all individual characters from all languages, from English letters to Chinese, numbers, symbols, etc.

Basically, Unicode has a long list of numbered characters, where the code point refers to the numbering.

In short

  • Characters are the individual tokens in a text, whether letter, number or symbol.
  • A code point refers to the number assigned to a token in the Unicode standard.
  • Unicode houses so many characters that, when encoded with UTF-16, not all of them fit in the allotted space of a single Java char.
  • Surrogate pair is the term used to say that one character is listed so high in the Unicode table that it needs a pair of char values to represent it.
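The comment quoted in the question ("this technique gives you characters, not code points") can be illustrated with a short sketch that walks the same string both ways. The class `StringWalk` and its methods are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;

class StringWalk {
    // Steps through the string one char at a time; a symbol above U+FFFF
    // shows up as two separate surrogate values.
    static List<String> byChar(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            out.add(String.format("%04X", (int) text.charAt(i)));
        }
        return out;
    }

    // Steps through the string one code point at a time, advancing by
    // one or two chars depending on the code point just read.
    static List<String> byCodePoint(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            out.add(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "a\uD83D\uDE00b"; // 'a', U+1F600, 'b'
        System.out.println(byChar(text));      // prints [0061, D83D, DE00, 0062]
        System.out.println(byCodePoint(text)); // prints [U+0061, U+1F600, U+0062]
    }
}
```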

Answer by Stephen C

Code point typically refers to a Unicode codepoint. The Unicode glossary says this:

Codepoint(1): Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF₁₆.

In Java, a character (char) is an unsigned 16 bit value; i.e. 0 to FFFF (hexadecimal).

As you can see, there are more Unicode codepoints than can be represented as Java characters. And yet Java needs to be able to represent text using all valid Unicode codepoints.

The way that Java deals with this is to represent codepoints that are larger than FFFF as a pair of characters (code units); i.e. a surrogate pair. These encode a Unicode codepoint that is larger than FFFF as a pair of 16 bit values. This uses the fact that a subrange of the Unicode code space (i.e. U+D800 to U+DFFF) is reserved for representing surrogate pairs. The technical details are here.
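The arithmetic behind a surrogate pair is short enough to sketch by hand. The class `SurrogateMath` and its `encode`/`decode` methods are illustrative names only; in real code the standard `Character.toChars` and `Character.toCodePoint` do the same job, and `main` compares against them.

```java
import java.util.Arrays;

class SurrogateMath {
    // Splits a code point above U+FFFF into a high and a low surrogate.
    static char[] encode(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                // 20 significant bits remain
        char high = (char) (0xD800 + (offset >> 10));    // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }

    // Recombines a high and a low surrogate into the original code point.
    static int decode(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        int cp = 0x1F600;
        char[] pair = encode(cp);
        System.out.printf("U+%X -> %04X %04X%n", cp, (int) pair[0], (int) pair[1]);
        // The hand-written versions should agree with the standard library.
        System.out.println(Arrays.equals(pair, Character.toChars(cp)));
        System.out.println(decode(pair[0], pair[1]) == Character.toCodePoint(pair[0], pair[1]));
    }
}
```

For U+1F600 the offset is 0xF600, so the high surrogate is 0xD800 + 0x3D = 0xD83D and the low surrogate is 0xDC00 + 0x200 = 0xDE00.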

The proper term for the encoding that Java is using is the UTF-16 Encoding Form.

Another term that you might see is code unit, which is the minimum representational unit used in a particular encoding. In UTF-16 the code unit is 16 bits, which corresponds to a Java char. Other encodings (e.g. UTF-8, ISO 8859-1, etc.) have 8 bit code units, and UTF-32 has a 32 bit code unit.
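To make the notion of a code unit concrete, the following sketch prints the individual code units of a string in UTF-8 (8-bit units) and UTF-16 (16-bit units). `CodeUnits` and its methods are names invented for this example.

```java
import java.nio.charset.StandardCharsets;

class CodeUnits {
    // The UTF-8 code units (bytes) of the text, as two-digit hex values.
    static String utf8Units(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF)); // mask to treat the byte as unsigned
        }
        return sb.toString();
    }

    // The UTF-16 code units (Java chars) of the text, as four-digit hex values.
    static String utf16Units(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String euro = "\u20AC";        // U+20AC
        String face = "\uD83D\uDE00";  // U+1F600
        System.out.println(utf8Units(euro));  // prints E2 82 AC
        System.out.println(utf16Units(euro)); // prints 20AC
        System.out.println(utf8Units(face));  // prints F0 9F 98 80
        System.out.println(utf16Units(face)); // prints D83D DE00
    }
}
```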

The term character has many meanings. It means all sorts of things in different contexts. The Unicode glossary gives 4 meanings for Character as follows:

Character. (1) The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape (see also glyph), though in code tables some form of visual representation is essential for the reader's understanding.

Character. (2) Synonym for abstract character. (Abstract Character. A unit of information used for the organization, control, or representation of textual data.)

Character. (3) The basic unit of encoding for the Unicode character encoding.

Character. (4) The English name for the ideographic written elements of Chinese origin. [See ideograph (2).]

And then there is the Java-specific meaning for character; i.e. a 16 bit unsigned number (of type char) that may or may not represent a complete or partial Unicode codepoint in UTF-16 encoding.
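A minimal sketch of the "complete or partial" distinction, using the standard `Character.isHighSurrogate` and `Character.isLowSurrogate` checks. The class `CharRoles` and the wording of the returned labels are made up for this example.

```java
class CharRoles {
    // Says whether a single char is a whole code point on its own
    // or only one half of a surrogate pair.
    static String describe(char c) {
        if (Character.isHighSurrogate(c)) {
            return "high surrogate (first half of a pair)";
        }
        if (Character.isLowSurrogate(c)) {
            return "low surrogate (second half of a pair)";
        }
        return "complete BMP code point";
    }

    public static void main(String[] args) {
        String text = "A\uD83D\uDE00"; // 'A' followed by U+1F600
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            System.out.printf("%04X: %s%n", (int) c, describe(c));
        }
    }
}
```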

Answer by nosid

You can find a short explanation in the Javadoc for the class java.lang.Character:

Unicode Character Representations

The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode Standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value. [..]

The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
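The ranges described in the Javadoc can be probed with a few static methods of `java.lang.Character`. The sketch below is only an illustration; `CodePointRanges` and `classify` are hypothetical names.

```java
class CodePointRanges {
    // Sorts an int into the three ranges the Javadoc describes.
    static String classify(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            return "invalid";       // outside U+0000 to U+10FFFF
        }
        if (Character.isBmpCodePoint(codePoint)) {
            return "BMP";           // U+0000 to U+FFFF, one char
        }
        return "supplementary";     // U+10000 to U+10FFFF, two chars
    }

    public static void main(String[] args) {
        int[] samples = { 0x41, 0xFFFF, 0x10000, 0x10FFFF, 0x110000 };
        for (int cp : samples) {
            System.out.printf("%X: %s%n", cp, classify(cp));
        }
        // Boundaries of the two surrogate ranges, as defined by the platform.
        System.out.printf("high surrogates: %04X-%04X%n",
                (int) Character.MIN_HIGH_SURROGATE, (int) Character.MAX_HIGH_SURROGATE);
        System.out.printf("low surrogates: %04X-%04X%n",
                (int) Character.MIN_LOW_SURROGATE, (int) Character.MAX_LOW_SURROGATE);
    }
}
```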

In other words:

A code point usually represents a single character. Originally, the values of type char matched exactly the Unicode code points. This encoding was also known as UCS-2.

For that reason, char was defined as a 16-bit type. However, there are currently more than 2^16 characters in Unicode. To support the whole character set, the encoding was changed from the fixed-length encoding UCS-2 to the variable-length encoding UTF-16. Within this encoding, each code point is represented by a single char or by two chars. In the latter case, the two chars are called a surrogate pair.

UTF-16 was defined in such a way that there is no difference between text encoded with UTF-16 and UCS-2 if all code points are below 2^16 (that is, within the BMP). That means char can be used to represent some but not all characters. If a character cannot be represented within a single char, the term char is misleading, because it is just used as a 16-bit word.
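One way to check whether a given string stays within the range where one char equals one code point is sketched below. `BmpCheck` and `fitsInSingleChars` are invented names, and the sketch assumes Java 8 or later for `String.codePoints()`. Note that an unpaired surrogate is reported by `codePoints()` as a code point in the BMP range, so this check does not flag malformed text.

```java
class BmpCheck {
    // True when every code point of the text lies in the BMP, i.e. each
    // one is stored in exactly one char and no surrogate pairs occur.
    static boolean fitsInSingleChars(String text) {
        return text.codePoints().allMatch(Character::isBmpCodePoint);
    }

    public static void main(String[] args) {
        String bmpOnly = "h\u00E9llo \u20AC";  // all code points below U+10000
        String mixed = "hi \uD83D\uDE00";      // ends with U+1F600

        System.out.println(fitsInSingleChars(bmpOnly)); // prints true
        System.out.println(fitsInSingleChars(mixed));   // prints false

        // For BMP-only text, char count and code point count agree.
        System.out.println(bmpOnly.length() == bmpOnly.codePointCount(0, bmpOnly.length()));
    }
}
```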
