Notice: this page is taken from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/2533097/

Java Unicode encoding

java unicode character-encoding

Asked by Marcus Leon

A Java char is 2 bytes (max size of 65,536) but there are 95,221 Unicode characters. Does this mean that you can't handle certain Unicode characters in a Java application?

Does this boil down to what character encoding you are using?

Accepted answer by kennytm

You can handle them all if you're careful enough.

Java's char is a UTF-16 code unit. A character whose code point is greater than 0xFFFF is encoded with 2 chars (a surrogate pair).
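
To make the surrogate-pair idea concrete, here is a minimal sketch using only standard java.lang.Character methods. The class name SurrogatePairSketch and its helper methods are made up for illustration; they are not part of the answer or of any library.

```java
final class SurrogatePairSketch {

    // Number of Java chars (UTF-16 code units) needed for one code point.
    static int charsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    // Hex form of the UTF-16 code units that encode one code point.
    static String utf16Units(int codePoint) {
        StringBuilder hex = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%04X", (int) unit));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        int latinA = 0x0061;   // inside the BMP, fits in one char
        int cjk = 0x20000;     // above 0xFFFF, needs a surrogate pair
        System.out.println(charsNeeded(latinA) + " char:  " + utf16Units(latinA));
        System.out.println(charsNeeded(cjk) + " chars: " + utf16Units(cjk));

        // The two halves recombine into the original code point.
        char[] pair = Character.toChars(cjk);
        System.out.println(Character.toCodePoint(pair[0], pair[1]) == cjk);
    }
}
```

The point of the sketch is that the pair is just an encoding detail: Character.toChars splits a code point into code units and Character.toCodePoint puts them back together.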

See http://www.oracle.com/us/technologies/java/supplementary-142654.html for how to handle those characters in Java.

(BTW, in Unicode 5.2 there are 107,154 assigned characters out of 1,114,112 slots.)

Answer by Brian Agnew

From the OpenJDK7 documentation for String:

A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.

Answer by Michael Borgwardt

Java uses UTF-16. A single Java char can only represent characters from the basic multilingual plane. Other characters have to be represented by a surrogate pair of two chars. This is reflected by API methods such as String.codePointAt().
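
As an illustration of how String.codePointAt() is typically used, the sketch below walks a string one code point at a time. The class CodePointWalkSketch and the method codePointsOf are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointWalkSketch {

    // Collects the code points of a string, stepping 1 or 2 chars at a time.
    static List<Integer> codePointsOf(String text) {
        List<Integer> result = new ArrayList<>();
        int index = 0;
        while (index < text.length()) {
            int codePoint = text.codePointAt(index);   // reads a surrogate pair as one value
            result.add(codePoint);
            index += Character.charCount(codePoint);   // advance by 2 for supplementary characters
        }
        return result;
    }

    public static void main(String[] args) {
        // 'a', then U+20000 written as a surrogate pair, then 'b'.
        String text = "a\uD840\uDC00b";
        for (int codePoint : codePointsOf(text)) {
            System.out.println(String.format("U+%04X", codePoint));
        }
    }
}
```

Advancing the index by Character.charCount rather than by 1 is what keeps the loop from landing in the middle of a surrogate pair.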

And yes, this means that a lot of Java code will break in one way or another when used with characters outside the basic multilingual plane.
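
One plausible way such code breaks is truncation by char index, which can cut a surrogate pair in half. The sketch below contrasts that with a cut at code point boundaries; TruncateSketch and both method names are invented for this example, and this is only one of several possible fixes.

```java
final class TruncateSketch {

    // Naive: cuts after maxChars UTF-16 code units, possibly in the middle of a pair.
    static String truncateByChars(String text, int maxChars) {
        return text.substring(0, Math.min(maxChars, text.length()));
    }

    // Safer: cuts after maxCodePoints whole code points.
    static String truncateByCodePoints(String text, int maxCodePoints) {
        int available = text.codePointCount(0, text.length());
        int keep = Math.min(maxCodePoints, available);
        int endIndex = text.offsetByCodePoints(0, keep);
        return text.substring(0, endIndex);
    }

    public static void main(String[] args) {
        String text = "\uD840\uDC00x";   // U+20000 followed by 'x'
        String naive = truncateByChars(text, 1);
        String safe = truncateByCodePoints(text, 1);

        // The naive result ends in an unpaired high surrogate.
        System.out.println(Character.isHighSurrogate(naive.charAt(naive.length() - 1)));
        // The safe result keeps the whole character.
        System.out.println(safe.equals("\uD840\uDC00"));
    }
}
```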

Answer by Pascal Thivent

Have a look at the "Unicode 4.0 support in J2SE 1.5" article to learn more about the tricks invented by Sun to provide support for all Unicode 4.0 code points.

In summary, you'll find the following changes for Unicode 4.0 in Java 1.5:

  • char is a UTF-16 code unit, not a code point
  • new low-level APIs use an int to represent a Unicode code point
  • high-level APIs have been updated to understand surrogate pairs
  • a preference towards char sequence APIs instead of char-based methods

Since Java doesn't have 32 bit chars, I'll let you judge if we can call this good Unicode support.

Answer by leonbloy

To add to the other answers, some points to remember:

  • A Java char always takes 16 bits.

  • A Unicode character, when encoded as UTF-16, takes "almost always" (not always) 16 bits: that's because there are more than 64K Unicode characters. Hence, a Java char is NOT a Unicode character (though "almost always" it is).

  • "Almost always", above, means the first 64K code points of Unicode, range 0x0000 to 0xFFFF (BMP), which take 16 bits in the UTF-16 encoding.

  • A non-BMP ("rare") Unicode character is represented as two Java chars (surrogate representation). This also applies to the literal representation as a string: for example, the character U+20000 is written as "\uD840\uDC00".

  • Corollary: string.length() returns the number of Java chars, not of Unicode characters. A string that has just one "rare" Unicode character (e.g. U+20000) would return length() = 2. The same consideration applies to any method that deals with char sequences.

  • Java has little intelligence for dealing with non-BMP Unicode characters as a whole. There are some utility methods that treat characters as code points, represented as ints, e.g. Character.isLetter(int ch). Those are the real fully-Unicode methods.
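
The length() corollary can be checked with a short sketch. LengthSketch and codePointLength are hypothetical names; String.codePointCount is the standard method that counts code points instead of chars.

```java
final class LengthSketch {

    // Counts whole Unicode code points rather than UTF-16 code units.
    static int codePointLength(String text) {
        return text.codePointCount(0, text.length());
    }

    public static void main(String[] args) {
        String rare = "\uD840\uDC00";                // the single character U+20000
        System.out.println(rare.length());           // 2 (Java chars)
        System.out.println(codePointLength(rare));   // 1 (Unicode character)

        String mixed = "ab" + rare;
        System.out.println(mixed.length() + " chars, " + codePointLength(mixed) + " code points");
    }
}
```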

Answer by Rose Perrone

Here's Oracle's documentation on Unicode Character Representations. Or, if you prefer, a more thorough documentation here.

The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value. (Refer to the definition of the U+n notation in the Unicode standard.)

The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java 2 platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).

A char value, therefore, represents Basic Multilingual Plane (BMP) code points, including the surrogate code points, or code units of the UTF-16 encoding. An int value represents all Unicode code points, including supplementary code points. The lower (least significant) 21 bits of int are used to represent Unicode code points and the upper (most significant) 11 bits must be zero. Unless otherwise specified, the behavior with respect to supplementary characters and surrogate char values is as follows:

  • The methods that only accept a char value cannot support supplementary characters. They treat char values from the surrogate ranges as undefined characters. For example, Character.isLetter('\uD840') returns false, even though this specific value if followed by any low-surrogate value in a string would represent a letter.
  • The methods that accept an int value support all Unicode characters, including supplementary characters. For example, Character.isLetter(0x2F81A) returns true because the code point value represents a letter (a CJK ideograph).
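
The difference between the char and int overloads can be seen in a small sketch that checks the same string both ways. IsLetterSketch and its two methods are hypothetical; CharSequence.codePoints() requires Java 8 or later.

```java
final class IsLetterSketch {

    // Checks each UTF-16 code unit on its own, so surrogates are judged in isolation.
    static boolean allLettersByChar(String text) {
        for (char unit : text.toCharArray()) {
            if (!Character.isLetter(unit)) {
                return false;
            }
        }
        return true;
    }

    // Checks each whole code point using the int overload of Character.isLetter.
    static boolean allLettersByCodePoint(String text) {
        return text.codePoints().allMatch(codePoint -> Character.isLetter(codePoint));
    }

    public static void main(String[] args) {
        // U+2F81A, the supplementary CJK ideograph used in the quoted documentation.
        String ideograph = new String(Character.toChars(0x2F81A));
        System.out.println(allLettersByChar(ideograph));
        System.out.println(allLettersByCodePoint(ideograph));
    }
}
```

Under the behavior quoted above, the char-based check reports false for this string, because each surrogate is treated as an undefined character, while the code point check reports true.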

Answer by Basil Bourque

You said:

A Java char is 2 bytes (max size of 65,536) but there are 95,221 Unicode characters.

Unicode grows

Actually, the inventory of characters defined in Unicode has grown dramatically. Unicode continues to grow — and not just because of emojis.

  • 143,859 characters in Unicode 13 (not yet in Java)
  • 137,994 characters in Unicode 12.1 (Java 13 & 14)
  • 136,755 characters in Unicode 10 (Java 11 & 12)
  • 120,737 characters in Unicode 8 (Java 9)
  • 110,182 characters in Unicode 6.2 (Java 8)
  • 109,449 characters in Unicode 6.0 (Java 7)
  • 96,447 characters in Unicode 4.0 (Java 5 & 6)
  • 49,259 characters in Unicode 3.0 (Java 1.4)
  • 38,952 characters in Unicode 2.1 (Java 1.1.7)
  • 38,950 characters in Unicode 2.0 (Java 1.1)
  • 34,233 characters in Unicode 1.1.5 (Java 1.0)

char is legacy

The char type is long outmoded, now legacy.

Use code point numbers

Instead, you should be working with code point numbers.



You asked:

Does this mean that you can't handle certain Unicode characters in a Java application?

The char type can address less than half of today's Unicode characters.

To represent any Unicode character, use code point numbers. Never use char.

Every character in Unicode is assigned a code point number. These range over a million, from 0 to 1,114,111 (hex 0x10FFFF), which gives 1,114,112 possible numbers. Comparing that to the counts listed above, most of the numbers in that range have not yet been assigned to a character. Some of those numbers are reserved as Private Use Areas and will never be assigned.
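
The range of code point numbers is exposed through constants and checks on java.lang.Character, as this rough sketch shows. CodePointRangeSketch and its methods are names invented for the example; note that Character.isDefined reflects the Unicode version bundled with the running JVM, so its result for recently added characters varies by Java release.

```java
final class CodePointRangeSketch {

    // Total number of code point slots, 0 through Character.MAX_CODE_POINT (0x10FFFF).
    static int slotCount() {
        return Character.MAX_CODE_POINT + 1;
    }

    // True if the number falls in a range reserved for private use.
    static boolean isPrivateUse(int codePoint) {
        return Character.getType(codePoint) == Character.PRIVATE_USE;
    }

    public static void main(String[] args) {
        System.out.println(slotCount());                           // 1114112
        System.out.println(Character.isValidCodePoint(0x10FFFF));  // last valid number
        System.out.println(Character.isValidCodePoint(0x110000));  // one past the end
        System.out.println(isPrivateUse(0xE000));                  // start of the BMP Private Use Area
        System.out.println(Character.isDefined(0x61));             // 'a' is assigned in every version
    }
}
```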

The String class has gained methods for working with code point numbers, as has the Character class.

Get the code point number for any character in a string, by zero-based index number. Here we get 97 for the letter a.

int codePoint = "Cat".codePointAt( 1 ) ; // 97 = 'a', hex U+0061, LATIN SMALL LETTER A.

For the more general CharSequence rather than String, use Character.codePointAt.

We can get the Unicode name for a code point number.

String name = Character.getName( 97 ) ; // letter `a`

LATIN SMALL LETTER A

We can get a stream of the code point numbers of all the characters in a string.

IntStream codePointsStream = "Cat".codePoints() ;

We can turn that into a List of Integer objects. See the question "How do I convert a Java 8 IntStream to a List?".

List< Integer > codePointsList = codePointsStream.boxed().collect( Collectors.toList() ) ;

Any code point number can be changed into a String of a single character by calling Character.toString.

String s = Character.toString( 97 ) ; // 97 is `a`, LATIN SMALL LETTER A. The `int` overload requires Java 11 or later.

a

We can produce a String object from an IntStream of code point numbers. See the question "Make a string from an IntStream of code point numbers?".

IntStream intStream = IntStream.of( 67 , 97 , 116 , 32 , 128_008 ); // 32 = SPACE, 128,008 = CAT (emoji).

String output =
        intStream
                .collect(                                     // Collect the results of processing each code point.
                        StringBuilder :: new ,                // Supplier<R> supplier
                        StringBuilder :: appendCodePoint ,    // ObjIntConsumer<R> accumulator
                        StringBuilder :: append               // BiConsumer<R, R> combiner
                )                                             // Returns the `StringBuilder`, which is a `CharSequence`.
                .toString();                                  // If you would rather have a `String` than `CharSequence`, call `toString`.

Cat 🐈



You asked:

Does this boil down to what character encoding you are using?

Internally, a String in Java always uses UTF-16.

You only use another character encoding when importing text into, or exporting text out of, Java strings.

So, to answer your question, no, character encoding is not directly related here. Once you get your text into a Java String, it is in UTF-16 encoding and can therefore contain any Unicode character. Of course, to see that character, you must be using a font with a glyph defined for that particular character.

When exporting text from Java strings, if you specify a legacy character encoding that cannot represent some of the Unicode characters used in your text, you will have a problem. So use a modern character encoding, which nowadays means UTF-8, as UTF-16 is now considered harmful.
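
The export problem can be sketched by encoding the same text with two charsets and checking whether it survives decoding. ExportEncodingSketch and survivesRoundTrip are hypothetical names; ISO-8859-1 stands in here for "a legacy character encoding" purely as an example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExportEncodingSketch {

    // Encodes and decodes again with one charset; true if the text comes back unchanged.
    static boolean survivesRoundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);   // unmappable characters are silently replaced
        return new String(bytes, charset).equals(text);
    }

    public static void main(String[] args) {
        // "Cat " followed by code point 128,008, the CAT emoji.
        String text = "Cat " + new String(Character.toChars(128_008));

        System.out.println(survivesRoundTrip(text, StandardCharsets.UTF_8));
        System.out.println(survivesRoundTrip(text, StandardCharsets.ISO_8859_1));

        // An encoder can report the problem up front instead of losing data.
        System.out.println(StandardCharsets.ISO_8859_1.newEncoder().canEncode(text));
    }
}
```

Calling canEncode before exporting is one way to detect the loss early; String.getBytes by itself substitutes a replacement byte without raising an error.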