Notice: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/27287369/

Date: 2020-11-02 11:33:59  Source: igfitidea

4 byte unicode character in Java

Tags: java, unicode

Asked by Constantine

I am writing unit tests for my custom StringDatatype, and I need to write a 4-byte Unicode character in a string. "\U" does not work (illegal escape character error). For example: U+1F701 (0xf0 0x9f 0x9c 0x81). How can it be written in a string?

Answered by fge

A Unicode code point is not 4 bytes; it is an integer (ranging, at the moment, from U+0000 to U+10FFFF).
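
The range above can be checked from Java itself. The following is a minimal sketch, assuming a hypothetical helper named charsNeeded (not part of the original answer); it relies only on the standard Character.isValidCodePoint and Character.charCount methods.

```java
class CodePointRange {
    // Hypothetical helper: how many Java chars (UTF-16 code units)
    // a code point needs, 1 inside the BMP and 2 outside it.
    static int charsNeeded(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException(
                    String.format("not a code point: 0x%X", codePoint));
        }
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(Character.MAX_CODE_POINT == 0x10FFFF); // true
        System.out.println(charsNeeded(0x41));    // 1
        System.out.println(charsNeeded(0x1F701)); // 2
    }
}
```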

Your 4 bytes are (wild guess) its UTF-8 encoding version (edit: I was right).
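
One way to confirm this guess is to decode the four bytes as UTF-8 and look at the resulting code point. This is a small sketch; the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class Utf8BytesToCodePoint {
    // Decodes UTF-8 bytes and returns the first code point of the result.
    static int firstCodePoint(byte[] utf8) {
        String s = new String(utf8, StandardCharsets.UTF_8);
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        // The four bytes quoted in the question.
        byte[] bytes = {(byte) 0xF0, (byte) 0x9F, (byte) 0x9C, (byte) 0x81};
        System.out.printf("U+%X%n", firstCodePoint(bytes)); // prints U+1F701
    }
}
```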

You need to do this:

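import java.nio.charset.StandardCharsets;

final char[] chars = Character.toChars(0x1F701);            // surrogate pair for U+1F701
final String s = new String(chars);                         // s.length() == 2
final byte[] asBytes = s.getBytes(StandardCharsets.UTF_8);  // 0xF0 0x9F 0x9C 0x81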

When Java was created, Unicode did not define code points outside the BMP (i.e., U+0000 to U+FFFF), which is why a char is only 16 bits long (well, OK, this is only a guess, but I think I'm not far off the mark here). Since then, well, it had to adapt... Code points outside the BMP need two chars: a leading surrogate and a trailing surrogate, which Java calls a high and a low surrogate respectively. Java has no character literal that lets you enter a code point outside the BMP directly.

Given that a char is, in fact, a UTF-16 code unit and that there are string literals for these, you can input this "character" in a String as "\uD83D\uDF01", or directly as the symbol if your computing environment supports it.
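
To see where the two escapes come from, the sketch below asks Character for the high and low surrogates of U+1F701 and compares the literal with a string built from the code point. The class and helper names are illustrative only.

```java
class SurrogatePairDemo {
    // Builds a String from a single code point (BMP or supplementary).
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        int codePoint = 0x1F701;
        char high = Character.highSurrogate(codePoint); // leading surrogate
        char low = Character.lowSurrogate(codePoint);   // trailing surrogate
        // Cast to int because %X does not accept a char argument.
        System.out.println(String.format("%04X %04X", (int) high, (int) low)); // D83D DF01
        System.out.println("\uD83D\uDF01".equals(fromCodePoint(codePoint)));   // true
    }
}
```

Writing the escapes keeps the source file pure ASCII, which avoids depending on the encoding the compiler assumes for the file.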

See also the CharsetDecoder and CharsetEncoder classes.
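
As a rough illustration of why these classes can matter in unit tests: unlike String.getBytes(), which silently substitutes a replacement for malformed input, an encoder or decoder configured with CodingErrorAction.REPORT throws on input such as an unpaired surrogate. The StrictUtf8 class below is a hypothetical wrapper, not something from the original answer.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8 {
    // Encodes to UTF-8 and throws on malformed input (e.g. a lone surrogate).
    static byte[] encode(String s) throws CharacterCodingException {
        ByteBuffer buf = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(s));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    // Decodes UTF-8 and throws on malformed byte sequences.
    static String decode(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] bytes = encode("\uD83D\uDF01");
        System.out.println(bytes.length);                          // 4
        System.out.println(decode(bytes).equals("\uD83D\uDF01"));  // true
        try {
            encode("\uD83D"); // high surrogate without its low surrogate
        } catch (CharacterCodingException e) {
            System.out.println("rejected unpaired surrogate");
        }
    }
}
```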

See also String.codePointCount(), and, since Java 8, String.codePoints() (inherited from CharSequence).
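
For example, codePoints() makes it straightforward to walk a string one code point at a time rather than one char at a time. A short sketch (Java 8 or later; the describe helper is hypothetical):

```java
import java.util.List;
import java.util.stream.Collectors;

class CodePointIteration {
    // Lists each code point of s in U+XXXX notation.
    static List<String> describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // 'a', then U+1F701 as a surrogate pair, then 'b'
        String s = "a\uD83D\uDF01b";
        System.out.println(describe(s)); // [U+0061, U+1F701, U+0062]
    }
}
```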

Answered by Andrew

String s = "\uD83D\uDF01"; // a single supplementary character, e.g. U+1F701 from the question

Technically this is one character. But be careful: s.length() returns 2. Java also won't compile String s = '' with such a character in single quotes, since a char literal can hold only a single 16-bit char. Java does not promise that String.length() returns the exact number of characters; it returns only the number of Java chars required to store the string.

The real number of characters (code points) can be obtained from s.codePointCount(0, s.length()).
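
A short sketch of the difference, assuming the string holds the single code point U+1F701 from the question; codePointLength is an illustrative helper name. It also shows that charAt() hands back only half of the pair, while codePointAt() returns the whole code point.

```java
class LengthVsCodePoints {
    // Number of Unicode code points in s (not UTF-16 code units).
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x1F701));
        System.out.println(s.length());         // 2
        System.out.println(codePointLength(s)); // 1
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(s.codePointAt(0) == 0x1F701);            // true
    }
}
```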