Java: 4-byte Unicode characters in Java

Question

Asked by Constantine

I am writing unit tests for my custom StringDatatype, and I need to write a 4-byte Unicode character in a string literal. "\U" does not work (illegal escape character error). For example: U+1F701 (0xf0 0x9f 0x9c 0x81). How can it be written in a string?

Answer 1

Answered by fge

A Unicode code point is not 4 bytes; it is an integer (ranging, at the moment, from U+0000 to U+10FFFF).

Your 4 bytes are (wild guess) its UTF-8 encoding version (edit: I was right).

You need to do this:

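import java.nio.charset.StandardCharsets;
// Convert the code point to UTF-16 code units (a surrogate pair for U+1F701).
final char[] chars = Character.toChars(0x1F701);
final String s = new String(chars);
// Encode as UTF-8; for U+1F701 this gives 0xf0 0x9f 0x9c 0x81.
final byte[] asBytes = s.getBytes(StandardCharsets.UTF_8);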

When Java was created, Unicode did not define code points outside the BMP (i.e., U+0000 to U+FFFF), which is probably why a char is only 16 bits long (this is only a guess, but I doubt it is far off the mark). Since then, Java has had to adapt: code points outside the BMP need two chars (a leading surrogate and a trailing surrogate; Java calls these a high and a low surrogate, respectively). There is no character literal in Java that lets you enter a code point outside the BMP directly.
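
To make the surrogate arithmetic concrete, here is a minimal sketch of how a supplementary code point splits into two chars. The class and method names (SurrogatePairSketch, toSurrogates) are made up for illustration; Character.highSurrogate and Character.lowSurrogate are the JDK helpers (Java 7 and later) that do the same job.

```java
final class SurrogatePairSketch {
    // Splits a supplementary code point (above U+FFFF) into its two UTF-16 code units.
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // 20 significant bits remain
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F701);
        // Prints: U+1F701 -> D83D DF01
        System.out.printf("U+1F701 -> %04X %04X%n", (int) pair[0], (int) pair[1]);
        // The JDK helpers should agree with the manual arithmetic.
        System.out.println(pair[0] == Character.highSurrogate(0x1F701));
        System.out.println(pair[1] == Character.lowSurrogate(0x1F701));
    }
}
```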

Given that a char is, in fact, a UTF-16 code unit, and that there are string literals for these, you can enter this "character" in a String as "\uD83D\uDF01", or directly as the symbol if your computing environment supports it.
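
As a quick check that the escaped literal really is U+1F701, the sketch below compares it with the Character.toChars form and dumps its UTF-8 bytes. The class name and the utf8Hex helper are hypothetical, added only for this illustration.

```java
import java.nio.charset.StandardCharsets;

final class SurrogateLiteralSketch {
    // U+1F701 written as its two UTF-16 code units.
    static final String ESCAPED = "\uD83D\uDF01";

    // Returns the UTF-8 encoding of s as lowercase hex digits.
    static String utf8Hex(String s) {
        StringBuilder hex = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hex.append(String.format("%02x", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        String built = new String(Character.toChars(0x1F701));
        System.out.println(ESCAPED.equals(built));                       // true
        System.out.println(Integer.toHexString(ESCAPED.codePointAt(0))); // 1f701
        System.out.println(utf8Hex(ESCAPED));                            // f09f9c81
    }
}
```

The hex dump matches the four bytes quoted in the question, which is what a unit test for a custom string type would likely want to assert.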

See also the CharsetDecoder and CharsetEncoder classes.
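
One reason these classes may be useful in tests: a freshly created encoder or decoder reports malformed input by default, whereas String.getBytes silently substitutes a replacement. The following is only a sketch; the class and method names (CharsetRoundTripSketch, encodeStrict, decodeStrict) are invented here, and wrapping the checked exception in UncheckedIOException is just one possible choice.

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

final class CharsetRoundTripSketch {
    // Encodes to UTF-8, failing on malformed input such as an unpaired surrogate.
    static byte[] encodeStrict(String s) {
        try {
            ByteBuffer buf = StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(s));
            byte[] out = new byte[buf.remaining()];
            buf.get(out);
            return out;
        } catch (CharacterCodingException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decodes UTF-8 bytes, failing on malformed byte sequences.
    static String decodeStrict(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = encodeStrict("\uD83D\uDF01");
        System.out.println(bytes.length);                               // 4
        System.out.println(decodeStrict(bytes).equals("\uD83D\uDF01")); // true
    }
}
```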

See also String.codePointCount() and, since Java 8, String.codePoints() (inherited from CharSequence).
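
For instance, codePoints() can be used to walk a string one code point at a time instead of one char at a time. This is a small sketch assuming Java 8 or later; the class and method names are made up for illustration.

```java
final class CodePointIterationSketch {
    // Returns the code points of s in order, combining surrogate pairs.
    static int[] codePointsOf(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        String s = "a" + new String(Character.toChars(0x1F701)) + "b";
        System.out.println(s.length()); // 4 chars, because U+1F701 takes two
        for (int cp : codePointsOf(s)) {
            // Prints U+0061, U+1F701, U+0062 on separate lines.
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```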

Answer 2

Answered by Andrew

String s = "🜁"; // U+1F701

Technically, this is one character. But be careful: s.length() will return 2. Java also won't compile this character as a char literal (in single quotes), since it needs two chars. Java does not promise that String.length() returns the exact number of characters; it returns only the number of Java chars required to store the string.

The real number of characters (code points) can be obtained from s.codePointCount(0, s.length()).
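
The same caveat applies to indexing: cutting a string at a char index can split a surrogate pair. Below is a rough sketch of counting and truncating by code points rather than chars. The names realLength and firstCodePoints are hypothetical helpers, not JDK methods; String.offsetByCodePoints is the JDK call that maps a code point count to a char index.

```java
final class CodePointLengthSketch {
    // Number of code points in s, as opposed to the number of chars.
    static int realLength(String s) {
        return s.codePointCount(0, s.length());
    }

    // Keeps at most the first n code points without splitting a surrogate pair.
    static String firstCodePoints(String s, int n) {
        int count = Math.min(n, realLength(s));
        return s.substring(0, s.offsetByCodePoints(0, count));
    }

    public static void main(String[] args) {
        String s = "\uD83D\uDF01x";
        System.out.println(s.length());   // 3
        System.out.println(realLength(s)); // 2
        // Keeps both halves of the pair, unlike s.substring(0, 1).
        System.out.println(firstCodePoints(s, 1).length()); // 2
    }
}
```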