Java: 4-byte Unicode characters in Java
原文地址: http://stackoverflow.com/questions/27287369/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
4 byte unicode character in Java
Asked by Constantine
I am writing unit tests for my custom StringDatatype, and I need to write a 4-byte Unicode character in a string. "\U" does not work (illegal escape character error). For example: U+1F701 (0xf0 0x9f 0x9c 0x81). How can it be written in a string?
Answered by fge
A Unicode code point is not 4 bytes; it is an integer (ranging, at the moment, from U+0000 to U+10FFFF).
Your 4 bytes are (wild guess) its UTF-8 encoding version (edit: I was right).
You need to do this:
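import java.nio.charset.StandardCharsets;
// U+1F701 lies outside the BMP, so toChars() returns a surrogate pair (two chars)
final char[] chars = Character.toChars(0x1F701);
final String s = new String(chars);
// Encoding to UTF-8 yields the four bytes 0xF0 0x9F 0x9C 0x81
final byte[] asBytes = s.getBytes(StandardCharsets.UTF_8);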
When Java was created, Unicode did not define code points outside the BMP (i.e., U+0000 to U+FFFF), which is the reason why a char is only 16 bits long (well, OK, this is only a guess, but I think I'm not far off the mark here); since then, well, it had to adapt. Code points outside the BMP need two chars (a leading surrogate and a trailing surrogate; Java calls these a high and a low surrogate respectively). There is no character literal or escape in Java that lets you enter a code point outside the BMP directly.
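To make the surrogate-pair idea concrete, here is a minimal sketch of the arithmetic that turns a code point above U+FFFF into two chars, checked against Character.toChars. The class name SurrogateSketch and the helper toSurrogates are hypothetical, introduced only for illustration; in real code Character.toChars is the call to use.

```java
public class SurrogateSketch {
    // Hypothetical helper: splits a supplementary code point into its
    // UTF-16 surrogate pair by hand, to show what Character.toChars does.
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                // 20 significant bits remain
        char high = (char) (0xD800 + (offset >> 10));    // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] manual = toSurrogates(0x1F701);
        char[] builtIn = Character.toChars(0x1F701);
        // Both lines print the pair D83D DF01
        System.out.printf("manual:   %04X %04X%n", (int) manual[0], (int) manual[1]);
        System.out.printf("built-in: %04X %04X%n", (int) builtIn[0], (int) builtIn[1]);
    }
}
```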
Given that a char is, in fact, a UTF-16 code unit and that string literals accept \uXXXX escapes for these, you can input this "character" in a String as "\uD83D\uDF01", or directly as the symbol if your computing environment has support for it.
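As a quick check of the escaped form, the sketch below compares the two-escape literal with the string built from the code point and prints its UTF-8 bytes, which should match the four bytes from the question. The class name EscapeLiteralSketch and the helper utf8Hex are hypothetical names chosen for this example.

```java
import java.nio.charset.StandardCharsets;

public class EscapeLiteralSketch {
    // Hypothetical helper: renders the UTF-8 encoding of a string as hex bytes.
    static String utf8Hex(String s) {
        StringBuilder hex = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hex.append(String.format("0x%02x ", b & 0xFF));
        }
        return hex.toString().trim();
    }

    public static void main(String[] args) {
        String fromEscapes = "\uD83D\uDF01";                            // two UTF-16 code units
        String fromCodePoint = new String(Character.toChars(0x1F701));  // same code point
        System.out.println(fromEscapes.equals(fromCodePoint));          // true
        System.out.println(utf8Hex(fromEscapes));                       // 0xf0 0x9f 0x9c 0x81
    }
}
```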
See also the CharsetDecoder and CharsetEncoder classes.
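One reason to reach for these classes in a unit test is that they can report malformed input instead of silently substituting a replacement, which String.getBytes does. The following is a rough sketch under that assumption; StrictCodecSketch, encodeStrict and decodeStrict are hypothetical names, not part of the original answer.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictCodecSketch {
    // Encodes to UTF-8 and throws on malformed UTF-16 (e.g. a lone surrogate).
    static byte[] encodeStrict(String s) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer buf = encoder.encode(CharBuffer.wrap(s));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    // Decodes UTF-8 and throws on malformed byte sequences.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] utf8 = { (byte) 0xF0, (byte) 0x9F, (byte) 0x9C, (byte) 0x81 };
        String s = decodeStrict(utf8);
        System.out.println(s.codePointAt(0) == 0x1F701);  // true
        System.out.println(encodeStrict(s).length);        // 4
        try {
            encodeStrict("\uD83D");                        // lone high surrogate
        } catch (CharacterCodingException e) {
            System.out.println("rejected: " + e);
        }
    }
}
```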
See also String.codePointCount(), and, since Java 8, String.codePoints() (inherited from CharSequence).
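For illustration, here is a small sketch of walking a string by code point with codePoints() rather than by char. The class name CodePointsSketch and the helper describe are hypothetical; the sample string mixes two BMP letters with U+1F701.

```java
import java.util.stream.Collectors;

public class CodePointsSketch {
    // Hypothetical helper: formats each code point of the string as U+XXXX.
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDF01b";        // 'a', U+1F701, 'b': four chars in total
        System.out.println(describe(s));    // U+0061 U+1F701 U+0062
    }
}
```

Iterating with charAt over the same string would instead visit the two surrogates separately.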
Answered by Andrew
String s = "";
Technically this is one character (a single supplementary code point typed directly between the quotes). But be careful: s.length() will return 2. Also, Java won't compile String s = '' with the same character between single quotes. Java doesn't promise that String.length() returns the exact number of characters; it returns only the number of Java chars (UTF-16 code units) required to store the string.
The real number of characters (code points) can be obtained from s.codePointCount(0, s.length()).
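A short sketch of the difference, using U+1F701 from the question as the sample character (an assumption, since any code point above U+FFFF behaves the same way). The class name LengthVsCodePoints and the helper countCodePoints are hypothetical.

```java
public class LengthVsCodePoints {
    // Hypothetical helper: counts code points rather than UTF-16 code units.
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x1F701));
        System.out.println(s.length());                              // 2
        System.out.println(countCodePoints(s));                      // 1
        System.out.println(Character.isHighSurrogate(s.charAt(0)));  // true
        System.out.println(Integer.toHexString(s.codePointAt(0)));   // 1f701
    }
}
```

Note that code points are still not the same as user-perceived characters: combining sequences can span several code points.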