Original question: http://stackoverflow.com/questions/21522770/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Unicode escape syntax in Java
Asked by user3265048
In Java, I learned that the following syntax can be used to write Unicode characters that are not on the keyboard (e.g. non-ASCII characters):
(\u)(u)*(HexDigit)(HexDigit)(HexDigit)(HexDigit)
My question is: What is the purpose of (u)* in the above syntax?
One use case I understand is representing the yen sign in Java:
char ch = '\u00A5';
Accepted answer by Aaron Digulla
Interesting question. Section 3.3 of the JLS (Java Language Specification) says:
UnicodeEscape:
\ UnicodeMarker HexDigit HexDigit HexDigit HexDigit
UnicodeMarker:
u
UnicodeMarker u
which translates to the regular expression \\u+\p{XDigit}{4}
and
If an eligible \ is followed by u, or more than one u, and the last u is not followed by four hexadecimal digits, then a compile-time error occurs.
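The grammar and its regular-expression form can be checked with a few lines of Java. This is only a minimal sketch: the class name UnicodeEscapePattern and the helper isUnicodeEscape are made up for illustration, and the check covers a single escape in isolation, not the compiler's rule about which backslashes are eligible.

```java
import java.util.regex.Pattern;

class UnicodeEscapePattern {
    // Regex form of the grammar: one backslash, one or more 'u' markers,
    // then exactly four hexadecimal digits.
    static final Pattern UNICODE_ESCAPE = Pattern.compile("\\\\u+\\p{XDigit}{4}");

    // Hypothetical helper: true if the whole text is one Unicode escape.
    static boolean isUnicodeEscape(String text) {
        return UNICODE_ESCAPE.matcher(text).matches();
    }

    public static void main(String[] args) {
        // The doubled backslash keeps the compiler from translating the escape,
        // so each string holds the raw characters of the escape itself.
        System.out.println(isUnicodeEscape("\\u00A5"));    // one marker
        System.out.println(isUnicodeEscape("\\uuuu00A5")); // several markers
        System.out.println(isUnicodeEscape("\\u00A"));     // only three hex digits
    }
}
```

The last call is the case the specification rejects: in real source code, a marker that is not followed by four hexadecimal digits is a compile-time error.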
So you're right: there can be one or more u characters after the backslash. The reason is given further down:
The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. The transformation involves converting any Unicode escapes in the source text of the program to ASCII by adding an extra u - for example, \uxxxx becomes \uuxxxx - while simultaneously converting non-ASCII characters in the source text to Unicode escapes containing a single u each.
This transformed version is equally acceptable to a Java compiler and represents the exact same program. The exact Unicode source can later be restored from this ASCII form by converting each escape sequence where multiple u's are present to a sequence of Unicode characters with one fewer u, while simultaneously converting each escape sequence with a single u to the corresponding single Unicode character.
So this input
\u0020ä
becomes
\uu0020\u00e4
The uu in the first escape means "this was a Unicode escape sequence to begin with", while the single u in the second escape says "an automatic tool converted a non-ASCII character to a Unicode escape."
This information is useful when you want to convert back from ASCII to Unicode: you can restore as much of the original code as possible.
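The round trip the specification describes can be sketched roughly as follows. This is an illustration under stated assumptions, not the tool the JLS authors had in mind: the class UnicodeAsciiRoundTrip and its methods toAscii and fromAscii are hypothetical names, the input is assumed to be valid Java source, a backslash is treated as eligible when it is preceded by an even number of contiguous backslashes, and characters are handled one UTF-16 code unit at a time.

```java
class UnicodeAsciiRoundTrip {
    // Unicode source to ASCII: existing escapes gain one extra 'u' marker,
    // and every non-ASCII character becomes an escape with a single marker.
    static String toAscii(String source) {
        StringBuilder out = new StringBuilder();
        int backslashes = 0; // contiguous backslashes seen just before this position
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c == '\\') {
                out.append(c);
                boolean eligible = backslashes % 2 == 0;
                if (eligible && i + 1 < source.length() && source.charAt(i + 1) == 'u') {
                    out.append('u'); // mark "this was already an escape"
                }
                backslashes++;
            } else {
                backslashes = 0;
                if (c > 0x7F) {
                    out.append(String.format("\\u%04x", (int) c));
                } else {
                    out.append(c);
                }
            }
        }
        return out.toString();
    }

    // ASCII form back to Unicode source: escapes with several markers lose one,
    // escapes with a single marker turn back into the character they encode.
    static String fromAscii(String ascii) {
        StringBuilder out = new StringBuilder();
        int backslashes = 0;
        int i = 0;
        while (i < ascii.length()) {
            char c = ascii.charAt(i);
            boolean startsEscape = c == '\\' && backslashes % 2 == 0
                    && i + 1 < ascii.length() && ascii.charAt(i + 1) == 'u';
            if (startsEscape) {
                int j = i + 1;
                while (j < ascii.length() && ascii.charAt(j) == 'u') {
                    j++;
                }
                int markerCount = j - (i + 1);
                // Assumes four hex digits follow, as in any program that compiles.
                String hex = ascii.substring(j, j + 4);
                if (markerCount > 1) {
                    out.append('\\');
                    for (int k = 0; k < markerCount - 1; k++) {
                        out.append('u');
                    }
                    out.append(hex);
                } else {
                    out.append((char) Integer.parseInt(hex, 16));
                }
                i = j + 4;
                backslashes = 0;
            } else {
                backslashes = (c == '\\') ? backslashes + 1 : 0;
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The example from the answer: an escaped space followed by a-umlaut.
        String original = "\\u0020" + (char) 0xE4;
        String ascii = toAscii(original);
        System.out.println(ascii);
        System.out.println(fromAscii(ascii).equals(original));
    }
}
```

Because the extra marker records which escapes were already present in the source, fromAscii can tell them apart from the ones the conversion introduced, which is exactly why the grammar allows repeated u characters.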
Answer by assylias
It means you can add as many u characters as you want. For example, these lines are equivalent:
char ch = '\u00A5';
char ch = '\uuuuu00A5';
char ch = '\uuuuuuuuuuuuuuuuuu00A5';
(and all compile)
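To try this in one file, the three declarations need distinct variable names. A small sketch (the class name RepeatedMarkerDemo and the method yenVariants are invented for this example):

```java
class RepeatedMarkerDemo {
    // The same character written with one, five and eighteen 'u' markers.
    static char[] yenVariants() {
        char one = '\u00A5';
        char five = '\uuuuu00A5';
        char many = '\uuuuuuuuuuuuuuuuuu00A5';
        return new char[] { one, five, many };
    }

    public static void main(String[] args) {
        char[] variants = yenVariants();
        // All three literals denote the same code unit, 0xA5.
        System.out.println(variants[0] == variants[1] && variants[1] == variants[2]);
        System.out.println(Integer.toHexString(variants[0]));
    }
}
```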