Java: How to convert Unicode code points to their character representation
Original question: http://stackoverflow.com/questions/18380901/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
How do I convert unicode codepoints to their character representation?
Asked by David Michael Gang
How do I convert strings representing code points to the appropriate character?
For example, I want a function that takes U+00E4 and returns ä.
I know that the Character class has a method toChars(int codePoint) that takes an integer, but there is no method that takes a string of this form.
Is there a built-in function, or do I have to transform the string myself to get the integer that I can pass to that method?
Accepted answer by Anirudha
Code points are written as hexadecimal numbers prefixed by U+, so you can do this:
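int codepoint = Integer.parseInt(yourString.substring(2), 16); // skip the "U+" prefix, parse the rest as hex
char[] ch = Character.toChars(codepoint); // one char, or a surrogate pair for code points above U+FFFF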
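To make the two lines above easier to try out, here is a minimal sketch that wraps them in a method and adds some input checking. The class name `CodePointParser` and the method name `fromNotation` are made up for illustration, and rejecting input without the "U+" prefix is an assumption about what you want, not something the answer specifies.

```java
class CodePointParser {
    /** Converts a string such as "U+00E4" into the text it denotes (hypothetical helper). */
    static String fromNotation(String notation) {
        if (notation == null || !notation.startsWith("U+")) {
            throw new IllegalArgumentException("Expected U+XXXX, got: " + notation);
        }
        // parseInt throws NumberFormatException (a subclass of IllegalArgumentException)
        // when the remainder is not a hexadecimal number.
        int codePoint = Integer.parseInt(notation.substring(2), 16);
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a Unicode code point: " + notation);
        }
        // toChars yields one char for code points up to U+FFFF and a surrogate pair otherwise.
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(fromNotation("U+00E4"));
        System.out.println(fromNotation("U+1F601"));
    }
}
```

Returning a String rather than a char[] keeps the caller from having to care whether the code point needed one char or two.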
Answer by Joop Eggen
"\u00E4"
new String(new int[] { 0x00E4 }, 0, 1);
Answer by tateisu
This example does not use char[].
// This code is Kotlin, but you can write the same thing in Java.
val sb = StringBuilder()
val cp: Int = 0x00E4 // code point to append (example value)
when {
    Character.isBmpCodePoint(cp) -> sb.append(cp.toChar())
    Character.isValidCodePoint(cp) -> {
        sb.append(Character.highSurrogate(cp))
        sb.append(Character.lowSurrogate(cp))
    }
    else -> sb.append('?')
}
Answer by Roovy
The easiest way I've found so far is to just cast the code point. If you only expect a single char per code point (that is, code points up to U+FFFF), this might be fine for you:
int codepoint = ...;
char c = (char)codepoint;
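As a small illustration of where the cast stops being safe, the sketch below compares a code point in the Basic Multilingual Plane with one above U+FFFF. The class name `CastDemo` and the helper `fitsInOneChar` are made up for this example; the only library calls are `Character.isBmpCodePoint` and `Integer.toHexString`.

```java
class CastDemo {
    /** True when the code point fits in a single 16-bit char, so a cast loses nothing. */
    static boolean fitsInOneChar(int codePoint) {
        return Character.isBmpCodePoint(codePoint);
    }

    public static void main(String[] args) {
        int bmp = 0x00E4;
        int supplementary = 0x1F601;

        char ok = (char) bmp;
        char truncated = (char) supplementary; // silently keeps only the low 16 bits

        System.out.println(fitsInOneChar(bmp));             // true
        System.out.println(fitsInOneChar(supplementary));   // false
        System.out.println(Integer.toHexString(ok));        // e4
        System.out.println(Integer.toHexString(truncated)); // f601
    }
}
```

Checking with `fitsInOneChar` (or `Character.isBmpCodePoint` directly) before casting avoids quietly producing the wrong character.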
Answer by Abdo Magdy
You can print them directly (note that this snippet appears to be Python rather than Java):
s='\u0645\u0635\u0631\u064a'
print(s)
Answer by Qubei
Converted from Kotlin:
public String codepointToString(int cp) {
    StringBuilder sb = new StringBuilder();
    if (Character.isBmpCodePoint(cp)) {
        sb.append((char) cp);
    } else if (Character.isValidCodePoint(cp)) {
        sb.append(Character.highSurrogate(cp));
        sb.append(Character.lowSurrogate(cp));
    } else {
        sb.append('?');
    }
    return sb.toString();
}
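For comparison, the same behavior can probably be had with less code, because `StringBuilder.appendCodePoint(int)` already handles the BMP versus surrogate-pair split. The sketch below is not from any of the answers; the class name `CodePointAppend` is made up, and the '?' fallback is kept only to match the method above (`appendCodePoint` itself throws IllegalArgumentException for an invalid code point).

```java
class CodePointAppend {
    static String codepointToString(int cp) {
        if (!Character.isValidCodePoint(cp)) {
            return "?"; // same fallback as the hand-written version
        }
        // appendCodePoint adds one char or a surrogate pair as needed.
        return new StringBuilder().appendCodePoint(cp).toString();
    }

    public static void main(String[] args) {
        System.out.println(codepointToString(0x00E4).length());   // 1
        System.out.println(codepointToString(0x1F601).length());  // 2
        System.out.println(codepointToString(0x110000));          // ?
    }
}
```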
Answer by skomisa
The question asked for a function to convert a string value representing a Unicode code point (i.e. "+Unnnn" rather than the Java formats of "\unnnn" or "0xnnnn"). However, newer releases of Java have enhancements that simplify the processing of a string containing multiple code points in Unicode format:
- The introduction of Streams in Java 8.
- The method public static String toString(int codePoint), which was added to the Character class in Java 11. It returns a String rather than a char[], so Character.toString(0x00E4) returns "ä".
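A minimal sketch of `Character.toString(int)` in use, assuming JDK 11 or later. The class name `ToStringDemo` and the helper `fromCodePoints` are made up for illustration.

```java
class ToStringDemo {
    /** Concatenates the strings for each code point (hypothetical helper, needs Java 11+). */
    static String fromCodePoints(int... codePoints) {
        StringBuilder sb = new StringBuilder();
        for (int cp : codePoints) {
            // Throws IllegalArgumentException if cp is not a valid code point.
            sb.append(Character.toString(cp));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String smiley = fromCodePoints(0x1F601);
        System.out.println(fromCodePoints(0x48, 0x69));               // Hi
        System.out.println(smiley.length());                          // 2 (a surrogate pair)
        System.out.println(smiley.codePointCount(0, smiley.length())); // 1
    }
}
```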
Those enhancements allow a different approach to solving the issue raised in the OP. This method transforms a set of code points in Unicode format to a readable String in a single statement:
import java.util.Arrays;
import java.util.stream.Collectors;

void processUnicode() {
    // Create a test string containing "Hello World 😁" with code points in Unicode format.
    // Include an invalid code point (+U0wxyz), and a code point outside the Unicode range (+U70FFFF).
    String data = "+U0048+U0065+U006c+U006c+U0wxyz+U006f+U0020+U0057+U70FFFF+U006f+U0072+U006c+U0000064+U20+U1f601";
    String text = Arrays.stream(data.split("\\+U")) // The '+' must be escaped because split() takes a regex.
            .filter(s -> !s.isEmpty()) // First element returned by split() is a zero length string.
            .map(s -> {
                try {
                    return Integer.parseInt(s, 16);
                } catch (NumberFormatException e) {
                    System.out.println("Ignoring element [" + s + "]: NumberFormatException from parseInt(\"" + s + "\"}");
                }
                return null; // If the code point is not represented as a valid hex String.
            })
            .filter(v -> v != null) // Ignore syntactically invalid code points.
            .filter(i -> Character.isValidCodePoint(i)) // Ignore code points outside of Unicode range.
            .map(i -> Character.toString(i)) // Obtain the string value directly from the code point. (Requires JDK >= 11)
            .collect(Collectors.joining());
    System.out.println(text); // Prints "Hello World 😁"
}
And this is the output:
run:
Ignoring element [0wxyz]: NumberFormatException from parseInt("0wxyz"}
Hello World 😁
BUILD SUCCESSFUL (total time: 0 seconds)
Notes:
- With this approach there is no longer any need for a specific function to convert a code point in Unicode format. That work is instead spread across multiple intermediate operations in the Stream processing. Of course, the same code could still be used to process just a single code point in Unicode format.
- It's easy to add intermediate operations to perform further validation and processing on the Stream, such as case conversion, removal of emoticons, etc.
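To illustrate the second note, here is a rough sketch of the same kind of pipeline with two extra intermediate operations. Several things here are assumptions rather than part of the answer: the class name `UnicodePipeline` and method name `decode` are invented, a regex pre-filter stands in for the try/catch, and dropping every code point above U+FFFF is only a crude stand-in for "removal of emoticons" (some symbols below U+FFFF would still get through). It needs JDK 11 or later for `Character.toString(int)`.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

class UnicodePipeline {
    static String decode(String data) {
        return Arrays.stream(data.split("\\+U"))
                .filter(s -> s.matches("[0-9A-Fa-f]{1,7}"))   // skip empty and non-hex elements; 7 digits cannot overflow an int
                .mapToInt(s -> Integer.parseInt(s, 16))
                .filter(cp -> Character.isValidCodePoint(cp)) // drop values outside the Unicode range
                .filter(cp -> Character.isBmpCodePoint(cp))   // extra step: crude emoticon removal
                .map(cp -> Character.toUpperCase(cp))         // extra step: case conversion
                .mapToObj(cp -> Character.toString(cp))       // Java 11+
                .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        String data = "+U0048+U0065+U006c+U006c+U0wxyz+U006f+U0020+U0057+U70FFFF+U006f+U0072+U006c+U0000064+U20+U1f601";
        System.out.println("[" + decode(data) + "]"); // [HELLO WORLD ]
    }
}
```

Using an IntStream after parsing avoids boxing and the null placeholder, at the cost of no longer logging which elements were skipped.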