Java: comparing a char to a code point?

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/1029897/

Time: 2020-08-11 22:26:56  Source: igfitidea

Comparing a char to a code-point?

java unicode

Asked by Gili

What is the "correct" way of comparing a code-point to a Java character? For example:

int codepoint = str.codePointAt(0);  // str is any String instance; codePointAt is an instance method
char token = '\n';

I know I can probably do:

if (codepoint==(int) token)
{ ... }

but this code looks fragile. Is there a formal API method for comparing codepoints to chars, or for converting the char up to a codepoint for comparison?

Accepted answer by Christian Hang-Hicks

A little bit of background: When Java appeared in 1995, the char type was based on the original "Unicode 88" specification, which was limited to 16 bits. A year later, when Unicode 2.0 was implemented, the concept of surrogate characters was introduced to go beyond the 16-bit limit.

Java internally represents all Strings in UTF-16 format. A code point exceeding U+FFFF is represented by a surrogate pair, i.e., two chars: the first is the high-surrogate code unit (in the range \uD800-\uDBFF) and the second is the low-surrogate code unit (in the range \uDC00-\uDFFF).
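
To make the surrogate-pair layout concrete, here is a minimal sketch that splits one supplementary code point into its two chars and joins them again. The class and method names are hypothetical, and it assumes Java 7 or later for Character.highSurrogate and Character.lowSurrogate.

```java
// Sketch only: SurrogatePairDemo, split and join are illustrative names.
final class SurrogatePairDemo {
    // U+1D50A (MATHEMATICAL FRAKTUR CAPITAL G) lies above U+FFFF.
    static final int FRAKTUR_G = 0x1D50A;

    // Returns {high surrogate, low surrogate} for a supplementary code point.
    static char[] split(int codePoint) {
        return new char[] {
            Character.highSurrogate(codePoint),
            Character.lowSurrogate(codePoint)
        };
    }

    // Recombines a surrogate pair into the code point it encodes.
    static int join(char high, char low) {
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        char[] pair = split(FRAKTUR_G);
        System.out.printf("U+%X -> 0x%04X 0x%04X%n",
                FRAKTUR_G, (int) pair[0], (int) pair[1]);
        System.out.printf("joined: U+%X%n", join(pair[0], pair[1]));
    }
}
```

For U+1D50A the high surrogate is 0xD835 and the low surrogate is 0xDD0A, both inside the ranges given above.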

From the early days, all basic Character methods were based on the assumption that a code point could be represented in one char, so that's what the method signatures look like. I guess that, to preserve backward compatibility, this was not changed when Unicode 2.0 came around, so caution is needed when dealing with them. To quote from the Java documentation:

  • The methods that only accept a char value cannot support supplementary characters. They treat char values from the surrogate ranges as undefined characters. For example, Character.isLetter('\uD840') returns false, even though this specific value if followed by any low-surrogate value in a string would represent a letter.
  • The methods that accept an int value support all Unicode characters, including supplementary characters. For example, Character.isLetter(0x2F81A) returns true because the code point value represents a letter (a CJK ideograph).
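
The difference between the two overloads quoted above can be checked with a short sketch. The class and method names are made up for illustration; the code point 0x2F81A is the one used in the quoted documentation.

```java
// Sketch only: IsLetterOverloads, viaChar and viaCodePoint are illustrative names.
final class IsLetterOverloads {
    // Supplementary CJK ideograph taken from the quoted documentation.
    static final int CJK_IDEOGRAPH = 0x2F81A;

    // char overload: sees only the first UTF-16 code unit of the string.
    static boolean viaChar(String s) {
        return Character.isLetter(s.charAt(0));
    }

    // int overload: sees the whole code point starting at index 0.
    static boolean viaCodePoint(String s) {
        return Character.isLetter(s.codePointAt(0));
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(CJK_IDEOGRAPH));
        System.out.println(viaChar(s));      // leading surrogate alone is not a letter
        System.out.println(viaCodePoint(s)); // the full code point is a letter
    }
}
```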

Casting the char to an int, as you do in your sample, works fine though.

Answer by Jherico

For characters in the basic multilingual plane, casting the char to an int will get you the codepoint. This corresponds to all the unicode values that can be encoded in a single 16 bit char value. Values outside this plane (with codepoints exceeding 0xffff) cannot be expressed as a single character. This is probably why there is no Character.toCodePoint(char value).

Answer by lavinio

Java uses a 16-bit (UTF-16) model for handling characters, so any characters with codepoints > 0xFFFF are stored in strings as pairs of 16-bit characters, using two surrogate characters to represent the plane and the character within the plane.

If you want to handle characters and strings properly according to the full Unicode standard, you need to process strings taking this into account.
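
One way to take surrogate pairs into account is to step through a string by code point instead of by char. The following is a rough sketch under that assumption; the class name, the contains helper, and the sample string are all hypothetical.

```java
// Sketch only: CodePointWalk and contains are illustrative names.
final class CodePointWalk {
    // Returns true if the string contains the given code point,
    // advancing by one or two chars depending on the code point read.
    static boolean contains(String s, int target) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp == target) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        // "a", U+1D50A, "b": three code points stored in four chars.
        String s = "a" + new String(Character.toChars(0x1D50A)) + "b";
        System.out.println(s.length());                      // UTF-16 code units
        System.out.println(s.codePointCount(0, s.length())); // Unicode code points
        System.out.println(contains(s, 0x1D50A));
    }
}
```

Because the loop advances by Character.charCount, the low surrogate of a valid pair is never examined on its own.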

XML cares a lot about this; it's useful to access the XMLChar class in Xerces (which comes with Java version 5.0 and higher) for character-related code.

It's also instructive to look at the Saxon XSLT/XQuery processor, since being a well-behaved XML application, it has to take into account how Java stores codepoints in strings. XQuery 1.0 and XPath 2.0 have functions for codepoints-to-string and string-to-codepoints; it might be instructive to get a copy of Saxon and play with them to see how they work.

Answer by JimN

For a character which can be represented by a single char (16 bits, basic multilingual plane), you can get the codepoint simply by casting the char to an integer (as the question suggests), so there's no need for a special method to perform a conversion.

If you're comparing a char to a codepoint, you don't need any special casing. Just compare the char to the int directly (as the question suggests). If the int represents a codepoint outside of the basic multilingual plane, the result will always be false.
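
A small sketch of that direct comparison, with hypothetical class and method names: the char operand is widened to an int in the range 0 to 0xFFFF before ==, so no explicit cast is needed.

```java
// Sketch only: CharVsCodePoint and matches are illustrative names.
final class CharVsCodePoint {
    // The char is promoted to int before the comparison takes place.
    static boolean matches(int codePoint, char c) {
        return codePoint == c;
    }

    public static void main(String[] args) {
        int newline = "\nrest".codePointAt(0);
        System.out.println(matches(newline, '\n'));

        // A code point above 0xFFFF can never equal a single char,
        // not even the high surrogate that starts its UTF-16 encoding.
        int supplementary = 0x1D50A;
        System.out.println(matches(supplementary, (char) 0xD835));
    }
}
```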

Answer by McDowell

The Character class contains many useful methods for working with Unicode code points. Note methods like Character.toChars(int) that return an array of chars. If your codepoint lies in the supplementary range, then the array will be two chars in length.
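
As a quick illustration of Character.toChars(int), this sketch reports how many chars a code point needs. The class and method names are hypothetical; only the Character calls come from the JDK.

```java
// Sketch only: ToCharsDemo and unitCount are illustrative names.
final class ToCharsDemo {
    // Number of UTF-16 code units needed to store the code point.
    static int unitCount(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    public static void main(String[] args) {
        int[] samples = { 'A', 0x1D50A };
        for (int cp : samples) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase()
                    + " needs " + unitCount(cp) + " char(s), supplementary="
                    + Character.isSupplementaryCodePoint(cp));
        }
    }
}
```

Character.charCount(int) gives the same count without allocating an array, which may be preferable inside a loop.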

How you want to compare the values depends on whether you want to support the full range of Unicode values. This sample code can be used to iterate through a String's codepoints, testing to see if there is a match for the supplementary character MATHEMATICAL_FRAKTUR_CAPITAL_G (
