Notice: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/5315330/

Time: 2020-10-30 10:34:53  Source: igfitidea

Matching (e.g.) a Unicode letter with Java regexps

java regex unicode character-properties character-class

Asked by The Archetypal Paul

There are many questions and answers here on StackOverflow that assume a "letter" can be matched in a regexp by [a-zA-Z]. However, with Unicode there are many more characters that most people would regard as letters (all the Greek letters, Cyrillic, and many more). Unicode defines many blocks, each of which may have "letters".

The Java definition defines POSIX classes for things like alpha characters, but these are specified to work only with US-ASCII. The predefined character classes define word characters as [a-zA-Z_0-9], which also excludes many letters.
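To make the problem concrete, here is a minimal sketch that tries the three ASCII-oriented constructs mentioned above against one ASCII letter and two non-ASCII letters. The class and method names are illustrative only, and the patterns are compiled with default flags.

```java
import java.util.regex.Pattern;

// Illustrative class name; not part of the original question.
class AsciiOnlyClasses {
    // Pattern.matches requires the regex to cover the whole input.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String ascii = "e";
        String accented = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE
        String greek = "\u03A9";    // GREEK CAPITAL LETTER OMEGA

        for (String regex : new String[] {"[a-zA-Z]", "\\p{Alpha}", "\\w"}) {
            System.out.println(regex
                + "  ascii=" + matches(regex, ascii)
                + "  accented=" + matches(regex, accented)
                + "  greek=" + matches(regex, greek));
        }
    }
}
```

With default flags, each of the three patterns accepts the plain "e" and rejects both the accented and the Greek letter.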

So how do you properly match against Unicode strings? Is there some other library that gets this right?

Answered by dLobatog

Here you have a very nice explanation:

http://www.regular-expressions.info/unicode.html

Some hints:

"Java and .NET unfortunately do not support \X (yet). Use \P{M}\p{M}* as a substitute. To match any number of graphemes, use (?:\P{M}\p{M}*)+ instead of \X+."

"In Java, the regex token \uFFFF only matches the specified code point, even when you turned on canonical equivalence. However, the same syntax \uFFFF is also used to insert Unicode characters into literal strings in the Java source code. Pattern.compile("\u00E0") will match both the single-code-point and double-code-point encodings of à, while Pattern.compile("\\u00E0") matches only the single-code-point version. Remember that when writing a regex as a Java string literal, backslashes must be escaped. The former Java code compiles the regex à, while the latter compiles \u00E0. Depending on what you're doing, the difference may be significant."

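The \P{M}\p{M}* substitute from the first hint can be sketched as follows: one code point that is not a mark, followed by any number of combining marks, is treated as one grapheme. The class and method names are made up for illustration. The quoted hint predates Java 9, which as far as I know accepts \X directly, so the substitute matters mainly on older runtimes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative names; a sketch of the quoted substitute, not a full grapheme segmenter.
class GraphemeSubstitute {
    // One non-mark code point followed by zero or more combining marks.
    static final Pattern GRAPHEME = Pattern.compile("\\P{M}\\p{M}*");
    // Any non-empty run of such units, standing in for \X+.
    static final Pattern GRAPHEMES = Pattern.compile("(?:\\P{M}\\p{M}*)+");

    static int countGraphemes(String s) {
        Matcher m = GRAPHEME.matcher(s);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // a + combining grave, e + combining acute, then a plain z
        String s = "a\u0300e\u0301z";
        System.out.println(s.length());                    // 5 chars
        System.out.println(countGraphemes(s));             // 3 units
        System.out.println(GRAPHEMES.matcher(s).matches()); // true
    }
}
```

This approximation only groups base characters with their combining marks; it does not attempt the full extended grapheme cluster rules.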

Answered by erickson

Are you talking about Unicode categories, like letters? These are matched by a regex of the form \p{CAT}, where "CAT" is the category code, like L for any letter, or a subcategory like Lu for uppercase or Lt for title-case.
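As a small sketch of the \p{CAT} form, the following checks single characters and a whole word against the L, Lu, and Lt categories. The helper names are illustrative, not part of any library.

```java
import java.util.regex.Pattern;

// Illustrative helper names; each method wraps one category pattern.
class UnicodeCategories {
    static boolean isLetter(String s) {
        return Pattern.matches("\\p{L}", s);   // any letter
    }

    static boolean isUppercase(String s) {
        return Pattern.matches("\\p{Lu}", s);  // uppercase letter
    }

    static boolean isTitlecase(String s) {
        return Pattern.matches("\\p{Lt}", s);  // title-case letter
    }

    static boolean isWordOfLetters(String s) {
        return Pattern.matches("\\p{L}+", s);  // one or more letters
    }

    public static void main(String[] args) {
        System.out.println(isLetter("\u00E9"));     // e with acute
        System.out.println(isLetter("\u0416"));     // Cyrillic capital ZHE
        System.out.println(isLetter("7"));          // a digit, not a letter
        System.out.println(isUppercase("\u03A9"));  // Greek capital OMEGA
        System.out.println(isUppercase("\u03C9"));  // Greek small omega
        System.out.println(isTitlecase("\u01C5"));  // capital D with small z with caron
        System.out.println(isWordOfLetters("\u0416\u0438\u0437\u043D\u044C")); // a Cyrillic word
    }
}
```

Replacing [a-zA-Z] with \p{L} in an existing pattern is usually the smallest change that makes it accept letters from any script.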

Answered by adarshr

Quoting from the JavaDoc of java.util.regex.Pattern.

Unicode support

This class is in conformance with Level 1 of Unicode Technical Standard #18: Unicode Regular Expression Guidelines, plus RL2.1 Canonical Equivalents.

Unicode escape sequences such as \u2014 in Java source code are processed as described in §3.3 of the Java Language Specification. Such escape sequences are also implemented directly by the regular-expression parser so that Unicode escapes can be used in expressions that are read from files or from the keyboard. Thus the strings "\u2014" and "\\u2014", while not equal, compile into the same pattern, which matches the character with hexadecimal value 0x2014.
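The claim about the two string forms can be checked with a short sketch. The class and constant names are illustrative; the first string is resolved by the compiler, the second by the regex parser.

```java
import java.util.regex.Pattern;

// Illustrative names; compares a source-level escape with a regex-level escape.
class EscapeLevels {
    // The compiler replaces the escape, so this string holds one char (the em dash).
    static final String LITERAL = "\u2014";
    // The doubled backslash survives compilation, so this string holds six chars
    // and the regex parser interprets the escape itself.
    static final String REGEX_ESCAPE = "\\u2014";

    static boolean matchesEmDash(String regex) {
        return Pattern.matches(regex, "\u2014");
    }

    public static void main(String[] args) {
        System.out.println(LITERAL.length());            // 1
        System.out.println(REGEX_ESCAPE.length());       // 6
        System.out.println(LITERAL.equals(REGEX_ESCAPE)); // false
        System.out.println(matchesEmDash(LITERAL));       // true
        System.out.println(matchesEmDash(REGEX_ESCAPE));  // true
    }
}
```

The six-character form is the one a pattern read from a file or typed at the keyboard would contain, since no Java compiler sees that text.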

Unicode blocks and categories are written with the \p and \P constructs as in Perl. \p{prop} matches if the input has the property prop, while \P{prop} does not match if the input has that property. Blocks are specified with the prefix In, as in InMongolian. Categories may be specified with the optional prefix Is: Both \p{L} and \p{IsL} denote the category of Unicode letters. Blocks and categories can be used both inside and outside of a character class.

The supported categories are those of The Unicode Standard in the version specified by the Character class. The category names are those defined in the Standard, both normative and informative. The block names supported by Pattern are the valid block names accepted and defined by UnicodeBlock.forName.
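The block and category syntax quoted above can be tried out with a minimal sketch like this one. The class and method names are illustrative; the sample characters are chosen only because their blocks and categories are easy to verify.

```java
import java.util.regex.Pattern;

// Illustrative names; exercises the In/Is prefixes, \P negation, and use inside a class.
class BlocksAndCategories {
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String omegaUpper = "\u03A9";   // GREEK CAPITAL LETTER OMEGA
        String omegaLower = "\u03C9";   // GREEK SMALL LETTER OMEGA
        String mongolianA = "\u1820";   // MONGOLIAN LETTER A

        // Blocks use the In prefix.
        System.out.println(matches("\\p{InGreek}", omegaUpper));
        System.out.println(matches("\\p{InMongolian}", mongolianA));

        // The Is prefix on a category is optional: both forms accept a letter.
        System.out.println(matches("\\p{IsL}", omegaLower));
        System.out.println(matches("\\p{L}", omegaLower));

        // \P negates: it accepts a digit and rejects a letter.
        System.out.println(matches("\\P{L}", "7"));
        System.out.println(matches("\\P{L}", omegaLower));

        // Inside a character class: any letter except an uppercase letter.
        System.out.println(matches("[\\p{L}&&[^\\p{Lu}]]", omegaLower));
        System.out.println(matches("[\\p{L}&&[^\\p{Lu}]]", omegaUpper));

        // Block names are the ones Character.UnicodeBlock.forName accepts.
        System.out.println(Character.UnicodeBlock.forName("Mongolian"));
    }
}
```

Note that a block test and a category test answer different questions: a block such as Greek also contains non-letter code points, so \p{InGreek} alone does not mean "a Greek letter".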
