Unicode equivalents for \w and \b in Java regular expressions?

Question

Asked by Tim Pietzcker

Many modern regex implementations interpret the \w character class shorthand as "any letter, digit, or connecting punctuation" (usually: underscore). That way, a regex like \w+ matches words like hello, élève, GOÄ_432 or gefräßig.

Unfortunately, Java doesn't. In Java, \w is limited to [A-Za-z0-9_]. This makes matching words like those mentioned above difficult, among other problems.

It also appears that the \b word separator matches in places where it shouldn't.

What would be the correct equivalent of a .NET-like, Unicode-aware \w or \b in Java? Which other shortcuts need "rewriting" to make them Unicode-aware?

Answer 1

Accepted answer by tchrist

Source code

The source code for the rewriting functions I discuss below is available here.

Update in Java 7

Sun's updated Pattern class for JDK7 has a marvelous new flag, UNICODE_CHARACTER_CLASS, which makes everything work right again. It's available as an embeddable (?U) for use inside the pattern, so you can use it with the String class's wrappers, too. It also sports corrected definitions for various other properties. It now tracks The Unicode Standard, in both RL1.2 and RL1.2a from UTS#18: Unicode Regular Expressions. This is an exciting and dramatic improvement, and the development team is to be commended for this important effort.
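
As a minimal sketch of the flag in use (assuming Java 7 or later; the class and method names here are made up for illustration), the following compares the default shorthand with the two ways of switching the flag on:

```java
import java.util.regex.Pattern;

public class UnicodeFlagDemo {
    // The word "eleve" with its accents, written with escapes so the source stays ASCII.
    static final String WORD = "\u00E9l\u00E8ve";

    // Default behaviour: the shorthand only covers [A-Za-z0-9_].
    static boolean asciiOnly() {
        return WORD.matches("\\w+");
    }

    // Embedded (?U) flag, usable through the String.matches wrapper.
    static boolean embeddedFlag() {
        return WORD.matches("(?U)\\w+");
    }

    // The same flag passed explicitly to Pattern.compile.
    static boolean compiledFlag() {
        return Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(WORD)
                      .matches();
    }

    public static void main(String[] args) {
        System.out.println("default:  " + asciiOnly());
        System.out.println("(?U):     " + embeddedFlag());
        System.out.println("compiled: " + compiledFlag());
    }
}
```

The embedded form is the one to reach for when the pattern is only ever passed as a string, for example to String.split or String.replaceAll.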

Java's Regex Unicode Problems

The problem with Java regexes is that the Perl 1.0 charclass escapes (meaning \w, \b, \s, \d and their complements) are not extended in Java to work with Unicode. Alone amongst these, \b enjoys certain extended semantics, but these map neither to \w, nor to Unicode identifiers, nor to Unicode line-break properties.

Additionally, the POSIX properties in Java are accessed this way:

POSIX syntax    Java syntax

[[:Lower:]]     \p{Lower}
[[:Upper:]]     \p{Upper}
[[:ASCII:]]     \p{ASCII}
[[:Alpha:]]     \p{Alpha}
[[:Digit:]]     \p{Digit}
[[:Alnum:]]     \p{Alnum}
[[:Punct:]]     \p{Punct}
[[:Graph:]]     \p{Graph}
[[:Print:]]     \p{Print}
[[:Blank:]]     \p{Blank}
[[:Cntrl:]]     \p{Cntrl}
[[:XDigit:]]    \p{XDigit}
[[:Space:]]     \p{Space}

This is a real mess, because it means that things like Alpha, Lower, and Space do not in Java map to the Unicode Alphabetic, Lowercase, or Whitespace properties. This is exceedingly annoying. Java's Unicode property support is strictly antemillennial, by which I mean it supports no Unicode property that has come out in the last decade.

Not being able to talk about whitespace properly is super-annoying. Consider the following table. For each of those code points, there is both a J-results column for Java and a P-results column for Perl or any other PCRE-based regex engine:

             Regex    001A    0085    00A0    2029
                      J  P    J  P    J  P    J  P
                \s    1  1    0  1    0  1    0  1
               \pZ    0  0    0  0    1  1    1  1
            \p{Zs}    0  0    0  0    1  1    0  0
         \p{Space}    1  1    0  1    0  1    0  1
         \p{Blank}    0  0    0  0    0  1    0  0
    \p{Whitespace}    -  1    -  1    -  1    -  1
\p{javaWhitespace}    1  -    0  -    0  -    1  -
 \p{javaSpaceChar}    0  -    0  -    1  -    1  -

See that?

Virtually every one of those Java white space results is wrong according to Unicode. It's a really big problem. Java is just messed up, giving answers that are "wrong" according to existing practice and also according to Unicode. Plus Java doesn't even give you access to the real Unicode properties! In fact, Java does not support any property that corresponds to Unicode whitespace.
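
Some of the Java columns can be reproduced with a small probe. This is only a sketch: the class and helper names are invented, and the (?U) column assumes Java 7 or later, where the flag mentioned in the update section exists.

```java
public class WhitespaceProbe {
    // Three code points from the table: NEL, NO-BREAK SPACE, PARAGRAPH SEPARATOR.
    static final int[] CODE_POINTS = {0x0085, 0x00A0, 0x2029};

    // True if the one-character string for codePoint matches the regex.
    static boolean regexMatches(String regex, int codePoint) {
        return new String(Character.toChars(codePoint)).matches(regex);
    }

    public static void main(String[] args) {
        for (int cp : CODE_POINTS) {
            System.out.printf(
                "U+%04X  plain-s=%b  unicode-s=%b  javaWhitespace=%b  javaSpaceChar=%b%n",
                cp,
                regexMatches("\\s", cp),        // default ASCII-only shorthand
                regexMatches("(?U)\\s", cp),    // Unicode White_Space with the flag
                Character.isWhitespace(cp),     // what \p{javaWhitespace} tests
                Character.isSpaceChar(cp));     // what \p{javaSpaceChar} tests
        }
    }
}
```

None of the three code points matches the plain shorthand, and the two Character methods disagree with each other on U+00A0, which is the inconsistency the table illustrates.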

The Solution to All Those Problems, and More

To deal with this and many other related problems, yesterday I wrote a Java function that rewrites a pattern string, replacing these 14 charclass escapes:

\w \W \s \S \v \V \h \H \d \D \b \B \X \R

by replacing them with things that actually work to match Unicode in a predictable and consistent fashion. It's only an alpha prototype from a single hack session, but it is completely functional.

The short story is that my code rewrites those 14 as follows:

\s => [\u0009-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]
\S => [^\u0009-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]

\v => [\u000A-\u000D\u0085\u2028\u2029]
\V => [^\u000A-\u000D\u0085\u2028\u2029]

\h => [\u0009\u0020\u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000]
\H => [^\u0009\u0020\u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000]

\w => [\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]
\W => [^\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]

\b => (?:(?<=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])|(?<![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]))
\B => (?:(?<=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])|(?<![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]))

\d => \p{Nd}
\D => \P{Nd}

\R => (?:(?>\u000D\u000A)|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029])

\X => (?>\PM\pM*)

Some things to consider...

That uses for its \X definition what Unicode now refers to as a legacy grapheme cluster, not an extended grapheme cluster, as the latter is rather more complicated. Perl itself now uses the fancier version, but the old version is still perfectly workable for the most common situations. EDIT: See addendum at bottom.
What to do about \d depends on your intent, but the default is the Unicode definition. I can see people not always wanting \p{Nd}, but sometimes either [0-9] or \pN.
The two boundary definitions, \b and \B, are specifically written to use the \w definition.
That \w definition is overly broad, because it grabs the parenned letters, not just the circled ones. The Unicode Other_Alphabetic property isn't available until JDK7, so that's the best you can do.
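
For illustration, the \w replacement can be dropped into a Java string literal directly. The sketch below uses invented class and method names and only exercises the positive class; it also shows the over-broad case noted above, where a parenthesized letter is accepted alongside a circled one.

```java
import java.util.regex.Pattern;

public class UnicodeWordClass {
    // The word-character replacement from the list above, with backslashes doubled for Java.
    static final String WORD =
        "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]]";

    static final Pattern WORDS = Pattern.compile(WORD + "+");

    // True if the whole string consists of word characters under that definition.
    static boolean isWord(String s) {
        return WORDS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isWord("gefr\u00E4\u00DFig")); // letters outside ASCII
        System.out.println(isWord("\u24B6"));             // circled capital A
        System.out.println(isWord("\u249C"));             // parenthesized small a, the over-broad case
        System.out.println(isWord("a-b"));                // a hyphen is not a word character
    }
}
```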

Exploring Boundaries

Boundaries have been a problem ever since Larry Wall first coined the \b and \B syntax for talking about them for Perl 1.0 back in 1987. The key to understanding how \b and \B both work is to dispel two pervasive myths about them:

They are only ever looking for \w word characters, never for non-word characters.
They do not specifically look for the edge of the string.

A \b boundary means:

    IF does follow word
        THEN doesn't precede word
    ELSIF doesn't follow word
        THEN does precede word

And those are all defined perfectly straightforwardly as:

follows word is (?<=\w).
precedes word is (?=\w).
doesn't follow word is (?<!\w).
doesn't precede word is (?!\w).

Therefore, since IF-THEN is encoded as an and'ed-together AB in regexes, an or is X|Y, and because the and is higher in precedence than or, that is simply AB|CD. So every \b that means a boundary can be safely replaced with:

    (?:(?<=\w)(?!\w)|(?<!\w)(?=\w))

with the \w defined in the appropriate way.

(You might think it strange that the A and C components are opposites. In a perfect world, you should be able to write that AB|D, but for a while I was chasing down mutual exclusion contradictions in Unicode properties. I think I've taken care of those, but I left the double condition in the boundary just in case. Plus this makes it more extensible if you get extra ideas later.)
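
The AB|CD form is easy to generate for any word definition. The following is a rough sketch with hypothetical names (BoundaryBuilder, boundary, offsets), and the word class in main is a simplified assumption rather than the full definition given earlier:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryBuilder {
    // Builds the AB|CD boundary expansion for a single-character word class w.
    static String boundary(String w) {
        return "(?:(?<=" + w + ")(?!" + w + ")|(?<!" + w + ")(?=" + w + "))";
    }

    // Returns every offset at which the zero-width pattern matches.
    static List<Integer> offsets(String regex, String input) {
        List<Integer> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.start());
        }
        return found;
    }

    public static void main(String[] args) {
        // Assumed, simplified word class: letters, marks, decimal digits, connectors.
        String word = "[\\pL\\pM\\p{Nd}\\p{Pc}]";
        // An accented five-letter word followed by an exclamation mark.
        System.out.println(offsets(boundary(word), "\u00E9l\u00E8ve!"));
    }
}
```

With that input the boundaries fall before the first letter and between the last letter and the exclamation mark; the end of the string is not a boundary because no word character touches it.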

For the \B non-boundaries, the logic is:

    IF does follow word
        THEN does precede word
    ELSIF doesn't follow word
        THEN doesn't precede word

Allowing all instances of \B to be replaced with:

    (?:(?<=\w)(?=\w)|(?<!\w)(?!\w))

This really is how \b and \B behave. Equivalent patterns for them are

\b using the ((IF)THEN|ELSE) construct is (?(?<=\w)(?!\w)|(?=\w))
\B using the ((IF)THEN|ELSE) construct is (?(?=\w)(?<=\w)|(?<!\w))

But the versions with just AB|CD are fine, especially if you lack conditional patterns in your regex language, as Java does.
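
A much smaller check than the test suite described next can be sketched in a few lines. It is restricted to ASCII input on purpose, since that is where Java's built-in \b and \B are expected to agree with the ASCII-only \w; the class and method names are invented for the example.

```java
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryEquivalence {
    static final String B_EXPANDED     = "(?:(?<=\\w)(?!\\w)|(?<!\\w)(?=\\w))";
    static final String NOT_B_EXPANDED = "(?:(?<=\\w)(?=\\w)|(?<!\\w)(?!\\w))";

    // Collects every offset at which a zero-width pattern matches.
    static TreeSet<Integer> offsets(String regex, String input) {
        TreeSet<Integer> found = new TreeSet<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.start());
        }
        return found;
    }

    // True if the expansions match at exactly the same offsets as the built-ins.
    static boolean agreesWithBuiltins(String asciiInput) {
        return offsets("\\b", asciiInput).equals(offsets(B_EXPANDED, asciiInput))
            && offsets("\\B", asciiInput).equals(offsets(NOT_B_EXPANDED, asciiInput));
    }

    // True if every offset 0..length is a boundary or a non-boundary, never both.
    static boolean partitions(String asciiInput) {
        TreeSet<Integer> b = offsets(B_EXPANDED, asciiInput);
        TreeSet<Integer> notB = offsets(NOT_B_EXPANDED, asciiInput);
        TreeSet<Integer> all = new TreeSet<>(b);
        all.addAll(notB);
        return all.size() == asciiInput.length() + 1
            && b.size() + notB.size() == all.size();
    }

    public static void main(String[] args) {
        String sample = "foo_bar, baz 42!";
        System.out.println(agreesWithBuiltins(sample));
        System.out.println(partitions(sample));
    }
}
```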

I've already verified the behaviour of the boundaries using all three equivalent definitions with a test suite that checks 110,385,408 matches per run, and which I've run on a dozen different data configurations according to:

     0 ..     7F    the ASCII range
    80 ..     FF    the non-ASCII Latin1 range
   100 ..   FFFF    the non-Latin1 BMP (Basic Multilingual Plane) range
 10000 .. 10FFFF    the non-BMP portion of Unicode (the "astral" planes)

However, people often want a different sort of boundary. They want something that is whitespace and edge-of-string aware:

left edge as (?:(?<=^)|(?<=\s))
right edge as (?=$|\s)
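
One place where these edges matter is a token that begins or ends with non-word characters. The sketch below (names are illustrative only) wraps a literal token in the two edges and compares that with a plain \b search:

```java
import java.util.regex.Pattern;

public class TokenEdges {
    static final String LEFT_EDGE  = "(?:(?<=^)|(?<=\\s))";
    static final String RIGHT_EDGE = "(?=$|\\s)";

    // Finds the token only when whitespace or a string edge delimits it on both sides.
    static boolean containsToken(String text, String token) {
        return Pattern.compile(LEFT_EDGE + Pattern.quote(token) + RIGHT_EDGE)
                      .matcher(text)
                      .find();
    }

    // The same search with built-in word boundaries, for comparison.
    static boolean containsWithWordBoundaries(String text, String token) {
        return Pattern.compile("\\b" + Pattern.quote(token) + "\\b")
                      .matcher(text)
                      .find();
    }

    public static void main(String[] args) {
        String text = "I like C++ a lot";
        System.out.println(containsToken(text, "C++"));
        System.out.println(containsWithWordBoundaries(text, "C++"));
    }
}
```

The \b version fails here because the position between "+" and the following space has a non-word character on both sides, so it is not a word boundary at all.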

Fixing Java with Java

The code I posted in my other answer provides this and quite a few other conveniences. This includes definitions for natural-language words, dashes, hyphens, and apostrophes, plus a bit more.

It also allows you to specify Unicode characters in logical code points, not in idiotic UTF-16 surrogates. It's hard to overstress how important that is! And that's just for the string expansion.

For regex charclass substitution that makes the charclass in your Java regexes finally work on Unicode, and work correctly, grab the full source from here. You may do with it as you please, of course. If you make fixes to it, I'd love to hear of it, but you don't have to. It's pretty short. The guts of the main regex rewriting function are simple:

switch (code_point) {

    case 'b':  newstr.append(boundary);
               break; /* switch */
    case 'B':  newstr.append(not_boundary);
               break; /* switch */

    case 'd':  newstr.append(digits_charclass);
               break; /* switch */
    case 'D':  newstr.append(not_digits_charclass);
               break; /* switch */

    case 'h':  newstr.append(horizontal_whitespace_charclass);
               break; /* switch */
    case 'H':  newstr.append(not_horizontal_whitespace_charclass);
               break; /* switch */

    case 'v':  newstr.append(vertical_whitespace_charclass);
               break; /* switch */
    case 'V':  newstr.append(not_vertical_whitespace_charclass);
               break; /* switch */

    case 'R':  newstr.append(linebreak);
               break; /* switch */

    case 's':  newstr.append(whitespace_charclass);
               break; /* switch */
    case 'S':  newstr.append(not_whitespace_charclass);
               break; /* switch */

    case 'w':  newstr.append(identifier_charclass);
               break; /* switch */
    case 'W':  newstr.append(not_identifier_charclass);
               break; /* switch */

    case 'X':  newstr.append(legacy_grapheme_cluster);
               break; /* switch */

    default:   newstr.append('\\');
               newstr.append(Character.toChars(code_point));
               break; /* switch */

}
saw_backslash = false;

Anyway, that code is just an alpha release, stuff I hacked up over the weekend. It won't stay that way.
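
For readers who want to see the backslash-tracking idea end to end, here is a much-reduced sketch. It is not the code from the answer: the class name, the two replacement strings, and the decision to handle only \w and \d are all assumptions made to keep the example short.

```java
import java.util.regex.Pattern;

public class EscapeRewriter {
    // Assumed, shortened replacements; the tables in the answer are longer.
    static final String WORD  = "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}]";
    static final String DIGIT = "\\p{Nd}";

    // Walks the pattern by code point and swaps selected escapes for Unicode-aware text.
    static String rewrite(String pattern) {
        StringBuilder out = new StringBuilder();
        boolean sawBackslash = false;
        for (int i = 0; i < pattern.length(); ) {
            int cp = pattern.codePointAt(i);
            i += Character.charCount(cp);
            if (!sawBackslash) {
                if (cp == '\\') {
                    sawBackslash = true;      // hold the backslash until we see what follows
                } else {
                    out.appendCodePoint(cp);
                }
                continue;
            }
            switch (cp) {
                case 'w': out.append(WORD);  break;
                case 'd': out.append(DIGIT); break;
                default:                      // any other escape passes through unchanged
                    out.append('\\').appendCodePoint(cp);
                    break;
            }
            sawBackslash = false;
        }
        if (sawBackslash) {
            out.append('\\');                 // keep a trailing lone backslash as written
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String rewritten = rewrite("\\w+");
        System.out.println(rewritten);
        System.out.println(Pattern.matches(rewritten, "\u00E9l\u00E8ve"));
    }
}
```

Because an escaped backslash is consumed as a pair, a pattern that matches a literal backslash followed by the letter w is left alone rather than being mistaken for the shorthand.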

For the beta I intend to:

fold together the code duplication
provide a clearer interface regarding unescaping string escapes versus augmenting regex escapes
provide some flexibility in the \d expansion, and maybe the \b
provide convenience methods that handle turning around and calling Pattern.compile or String.matches or whatnot for you

For production release, it should have javadoc and a JUnit test suite. I may include my gigatester, but it's not written as JUnit tests.

Addendum

I have good news and bad news.

The good news is that I've now got a very close approximation to an extended grapheme cluster to use for an improved \X.

The bad news is that that pattern is:

(?:(?:\u000D\u000A)|(?:[\u0E40\u0E41\u0E42\u0E43\u0E44\u0EC0\u0EC1\u0EC2\u0EC3\u0EC4\uAAB5\uAAB6\uAAB9\uAABB\uAABC]*(?:[\u1100-\u115F\uA960-\uA97C]+|([\u1100-\u115F\uA960-\uA97C]*((?:[[\u1160-\u11A2\uD7B0-\uD7C6][\uAC00\uAC1C\uAC38]][\u1160-\u11A2\uD7B0-\uD7C6]*|[\uAC01\uAC02\uAC03\uAC04])[\u11A8-\u11F9\uD7CB-\uD7FB]*))|[\u11A8-\u11F9\uD7CB-\uD7FB]+|[^[\p{Zl}\p{Zp}\p{Cc}\p{Cf}&&[^\u000D\u000A\u200C\u200D]]\u000D\u000A])[[\p{Mn}\p{Me}\u200C\u200D\u0488\u0489\u20DD\u20DE\u20DF\u20E0\u20E2\u20E3\u20E4\uA670\uA671\uA672\uFF9E\uFF9F][\p{Mc}\u0E30\u0E32\u0E33\u0E45\u0EB0\u0EB2\u0EB3]]*)|(?s:.))

which in Java you'd write as:

String extended_grapheme_cluster = "(?:(?:\\u000D\\u000A)|(?:[\\u0E40\\u0E41\\u0E42\\u0E43\\u0E44\\u0EC0\\u0EC1\\u0EC2\\u0EC3\\u0EC4\\uAAB5\\uAAB6\\uAAB9\\uAABB\\uAABC]*(?:[\\u1100-\\u115F\\uA960-\\uA97C]+|([\\u1100-\\u115F\\uA960-\\uA97C]*((?:[[\\u1160-\\u11A2\\uD7B0-\\uD7C6][\\uAC00\\uAC1C\\uAC38]][\\u1160-\\u11A2\\uD7B0-\\uD7C6]*|[\\uAC01\\uAC02\\uAC03\\uAC04])[\\u11A8-\\u11F9\\uD7CB-\\uD7FB]*))|[\\u11A8-\\u11F9\\uD7CB-\\uD7FB]+|[^[\\p{Zl}\\p{Zp}\\p{Cc}\\p{Cf}&&[^\\u000D\\u000A\\u200C\\u200D]]\\u000D\\u000A])[[\\p{Mn}\\p{Me}\\u200C\\u200D\\u0488\\u0489\\u20DD\\u20DE\\u20DF\\u20E0\\u20E2\\u20E3\\u20E4\\uA670\\uA671\\uA672\\uFF9E\\uFF9F][\\p{Mc}\\u0E30\\u0E32\\u0E33\\u0E45\\u0EB0\\u0EB2\\u0EB3]]*)|(?s:.))";

¡Tschüß!

Answer 2

Answered by musiKk

It's really unfortunate that \w doesn't work. The proposed solution \p{Alpha} doesn't work for me either.

It seems [\p{L}] catches all Unicode letters. So the Unicode equivalent of \w should be [\p{L}\p{Digit}_].

Answer 3

Answered by Alan Moore

In Java, \w and \d are not Unicode-aware; they only match the ASCII characters, [A-Za-z0-9_] and [0-9]. The same goes for \p{Alpha} and friends (the POSIX "character classes" they're based on are supposed to be locale-sensitive, but in Java they've only ever matched ASCII characters). If you want to match Unicode "word characters" you have to spell it out, e.g. [\pL\p{Mn}\p{Nd}\p{Pc}], for letters, non-spacing modifiers (accents), decimal digits, and connecting punctuation.

However, Java's \b is Unicode-savvy; it uses Character.isLetterOrDigit(ch) and checks for accented letters as well, but the only "connecting punctuation" character it recognizes is the underscore. EDIT: when I try your sample code, it prints "" and élève" as it should (see it on ideone.com).
