How to add features missing from the Java regex implementation?
Original question: http://stackoverflow.com/questions/5767627/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me) and keep the same license.
Asked by Alireza Noori
I'm new to Java. As a .Net developer, I'm very much used to the Regex class in .Net. The Java implementation of Regex (Regular Expressions) is not bad but it's missing some key features.
I wanted to create my own helper class for Java but I thought maybe there is already one available. So is there any free and easy-to-use product available for Regex in Java or should I create one myself?
If I write my own class, where do you think I should share it so that others can use it?
[Edit]
There were complaints that I wasn't addressing the problem with the current Regex class. I'll try to clarify my question.
In .Net the usage of a regular expression is easier than in Java. Since both languages are object oriented and very similar in many aspects, I expect to have a similar experience with using regex in both languages. Unfortunately that's not the case.
Here's a short comparison of the same task in both languages, first in C# and then in Java.
In C#:
string source = "The colour of my bag matches the color of my shirt!";
string pattern = "colou?r";
foreach (Match match in Regex.Matches(source, pattern))
{
    Console.WriteLine(match.Value);
}
In Java:
String source = "The colour of my bag matches the color of my shirt!";
String pattern = "colou?r";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(source);
while (m.find())
{
    System.out.println(source.substring(m.start(), m.end()));
}
I tried to be fair to both languages in the sample code above.
The first thing you notice here is the .Value member of the Match class (compared to using .start() and .end() in Java).
Why should I create two objects when I can call a static function like Regex.Matches or Regex.Match, etc.?
In more advanced usages, the difference shows itself much more. Look at the method Groups, dictionary length, Capture, Index, Length, Success, etc. These are all very necessary features that in my opinion should be available for Java too.
Of course all of these features can be manually added by a custom proxy (helper) class. This is the main reason why I asked this question. We don't have the breeze of Regex in Perl, but at least we can use the .Net approach to Regex, which I think is very cleverly designed.
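For readers comparing the two APIs, here is a minimal sketch of how the .Net members mentioned above roughly map onto java.util.regex.Matcher. The class name MatcherBasics, the method describe and the pattern col(ou?)r are made up for illustration; only the Pattern and Matcher calls are standard JDK API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherBasics {
    // Builds one line per match: matched text, start index, length, first group.
    static String describe(String source, String regex) {
        StringBuilder out = new StringBuilder();
        Matcher m = Pattern.compile(regex).matcher(source);
        while (m.find()) {                                       // roughly Match.Success
            out.append(m.group())                                // roughly Match.Value
               .append(" index=").append(m.start())              // roughly Match.Index
               .append(" length=").append(m.end() - m.start())   // roughly Match.Length
               .append(" group1=").append(m.group(1))            // roughly Match.Groups[1]
               .append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String source = "The colour of my bag matches the color of my shirt!";
        System.out.print(describe(source, "col(ou?)r"));
    }
}
```

Note that m.group() already returns the matched text, so the substring(m.start(), m.end()) call in the question is not strictly needed.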
Answer by tchrist
From your edited example, I can now see what you would like. And you have my sympathies in this, too. Java's regexes are a long, long, long ways from the convenience you find in Ruby or Perl. And they pretty much always will be; this cannot be fixed, so we're stuck with this mess forever — at least in Java. Other JVM languages do a better job at this, especially Groovy. But they still suffer some of the inherent flaws, and can only go so far.
Where to begin? There are the so-called convenience methods of the String class: matches, replaceAll, replaceFirst, and split. These can sometimes be ok in small programs, depending on how you use them. However, they do indeed have several problems, which it appears you have discovered. Here's a partial list of those problems, and what can and cannot be done about them.
1. The inconvenience method is very bizarrely named "matches", but it requires you to pad your regex on both sides to match the entire string. This counter-intuitive sense is contrary to any sense of the word match as used in any previous language, and constantly bites people. Patterns passed into the other 3 inconvenience methods work very unlike this one, because in the other 3, they work like normal patterns work everywhere else; just not in matches. This means you can't just copy your patterns around, even within methods in the same darned class for goodness' sake! And there is no find convenience method to do what every other matcher in the world does. The matches method should have been called something like FullMatch, and there should have been a PartialMatch or find method added to the String class.

2. There is no API that allows you to pass in Pattern.compile flags along with the strings you use for the 4 pattern-related convenience methods of the String class. That means you have to rely on string versions like (?i) and (?x), but those do not exist for all possible Pattern compilation flags. This is highly inconvenient to say the least.

3. The split method does not return the same result in edge cases as split returns in the languages that Java borrowed split from. This is a sneaky little gotcha. How many elements do you think you should get back in the return list if you split the empty string, eh? Java manufactures a fake return element where there should be none, which means you can't distinguish between legit results and bogus ones. It is a serious design flaw that, splitting on a ":", you cannot tell the difference between inputs of "" vs of ":". Aw, gee! Don't people ever test this stuff? And again, the broken and fundamentally unreliable behavior is unfixable: you must never change things, even broken things. It's not ok to break broken things in Java the way it is anywhere else. Broken is forever here.

4. The backslash notation of regexes conflicts with the backslash notation used in strings. This makes it superduper awkward, and error-prone, too, because you have to constantly add lots of backslashes to everything, and it's too easy to forget one and get neither warning nor success. Simple patterns like \b\w+\b become nightmares in typographical excess: "\\b\\w+\\b". Good luck with reading that. Some people use a slash-inverter function on their patterns so that they can write that as "/b/w+/b" instead. Other than reading in your patterns from a string, there is no way to construct your pattern in a WYSIWYG literal fashion; it's always heavy-laden with backslashes. Did you get them all, and enough, and in the right places? If so, it makes it really really hard to read. If it isn't, you probably haven't gotten them all. At least JVM languages like Groovy have figured out the right answer here: give people 1st-class regexes so you don't go nuts. Here's a fair collection of Groovy regex examples showing how simple it can and should be.

5. The (?x) mode is deeply flawed. It doesn't take comments in the Java style of // COMMENT but rather in the shell style of # COMMENT. It doesn't work with multiline strings. It doesn't accept literals as literals, forcing the backslash problems listed above, which fundamentally compromises any attempt at lining things up, like having all comments begin on the same column. Because of the backslashes, you either make them begin on the same column in the source code string and screw them up if you print them out, or vice versa. So much for legibility!

6. It is incredibly difficult (and indeed, fundamentally unfixably broken) to enter Unicode characters in a regex. There is no support for symbolically named characters like \N{QUOTATION MARK}, \N{LATIN SMALL LETTER E WITH GRAVE}, or \N{MATHEMATICAL BOLD CAPITAL C}. That means you're stuck with unmaintainable magic numbers. And you cannot even enter them by code point, either. You cannot use \u0022 for the first one because the Java preprocessor makes that a syntax error. So then you move to \\u0022 instead, which works until you get to the next one, \\u00E8, which cannot be entered that way or it will break the CANON_EQ flag. And the last one is a pure nightmare: its code point is U+1D402, but Java does not support the full Unicode set using their code point numbers in regexes, forcing you to get out your calculator to figure out that that is \uD835\uDC02 or \\uD835\\uDC02 (but not \\uD835\uDC02), madly enough. But you cannot use those in character classes due to a design bug, making it impossible to match, say, [\N{MATHEMATICAL BOLD CAPITAL A}-\N{MATHEMATICAL BOLD CAPITAL Z}] because the regex compiler screws up on the UTF-16. Again, this can never be fixed or it will change old programs. You cannot even get around the bug by using the normal workaround to Java's Unicode-in-source-code troubles by compiling with javac -encoding UTF-8, because the stupid thing stores the strings as nasty UTF-16, which necessarily breaks them in character classes. OOPS!

7. Many of the regex things we've come to rely on in other languages are missing from Java. There are no named groups, for example, nor even relatively-numbered ones. This makes constructing larger patterns out of smaller ones fundamentally error prone. There is a front-end library that allows you to have simple named groups, and indeed this will finally arrive in production JDK7. But even so there is no mechanism for what to do with more than one group by the same name. And you still don't have relatively numbered buffers, either. We're back to the Bad Old Days again, stuff that was solved aeons ago.

8. There is no support for a linebreak sequence, which is one of the only two "Strongly Recommended" parts of the standard, which suggests that \R be used for such. This is awkward to emulate because of its variable-length nature and Java's lack of support for graphemes.

9. The character class escapes do not work on Java's native character set! Yes, that's right: routine stuff like \w and \s (or rather, "\\w" and "\\b") does not work on Unicode in Java! This is not the cool sort of retro. To make matters worse, Java's \b (make that "\\b", which isn't the same as "\b") does have some Unicode sensibility, although not what the standard says it must have. So for example a string like "élève" will never in Java match the pattern \b\w+\b, and not merely in entirety per Pattern.matches, but indeed at no point whatsoever as you might get from Matcher.find. This is just so screwed up as to beggar belief. They've broken the inherent connection between \w and \b, then misdefined them to boot!! It doesn't even know what Unicode Alphabetic code points are. This is supremely broken, and they can never fix it because that would change the behavior of existing code, which is strictly forbidden in the Java Universe. The best you can do is create a rewrite library that acts as a front end before it gets to the compile phase; that way you can forcibly migrate your patterns from the 1960s into the 21st century of text processing.

10. The only two Unicode properties supported are the General Categories and the Block properties. The general category properties only support the abbreviations like \p{Sk}, contrary to the standard's Strong Recommendation to also allow \p{Modifier Symbol}, \p{Modifier_Symbol}, etc. You don't even get the required aliases the standard says you should. That makes your code even more unreadable and unmaintainable. You will finally get support for the Script property in production JDK7, but that is still seriously short of the minimum set of 11 essential properties that the Standard says you must provide for even the minimal level of Unicode support.

11. Some of the meagre properties that Java does provide are faux amis: they have the same names as official Unicode property names, but they do something altogether different. For example, Unicode requires that \p{alpha} be the same as \p{Alphabetic}, but Java makes it the archaic and no-longer-quaint 7-bit alphabetics only, which is more than 4 orders of magnitude too few. Whitespace is another flaw: if you use the Java version that masquerades as Unicode whitespace, your UTF-8 parsers will break because of their NO-BREAK SPACE code points, which Unicode normatively requires be deemed whitespace, but Java ignores that requirement, so breaks your parser.

12. There is no support for graphemes, the way \X normally provides. That renders impossible innumerably many common tasks that you need and want to do with regexes. Not only are extended grapheme clusters out of your reach, because Java supports almost none of the Unicode properties, you cannot even approximate the old legacy grapheme clusters using the standard (?:\p{Grapheme_Base}\p{Grapheme_Extend}*). Not being able to work with graphemes makes even the simplest sorts of Unicode text processing impossible. For example, you cannot match a vowel irrespective of diacritic in Java. The way you do this in a language with grapheme support varies, but at the very least you should be able to throw the thing into NFD and match (?:(?=[aeiou])\X). In Java, you cannot do even that much: graphemes are beyond your reach. And that means Java cannot even handle its own native character set. It gives you Unicode and then makes it impossible to work with it.

13. The convenience methods in the String class do not cache the compiled regex. In fact, there is no such thing as a compile-time pattern that gets syntax-checked at compile time, which is when syntax checking is supposed to occur. That means your program, which uses nothing but constant regexes fully understood at compile time, will bomb out with an exception in the middle of its run if you forget a little backslash here or there, as one is wont to do due to the flaws previously discussed. Even Groovy gets this part right. Regexes are far too high-level a construct to be dealt with by Java's unpleasant after-the-fact, bolted-on-the-side model, and they are far too important to routine text processing to be ignored. Java is much too low-level a language for this stuff, and it fails to provide the simple mechanics out of which you might yourself build what you need: you can't get there from here.

14. The String and Pattern classes are marked final in Java. That completely kills any possibility of using proper OO design to extend those classes. You can't create a better version of a matches method by subclassing and replacement. Heck, you can't even subclass! Final is not a solution; final is a death sentence from which there is no appeal.
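A few of the points in the list above are easy to see in a short program. This is only a sketch: the class MatchVersusFind and its method names are invented for illustration, and the printed values are what the standard JDK calls return for these specific inputs. It covers the whole-string meaning of matches versus find, an embedded (?i) flag, a precompiled Pattern as a way around the missing cache, and two split edge cases.

```java
import java.util.regex.Pattern;

class MatchVersusFind {
    // Compiled once and reused; String.matches compiles its regex on every call.
    private static final Pattern COLOUR = Pattern.compile("colou?r");

    // Whole-string semantics, the same as String.matches and Pattern.matches.
    static boolean fullMatch(String input) {
        return COLOUR.matcher(input).matches();
    }

    // Substring semantics, which is what most other regex APIs call a match.
    static boolean partialMatch(String input) {
        return COLOUR.matcher(input).find();
    }

    public static void main(String[] args) {
        String text = "The colour of my bag";
        System.out.println(fullMatch(text));                  // false
        System.out.println(text.matches(".*colou?r.*"));      // true: pattern padded on both sides
        System.out.println(partialMatch(text));               // true
        System.out.println("COLOUR".matches("(?i)colou?r"));  // true: flag embedded in the pattern
        System.out.println("".split(":").length);             // 1: one empty string comes back
        System.out.println("a::".split(":").length);          // 1: trailing empty strings are dropped
    }
}
```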
Finally, to show you just how brain-damaged Java's regexes truly are, consider this multiline pattern, which shows many of the flaws already described:
// Relies on whitespace and # comments being ignored, so compile it with Pattern.COMMENTS.
String rx =
      "(?= ^ \\p{Lu} [_\\pL\\pM\\d\\-] + $)\n"
    + "   # next is a big can't-have set    \n"
    + "(?! ^ .*                             \n"
    + "    (?: ^     \\d+              $    \n"
    + "      | ^ \\p{Lu} - \\p{Lu}     $    \n"
    + "      | Invitrogen                   \n"
    + "      | Clontech                     \n"
    + "      | L-L-X-X    # dashes ok       \n"
    + "      | Sarstedt                     \n"
    + "      | Roche                        \n"
    + "      | Beckman                      \n"
    + "      | Bayer                        \n"
    + "    )      # end alternatives        \n"
    + "    \\b    # only on a word boundary \n"
    + ")          # end negated lookahead   \n"
    ;
Do you see how unnatural that is? You have to put literal newlines in your strings; you have to use non-Java comments; you cannot make anything line up because of the extra backslashes; you have to use definitions of things that don't work right on Unicode. There are many more problems beyond that.
Not only are there no plans to fix almost any of these grievous flaws, it is indeed impossible to fix almost any of them at all, because doing so would change old programs. Even the normal tools of OO design are forbidden to you because it's all locked down with the finality of a death sentence, and it cannot be fixed.
So Alireza Noori, if you feel Java's clumsy regexes are too hosed for reliable and convenient regex processing ever to be possible in Java, I cannot gainsay you. Sorry, but that's just the way it is.
“Fixed in the Next Release!”
Just because some things can never be fixed does not mean that nothing can ever be fixed. It just has to be done very carefully. Here are the things I know of which are already fixed in current JDK7 or proposed JDK8 builds:
1. The Unicode Script property is now supported. You may use any of the equivalent forms \p{Script=Greek}, \p{sc=Greek}, \p{IsGreek}, or \p{Greek}. This is inherently superior to the old clunky block properties. It means you can do things like [\p{Latin}\p{Common}\p{Inherited}], which is quite important.

2. The UTF-16 bug has a workaround. You may now specify any Unicode code point by its number using the \x{...} notation, such as \x{1D402}. This works even inside character classes, finally allowing [\x{1D400}-\x{1D419}] to work properly. You still must double backslash it though, and it only works in regexes, not strings in general as it really ought to.

3. Named groups are now supported via the standard notation (?<NAME>...) to create it and \k<NAME> to backreference it. These still contribute to numeric group numbers, too. However, you cannot get at more than one of them in the same pattern, nor can you use them for recursion.

4. A new Pattern compile flag, Pattern.UNICODE_CHARACTER_CLASSES, and associated embeddable switch, (?U), will now swap around all the definitions of things like \w, \b, \p{alpha}, and \p{punct}, so that they now conform to the definitions of those things required by The Unicode Standard.

5. The missing or misdefined binary properties \p{IsLowercase}, \p{IsUppercase}, and \p{IsAlphabetic} will now be supported, and these correspond to methods in the Character class. This is important because Unicode makes a significant and pervasive distinction between mere letters and cased or alphabetic code points. These key properties are among those 11 essential properties that are absolutely required for Level 1 compliance with UTS#18, “Unicode Regular Expressions”, without which you really cannot work with Unicode.
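As a rough illustration of the first three items, the sketch below exercises named groups, the \x{...} code point notation, and a Script property on a JDK 7 or later runtime. The class Jdk7RegexFeatures, its method names and the sample date pattern are hypothetical; the regex syntax itself is what java.util.regex.Pattern documents.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Jdk7RegexFeatures {
    // Named groups: (?<name>...) defines one, Matcher.group("name") reads it back.
    static String yearOf(String isoDate) {
        Matcher m = Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})").matcher(isoDate);
        return m.matches() ? m.group("year") : null;
    }

    // \k<name> refers back to whatever the named group captured.
    static boolean isDoubledWord(String s) {
        return Pattern.compile("(?<word>\\w+) \\k<word>").matcher(s).matches();
    }

    // \x{...} names a code point by number, including supplementary ones inside a class.
    static boolean isMathBoldCapital(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return Pattern.compile("[\\x{1D400}-\\x{1D419}]").matcher(s).matches();
    }

    // Script property, written here in the IsGreek form.
    static boolean isAllGreek(String s) {
        return Pattern.compile("\\p{IsGreek}+").matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(yearOf("2011-04-23"));             // 2011
        System.out.println(isDoubledWord("hello hello"));     // true
        System.out.println(isMathBoldCapital(0x1D402));       // true
        System.out.println(isMathBoldCapital('C'));           // false
        System.out.println(isAllGreek("\u03b1\u03b2\u03b3")); // true
    }
}
```

The backslashes still have to be doubled in the string literals, as the answer points out.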
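The effect of the UNICODE_CHARACTER_CLASSES flag on the "élève" example from earlier in the answer can be checked with a small sketch like the following. The class UnicodeWordDemo and the method firstWord are invented names. The behaviour without the flag is deliberately left unasserted, since it may differ between JDK releases.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeWordDemo {
    // Returns the first \b\w+\b match under the given Pattern flags, or null if there is none.
    static String firstWord(String text, int flags) {
        Matcher m = Pattern.compile("\\b\\w+\\b", flags).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The word "élève", spelled with escapes so the source file encoding does not matter.
        String eleve = "\u00e9l\u00e8ve";

        // With the flag, \w and \b follow the Unicode definitions and the whole word matches.
        System.out.println(eleve.equals(firstWord(eleve, Pattern.UNICODE_CHARACTER_CLASSES))); // true

        // The same switch embedded in the pattern, and the IsAlphabetic binary property.
        System.out.println(eleve.matches("(?U)\\w+"));             // true
        System.out.println(eleve.matches("\\p{IsAlphabetic}+"));   // true

        // Without the flag: shown for comparison only, result not asserted here.
        System.out.println(firstWord(eleve, 0));
    }
}
```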
These enhancements and fixes are very important to finally have, and so I am glad, even excited, to have them.
But for industrial-strength, state-of-the-art regex and/or Unicode work, I will not be using Java. There's just too much missing from Java's still-patchy-after-20-years Unicode model to get real work done if you dare to use the character set that Java gives. And the bolted-on-the-side model never works, which is all Java regexes are. You have to start over from first principles, the way Groovy did.
Sure, it might work for very limited applications whose small customer base is limited to English-language monoglots in rural Iowa with no external interactions or any need for characters beyond what an old-style telegraph could send. But for how many projects is that really true? Even fewer than you think, it turns out.
It is for this reason that a certain (and obvious) multi-billion-dollar company just recently cancelled international deployment of an important application. Java's Unicode support (not just in regexes, but throughout) proved to be too weak for the needed internationalization to be done reliably in Java. Because of this, they have been forced to scale back from their originally planned worldwide deployment to a merely U.S. deployment. It's positively parochial. And no, they are NOT HAPPY; would you be?
Java has had 20 years to get it right, and they demonstrably have not done so thus far, so I wouldn't hold my breath. Or throw good money after bad; the lesson here is to ignore the hype and instead apply due diligence to make very sure that all the necessary infrastructure support is there before you invest too much. Otherwise you too may get stuck without any real options once you're too far into it to rescue your project.
Caveat Emptor
Answer by Alistair A. Israel
One can rant, or one can simply write:
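import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Regex {
    /**
     * @param source
     *            the string to scan
     * @param pattern
     *            the regular expression to scan for
     * @return the matched substrings, in order of appearance
     */
    public static Iterable<String> matches(final String source, final String pattern) {
        final Pattern p = Pattern.compile(pattern);
        return new Iterable<String>() {
            @Override
            public Iterator<String> iterator() {
                // A fresh Matcher per iterator, so the Iterable can be traversed more than once.
                final Matcher m = p.matcher(source);
                return new Iterator<String>() {
                    // Result of the pending find(), or null if find() has to be called again.
                    private Boolean found;

                    @Override
                    public boolean hasNext() {
                        // Calling hasNext() repeatedly must not skip matches.
                        if (found == null) {
                            found = m.find();
                        }
                        return found;
                    }

                    @Override
                    public String next() {
                        if (!hasNext()) {
                            throw new NoSuchElementException();
                        }
                        found = null; // the next hasNext() advances the matcher
                        return m.group();
                    }

                    @Override
                    public void remove() {
                        throw new UnsupportedOperationException();
                    }
                };
            }
        };
    }
}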
Used as you wish:
import org.junit.Test;

public class RegexTest {
    @Test
    public void test() {
        String source = "The colour of my bag matches the color of my shirt!";
        String pattern = "colou?r";
        for (String match : Regex.matches(source, pattern)) {
            System.out.println(match);
        }
    }
}
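If the positions and groups the question asks about are needed too, one possible variation on the helper above is to collect java.util.regex.MatchResult snapshots instead of plain strings. This is only a sketch; RegexResults and allMatches are hypothetical names, while Matcher.toMatchResult() is standard JDK API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexResults {
    // One static call that returns every match, loosely in the spirit of .Net's Regex.Matches.
    static List<MatchResult> allMatches(String source, String regex) {
        List<MatchResult> results = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(source);
        while (m.find()) {
            // toMatchResult() takes a snapshot that later find() calls do not change.
            results.add(m.toMatchResult());
        }
        return results;
    }

    public static void main(String[] args) {
        String source = "The colour of my bag matches the color of my shirt!";
        for (MatchResult r : allMatches(source, "colou?r")) {
            System.out.println(r.group() + " at " + r.start() + ".." + r.end());
        }
    }
}
```

Each MatchResult exposes group(), group(int), groupCount(), start() and end(), which covers most of what Value, Groups, Index and Length provide in .Net.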
Answer by Vadzim
Answer by Christo
Boy, do I hear you on that one, Alireza! Regexes are confusing enough without there being so many syntax variations among them. I too do a lot more C# than Java programming and had the same issue.
I found this to be very helpful: http://www.tusker.org/regex/regex_benchmark.html (a list of alternative regular expression implementations for Java, benchmarked).
Answer by Victor Grazi
This one is darned good, if I do say so myself! regex-tester-tool