Original question: http://stackoverflow.com/questions/4515309/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Java regex alternation operator "|" behavior seems broken
Asked by Craig Kovatch
I'm trying to write a regex matcher for Roman numerals. In sed (which I think is considered 'standard' for regex?), if you have multiple options delimited by the alternation operator, it will match the longest. Namely, "I|II|III|IV" will match "IV" for "IV" and "III" for "III".
In Java, the same pattern matches "I" for "IV" and "I" for "III". It turns out Java chooses between alternation matches left-to-right; that is, because "I" appears before "III" in the regex, it matches. If I change the regex to "IV|III|II|I", the behavior is corrected, but this obviously isn't a solution in general.
Is there a way to make Java choose the longest match out of an alternation group, instead of choosing the 'first'?
A code sample for clarity:
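import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationDemo {
    public static void main(String[] args) {
        // Two alternatives where the first ("six") is a prefix of the second ("sixty").
        Pattern p = Pattern.compile("six|sixty");
        Matcher m = p.matcher("The year was nineteen sixty five.");
        if (m.find()) {
            // Print the text matched by the first successful alternative.
            System.out.println(m.group());
        } else {
            System.out.println("wtf?");
        }
    }
}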
This outputs "six"
Answered by Alan Moore
No, it's behaving correctly. Java uses an NFA, or regex-directed flavor, like Perl, .NET, JavaScript, etc., and unlike sed, grep, or awk. An alternation is expected to quit as soon as one of the alternatives matches, not hold out for the longest match.
You can force it to continue by adding a condition after the alternation that can't be met until the whole token has been consumed. What that condition might be depends on the context; the simplest option would be an anchor ($) or a word boundary (\b).
"\b(I|II|III|IV)\b"
EDIT: I should mention that, while grep, sed, awk and others traditionally use text-directed (or DFA) engines, you can also find versions of some of them that use NFA engines, or even hybrids of the two.
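To make the word-boundary approach concrete, here is a minimal sketch in Java. Note that the backslashes have to be doubled inside a Java string literal. The class name and the firstMatch helper are illustrative assumptions, not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Hypothetical helper: returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Without a trailing condition, the first alternative that matches wins.
        System.out.println(firstMatch("I|II|III|IV", "III"));          // I
        // The trailing \b fails after "I" and "II", so the engine backtracks to "III".
        System.out.println(firstMatch("\\b(I|II|III|IV)\\b", "III"));  // III
        System.out.println(firstMatch("\\b(I|II|III|IV)\\b", "IV"));   // IV
        // The same idea applied to the sample sentence from the question.
        System.out.println(firstMatch("\\b(six|sixty)\\b",
                "The year was nineteen sixty five."));                 // sixty
    }
}
```

The boundary only helps when the surrounding text really does end the token there; inside a longer word a different trailing condition would be needed.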
Answered by danben
I think a pattern that will work is something like
IV|I{1,3}
See the "greedy quantifiers" section at http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html
Edit: in response to your comment, I think the general problem is that you keep using alternation when it is not the right thing to use. In your new example, you are trying to match "six" or "sixty"; the right pattern to use is six(ty)?, not six|sixty. In general, if you ever have two members of an alternation group such that one is a prefix of another, you should rewrite the regular expression to eliminate it. Otherwise, you can't really complain that the engine is doing the wrong thing, since the semantics of alternation don't say anything about a longest match.
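A small sketch of the prefix-factoring rewrite, comparing the original alternation with the two rewritten patterns mentioned in this answer. The class name and the firstMatch helper are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrefixFactoringDemo {
    // Hypothetical helper: returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "The year was nineteen sixty five.";
        // Alternation stops at the first alternative that matches.
        System.out.println(firstMatch("six|sixty", text));   // six
        // The optional group is greedy, so "ty" is consumed when it is present.
        System.out.println(firstMatch("six(ty)?", text));    // sixty
        // Same idea for the roman numerals: the greedy quantifier takes as many I's as it can.
        System.out.println(firstMatch("IV|I{1,3}", "III"));  // III
        System.out.println(firstMatch("IV|I{1,3}", "IV"));   // IV
    }
}
```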
Edit 2: the literal answer to your question is no, it can't be forced (and my commentary is that you shouldn't ever need this behavior).
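That said, when the alternatives are plain literal strings and the pattern is built in code, the reordering trick from the question ("IV|III|II|I") can be automated by sorting the alternatives longest-first. This is only a sketch of that workaround, not something the engine offers as a switch; the class and the longestFirst helper are hypothetical names.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class LongestFirstAlternation {
    // Hypothetical helper: builds an alternation of literal strings, longest alternative first,
    // so a longer literal is always tried before any of its prefixes.
    static Pattern longestFirst(String... literals) {
        String regex = Arrays.stream(literals)
                .sorted((a, b) -> Integer.compare(b.length(), a.length()))
                .map(Pattern::quote) // treat each alternative as literal text
                .collect(Collectors.joining("|"));
        return Pattern.compile(regex);
    }

    public static void main(String[] args) {
        Matcher m = longestFirst("six", "sixty").matcher("The year was nineteen sixty five.");
        if (m.find()) {
            System.out.println(m.group()); // sixty
        }
    }
}
```

This only works for literal alternatives; once the alternatives contain quantifiers or groups, their lengths are not known in advance and the sort no longer guarantees anything.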
Edit 3: thinking more about the subject, it occurred to me that an alternation pattern where one string is the prefix of another is undesirable for another reason; namely, it will be slower unless the underlying automaton is constructed to take prefixes into account (and given that Java picks the first match in the pattern, I would guess that this is not the case).