Original source: http://stackoverflow.com/questions/1313934/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How are nested capturing groups numbered in regular expressions?
Asked by Alan Storm
Is there a defined behavior for how regular expressions should handle the capturing behavior of nested parentheses? More specifically, can you reasonably expect that different engines will capture the outer parentheses in the first position, and nested parentheses in subsequent positions?
Consider the following PHP code (using PCRE regular expressions):
<?php
$test_string = 'I want to test sub patterns';
preg_match('{(I (want) (to) test) sub (patterns)}', $test_string, $matches);
print_r($matches);
?>
Array
(
[0] => I want to test sub patterns //entire pattern
[1] => I want to test //entire outer parenthesis
[2] => want //first inner
[3] => to //second inner
[4] => patterns //next parentheses set
)
The entire parenthesized expression is captured first (I want to test), and then the inner parenthesized patterns are captured next ("want" and "to"). This makes logical sense, but I could see an equally logical case being made for first capturing the sub parentheses, and THEN capturing the entire pattern.
So, is this "capture the entire thing first" behavior defined across regular expression engines, or does it depend on the context of the pattern and/or the behavior of the engine (PCRE differing from C#'s, which differs from Java's, and so on)?
Accepted answer by daotoad
From perlrequick
If the groupings in a regex are nested, $1 gets the group with the leftmost opening parenthesis, $2 the next opening parenthesis, etc.
Caveat: opening parentheses that do not start a capturing group, such as those of (?:...) and (?=...), are excluded from the count.
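As a rough illustration of the rule and the caveat, here is a minimal sketch using Python's re module (chosen only for illustration; the thread itself uses PHP and Perl). It reuses the question's test string; the second pattern is an assumed variant that swaps the outer group for a non-capturing one and adds a lookahead.

```python
import re

text = "I want to test sub patterns"

# Four capturing groups, numbered by the position of each opening parenthesis.
capturing = re.compile(r"(I (want) (to) test) sub (patterns)")
m = capturing.match(text)
print(m.groups())
# ('I want to test', 'want', 'to', 'patterns')

# Assumed variant: the outer group is non-capturing and a lookahead is added.
# Neither (?:...) nor (?=...) takes a group number.
mixed = re.compile(r"(?:I (want) (?=to)(to) test) sub (patterns)")
m2 = mixed.match(text)
print(m2.groups())
# ('want', 'to', 'patterns')
```

In the second pattern only three groups are numbered, so "want" moves up to position 1.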
Update
I don't use PCRE much, as I generally use the real thing ;), but PCRE's docs show the same as Perl's:
SUBPATTERNS
2. It sets up the subpattern as a capturing subpattern. This means that, when the whole pattern matches, that portion of the subject string that matched the subpattern is passed back to the caller via the ovector argument of pcre_exec(). Opening parentheses are counted from left to right (starting from 1) to obtain numbers for the capturing subpatterns. For example, if the string "the red king" is matched against the pattern
the ((red|white) (king|queen))
the captured substrings are "red king", "red", and "king", and are numbered 1, 2, and 3, respectively.
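The same example can be checked in other engines. A minimal sketch in Python's re module (an assumption for illustration; the quoted text describes PCRE's C API, not Python):

```python
import re

# The pattern and subject string from the PCRE documentation excerpt.
m = re.search(r"the ((red|white) (king|queen))", "the red king")

# Groups are numbered by their opening parentheses, left to right.
for number in (1, 2, 3):
    print(number, m.group(number))
```

Group 1 is the outer pair, and groups 2 and 3 are the two inner pairs in the order they open.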
If PCRE is drifting away from Perl regex compatibility, perhaps the acronym should be redefined--"Perl Cognate Regular Expressions", "Perl Comparable Regular Expressions" or something. Or just divest the letters of meaning.
Answer by Devin Ceartas
Numbering captures by the order of the left parenthesis is standard across all the platforms I've worked in (Perl, PHP, Ruby, egrep).
Answer by Alan Moore
Every regex flavor I know numbers groups by the order in which the opening parentheses appear. That outer groups are numbered before their contained sub-groups is just a natural outcome, not explicit policy.
Where it gets interesting is with named groups. In most cases, they follow the same policy of numbering by the relative positions of the parens--the name is merely an alias for the number. However, in .NET regexes the named groups are numbered separately from numbered groups. For example:
Regex.Replace(@"one two three four",
@"(?<one>\w+) (\w+) (?<three>\w+) (\w+)",
@"$1 $2 $3 $4")
// result: "two four one three"
In effect, the number is an alias for the name; the numbers assigned to named groups start where the "real" numbered groups leave off. That may seem like a bizarre policy, but there's a good reason for it: in .NET regexes you can use the same group name more than once in a regex. That makes possible regexes like the one from this thread for matching floating-point numbers from different locales:
^[+-]?[0-9]{1,3}
(?:
(?:(?<thousand>\,)[0-9]{3})*
(?:(?<decimal>\.)[0-9]{2})?
|
(?:(?<thousand>\.)[0-9]{3})*
(?:(?<decimal>\,)[0-9]{2})?
|
[0-9]*
(?:(?<decimal>[\.\,])[0-9]{2})?
)$
If there's a thousands separator, it will be saved in group "thousand" no matter which part of the regex matched it. Similarly, the decimal separator (if there is one) will always be saved in group "decimal". Of course, there are ways to identify and extract the separators without reusable named groups, but this way is so much more convenient, I think it more than justifies the weird numbering scheme.
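For comparison, here is one way the "without reusable named groups" route might look. This is a minimal sketch in Python's re module, which does not accept a repeated group name; the names t1, t2, d1, d2, d3 and the helper separators() are hypothetical stand-ins, not part of the original regex.

```python
import re

# Hypothetical names: t1/t2 replace the repeated "thousand" group,
# d1/d2/d3 replace the repeated "decimal" group of the .NET pattern.
NUMBER = re.compile(r"""
    ^[+-]?[0-9]{1,3}
    (?:
        (?:(?P<t1>,)[0-9]{3})*
        (?:(?P<d1>\.)[0-9]{2})?
      |
        (?:(?P<t2>\.)[0-9]{3})*
        (?:(?P<d2>,)[0-9]{2})?
      |
        [0-9]*
        (?:(?P<d3>[.,])[0-9]{2})?
    )$
""", re.VERBOSE)


def separators(text):
    """Return (thousand, decimal) separators, or None if text is not a number."""
    m = NUMBER.match(text)
    if m is None:
        return None
    g = m.groupdict()
    # Coalesce the per-branch groups by hand; this is the bookkeeping
    # that reusable group names make unnecessary.
    thousand = g["t1"] or g["t2"]
    decimal = g["d1"] or g["d2"] or g["d3"]
    return thousand, decimal


print(separators("1,234.56"))
print(separators("1.234,56"))
print(separators("1234.56"))
```

The extra coalescing step in separators() is the inconvenience the answer refers to.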
And then there's Perl 5.10+, which gives us more control over capturing groups than I know what to do with. :D
Answer by Alan Donnelly
Yeah, this is all pretty much well defined for all the languages you're interested in:
- Java - http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html#cg
  "Capturing groups are numbered by counting their opening parentheses from left to right. ... Group zero always stands for the entire expression."
- .Net - http://msdn.microsoft.com/en-us/library/bs2twtah(VS.71).aspx
  "Captures using () are numbered automatically based on the order of the opening parenthesis, starting from one. The first capture, capture element number zero, is the text matched by the whole regular expression pattern."
- PHP (PCRE functions) - http://www.php.net/manual/en/function.preg-replace.php#function.preg-replace.parameters
  "\0 or $0 refers to the text matched by the whole pattern. Opening parentheses are counted from left to right (starting from 1) to obtain the number of the capturing subpattern." (It was also true of the deprecated POSIX functions)
- PCRE - http://www.pcre.org/pcre.txt
  To add to what Alan M said, search for "How pcre_exec() returns captured substrings" and read the fifth paragraph that follows:
  The first pair of integers, ovector[0] and ovector[1], identify the portion of the subject string matched by the entire pattern. The next pair is used for the first capturing subpattern, and so on. The value returned by pcre_exec() is one more than the highest numbered pair that has been set. For example, if two substrings have been captured, the returned value is 3. If there are no capturing subpatterns, the return value from a successful match is 1, indicating that just the first pair of offsets has been set.
- Perl's different - http://perldoc.perl.org/perlre.html#Capture-buffers
  $1, $2 etc. match capturing groups as you'd expect (i.e. by occurrence of opening bracket), however $0 returns the program name, not the entire query string - to get that you use $& instead.
You'll more than likely find similar results for other languages (Python, Ruby, and others).
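As one quick check of the Python case, here is a minimal sketch with the standard re module. The date string and the group name year_month are made up for illustration.

```python
import re

# Hypothetical pattern: an outer named group wrapping two inner groups,
# followed by one more group.
pattern = re.compile(r"(?P<year_month>(\d{4})-(\d{2}))-(\d{2})")
m = pattern.match("2009-08-21")

print(m.group(0))   # group zero: the text matched by the whole pattern
print(m.groups())   # groups 1..4, numbered by opening parenthesis

# In Python the name is an alias for the positional number.
print(pattern.groupindex["year_month"], m.group("year_month"))
```

Group 0 is the whole match, and the named outer group is simply group 1 under another name.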
You say that it's equally logical to list the inner capture groups first, and you're right: it would just be a matter of indexing on closing, rather than opening, parens (if I understand you correctly). Doing this is less natural though (for example, it doesn't follow reading-direction convention), and so makes it more difficult (probably not significantly) to determine, by inspection, which capturing group will be at a given result index.
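To make the two orderings concrete, here is a small sketch that lists a pattern's groups both ways. The helper group_orders() is hypothetical and deliberately simplified: it assumes the pattern contains no escaped parentheses, character classes, or (?...) constructs.

```python
def group_orders(pattern):
    """Return (by_open, by_close): group texts ordered by opening
    parenthesis and by closing parenthesis.

    Simplified sketch: every "(" is treated as opening a capturing group.
    """
    stack = []   # indices of currently open "("
    spans = []   # (open_index, close_index), appended in closing order
    for i, ch in enumerate(pattern):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            spans.append((stack.pop(), i))
    by_close = [pattern[a:b + 1] for a, b in spans]
    by_open = [pattern[a:b + 1] for a, b in sorted(spans)]
    return by_open, by_close


by_open, by_close = group_orders("(I (want) (to) test) sub (patterns)")
print(by_open)
print(by_close)
```

With opening-paren order the outer group comes first; with closing-paren order it would land third, after "(want)" and "(to)".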
Putting the entire matched string in position 0 also makes sense, mostly for consistency. It allows the entire matched string to remain at the same index regardless of the number of capturing groups from regex to regex, and regardless of the number of capturing groups that actually match anything (an implementation may, for example, shorten the array of matched groups when a capturing group does not match any content; think of something like "a (.*)pattern"). You could always inspect capturing_group_results[capturing_group_results_length - 2], but that doesn't translate well to languages like Perl that dynamically create variables ($1, $2, etc.). (Perl's a bad example of course, since it uses $& for the matched expression, but you get the idea :)
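A minimal sketch of the consistency point, again in Python's re module (other engines may report non-participating groups differently). The pattern is a made-up example with an optional group.

```python
import re

pattern = re.compile(r"a (b)?(c)")

hit = pattern.match("a bc")    # the optional group participates
miss = pattern.match("a c")    # the optional group does not participate

for m in (hit, miss):
    # The whole match is always at index 0, whatever the groups did.
    print(repr(m.group(0)), m.groups())
```

In Python a group that did not participate is reported as None, and the whole match stays at index 0 either way.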