Java regex capturing groups indexes

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/16517689/

Time: 2020-08-16 07:14:31  Source: igfitidea

Java regex capturing groups indexes

java regex

Asked by P basak

I have the following line,

typeName="ABC:xxxxx;";

I need to fetch the word ABC,

I wrote the following code snippet,

Pattern pattern4 = Pattern.compile("(.*):");
Matcher matcher = pattern4.matcher(typeName);

String nameStr = "";
if (matcher.find()) {
    // group(1) is the text captured by the first pair of parentheses
    nameStr = matcher.group(1);
}

So if I use group(0) I get ABC:, but if I use group(1) I get ABC, so I want to know:

  1. What do these 0 and 1 mean? It would be better if anyone could explain this to me with good examples.

  2. The regex pattern contains a : in it, so why does the group(1) result omit it? Does group 1 capture everything inside the parentheses?

  3. So, if I use two sets of parentheses, such as \\s*(\d*)(.*), will there then be two groups? Will group(1) return the (\d*) part and group(2) return the (.*) part?

The code snippet was given to clear up my confusion. It is not the code I am dealing with. The code given above can be done with String.split() in a much easier way.

Accepted answer by nhahtdh

Capturing and grouping

A capturing group (pattern) creates a group that has the capturing property.

A related one that you might often see (and use) is (?:pattern), which creates a group without the capturing property, hence the name non-capturing group.

A group is usually used when you need to repeat a sequence of patterns, e.g. (\.\w+)+, or to specify where alternation should take effect, e.g. ^(0*1|1*0)$ (^, then 0*1 or 1*0, then $) versus ^0*1|1*0$ (^0*1 or 1*0$).
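
The effect of grouping on alternation can be checked with a minimal sketch like the one below. The class name AlternationScopeDemo, the helper found, and the sample inputs are made up for illustration; only Pattern and Matcher.find come from java.util.regex.

```java
import java.util.regex.Pattern;

class AlternationScopeDemo {
    // True if the regex matches anywhere in the input (Matcher.find semantics).
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String grouped = "^(0*1|1*0)$";  // ^ and $ apply to both alternatives
        String ungrouped = "^0*1|1*0$";  // ^ belongs to 0*1 only, $ to 1*0 only

        System.out.println(found(grouped, "001"));     // true
        System.out.println(found(grouped, "001x"));    // false: $ is required
        System.out.println(found(ungrouped, "001x"));  // true: ^0*1 matches "001"
        System.out.println(found(ungrouped, "x110"));  // true: 1*0$ matches "110"

        // A group also lets a whole sequence be repeated.
        System.out.println("java.util.regex".matches("\\w+(\\.\\w+)+")); // true
    }
}
```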

A capturing group, apart from grouping, will also record the text matched by the pattern inside the capturing group (pattern). Using your example, (.*):, .* matches ABC and : matches :, and since .* is inside the capturing group (.*), the text ABC is recorded for capturing group 1.
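
As a small sketch of this, using the typeName value from the question: the class name CaptureDemo and the helper firstGroup are hypothetical, introduced only for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaptureDemo {
    // Returns the text of the given group for the first match, or null if nothing matches.
    static String firstGroup(String regex, String input, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        String typeName = "ABC:xxxxx;";
        System.out.println(firstGroup("(.*):", typeName, 0)); // whole match: ABC:
        System.out.println(firstGroup("(.*):", typeName, 1)); // captured text: ABC

        // (?:...) groups without capturing, so this pattern has no group 1.
        System.out.println(Pattern.compile("(?:.*):").matcher(typeName).groupCount()); // 0
    }
}
```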

Group number

The whole pattern is defined to be group number 0.

Capturing groups in the pattern are indexed starting from 1. The indices are defined by the order of the opening parentheses of the capturing groups. As an example, here are all 5 capturing groups in the pattern below:

(group)(?:non-capturing-group)(g(?:ro|u)p( (nested)inside)(another)group)(?=assertion)
|     |                       |          | |      |      ||       |     |
1-----1                       |          | 4------4      |5-------5     |
                              |          3---------------3              |
                              2-----------------------------------------2
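
The numbering in the diagram can be checked with a sketch like the following. The input string is an assumption, constructed only so that the pattern matches; the class name GroupNumberDemo and the helper match are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumberDemo {
    static final String REGEX =
        "(group)(?:non-capturing-group)(g(?:ro|u)p( (nested)inside)(another)group)(?=assertion)";
    // Made-up input that the pattern above matches from the first character.
    static final String INPUT =
        "groupnon-capturing-groupgrop nestedinsideanothergroupassertion";

    // Returns a Matcher positioned on the first match.
    static Matcher match() {
        Matcher m = Pattern.compile(REGEX).matcher(INPUT);
        if (!m.find()) {
            throw new IllegalStateException("no match");
        }
        return m;
    }

    public static void main(String[] args) {
        Matcher m = match();
        // groupCount() counts capturing groups only; group 0 is not included.
        System.out.println(m.groupCount()); // 5
        for (int i = 0; i <= m.groupCount(); i++) {
            System.out.println(i + ": [" + m.group(i) + "]");
        }
        // 1: [group]
        // 2: [grop nestedinsideanothergroup]
        // 3: [ nestedinside]
        // 4: [nested]
        // 5: [another]
    }
}
```

The lookahead (?=assertion) consumes no text, so group 0 ends right before "assertion".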

The group numbers are used in the back-reference \n in a pattern and $n in a replacement string.
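
A minimal sketch of both uses, with made-up sample strings; the class name BackReferenceDemo and its two helpers are hypothetical names for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackReferenceDemo {
    // \1 in the pattern must match the same text that group 1 captured.
    static String firstRepeatedWord(String input) {
        Matcher m = Pattern.compile("\\b(\\w+) \\1\\b").matcher(input);
        return m.find() ? m.group(1) : null;
    }

    // $1 and $2 in the replacement string refer to groups 1 and 2 of the match.
    static String swapWords(String input) {
        return input.replaceAll("(\\w+) (\\w+)", "$2 $1");
    }

    public static void main(String[] args) {
        System.out.println(firstRepeatedWord("this is is a test")); // is
        System.out.println(swapWords("John Smith"));                // Smith John
    }
}
```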

In other regex flavors (PCRE, Perl), they can also be used in sub-routine calls.

You can access the text matched by a certain group with Matcher.group(int group). The group numbers can be identified with the rule stated above.

In some regex flavors (PCRE, Perl), there is a branch reset feature which allows you to use the same number for capturing groups in different branches of alternation.

Group name

From Java 7, you can define a named capturing group (?<name>pattern), and you can access the matched content with Matcher.group(String name). The regex is longer, but the code is more meaningful, since it indicates what you are trying to match or extract with the regex.

The group names are used in the back-reference \k<name> in a pattern and ${name} in a replacement string.
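
The three uses of a group name might look like this in practice. This is only a sketch: the group names prefix, word, first and last, the class name NamedGroupDemo, and the sample strings are all made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupDemo {
    // Access a named group with Matcher.group(String name).
    static String typePrefix(String input) {
        Matcher m = Pattern.compile("(?<prefix>.*):").matcher(input);
        return m.find() ? m.group("prefix") : null;
    }

    // \k<word> must match the same text that the group named "word" captured.
    static boolean hasRepeatedWord(String input) {
        return Pattern.compile("\\b(?<word>\\w+) \\k<word>\\b").matcher(input).find();
    }

    // ${name} in the replacement string refers to a named group.
    static String swapNames(String input) {
        return input.replaceAll("(?<first>\\w+) (?<last>\\w+)", "${last} ${first}");
    }

    public static void main(String[] args) {
        System.out.println(typePrefix("ABC:xxxxx;"));             // ABC
        System.out.println(hasRepeatedWord("this is is a test")); // true
        System.out.println(swapNames("John Smith"));              // Smith John
    }
}
```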

Named capturing groups are still numbered with the same numbering scheme, so they can also be accessed via Matcher.group(int group).

Internally, Java's implementation just maps from the name to the group number. Therefore, you cannot use the same name for 2 different capturing groups.
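
Both points can be observed with a short sketch. The pattern, the group names year and month, the sample input, and the class name NamedNumberingDemo are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class NamedNumberingDemo {
    // True if the same text comes back by name and by number for the first group.
    static boolean nameAndNumberAgree(String input) {
        Matcher m = Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})").matcher(input);
        return m.matches()
            && m.group("year").equals(m.group(1))
            && m.group("month").equals(m.group(2));
    }

    // True if the regex compiles, false if Pattern.compile rejects it.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(nameAndNumberAgree("2020-08"));  // true
        // Reusing a group name is rejected at compile time.
        System.out.println(compiles("(?<n>a)|(?<n>b)"));    // false
    }
}
```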

Answer by Michal Borek

Parentheses () are used to enable grouping of regex phrases.

group(1) contains the string matched between the parentheses (.*), so .* in this case.

And group(0) contains the whole matched string.

If you had more groups (read: (...)), they would be put into groups with the next indexes (2, 3 and so on).
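
For the two-group pattern from the question, a minimal sketch could look like this. The input "  123abc", the class name TwoGroupsDemo, and the helper groups are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoGroupsDemo {
    // Returns { group(0), group(1), group(2) } for the first match, or null if none.
    static String[] groups(String input) {
        Matcher m = Pattern.compile("\\s*(\\d*)(.*)").matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(0), m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] g = groups("  123abc");
        System.out.println("[" + g[0] + "]"); // [  123abc]  whole match
        System.out.println("[" + g[1] + "]"); // [123]       (\d*)
        System.out.println("[" + g[2] + "]"); // [abc]       (.*)
    }
}
```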

Answer by Michael Sims

For The Rest Of Us

Here is a simple and clear example of how this works:

Regex: ([a-zA-Z0-9]+)([\s]+)([a-zA-Z ]+)([\s]+)([0-9]+)

String: "!* UserName10 John Smith 01123 *!"

group(0): UserName10 John Smith 01123
group(1): UserName10
group(2):  
group(3): John Smith
group(4):  
group(5): 01123

As you can see, I have created FIVE groups which are each enclosed in parentheses.

I included the !* and *! on either side to make it clearer. Note that none of those characters are in the RegEx and therefore will not be produced in the results. Group(0) merely gives you the entire matched string (all of my search criteria in one single line). Group 1 stops right before the first space because the space character was not included in the search criteria. Groups 2 and 4 are simply the white space, which in this case is literally a space character, but could also be a tab or a line feed etc. Group 3 includes the space because I put it in the search criteria ... etc.
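
The regex and string above can be run as follows. This is a sketch; the class name FiveGroupsDemo and the helper groups are hypothetical, and the brackets in the output are added only to make the whitespace groups visible.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FiveGroupsDemo {
    static final Pattern PATTERN =
        Pattern.compile("([a-zA-Z0-9]+)([\\s]+)([a-zA-Z ]+)([\\s]+)([0-9]+)");

    // Returns groups 0..groupCount() of the first match, or null if nothing matches.
    static String[] groups(String input) {
        Matcher m = PATTERN.matcher(input);
        if (!m.find()) {
            return null;
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i < result.length; i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] g = groups("!* UserName10 John Smith 01123 *!");
        for (int i = 0; i < g.length; i++) {
            System.out.println("group(" + i + "): [" + g[i] + "]");
        }
    }
}
```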

Hope this makes sense.
