Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/1407435/

Time: 2020-11-03 22:08:54  Source: igfitidea

How do I regex match with grouping with unknown number of groups

python regex

Asked by Lorin Hochstein

I want to do a regex match (in Python) on the output log of a program. The log contains some lines that look like this:

... 
VALUE 100 234 568 9233 119
... 
VALUE 101 124 9223 4329 1559
...

I would like to capture the list of numbers that occurs after the first incidence of the line that starts with VALUE. i.e., I want it to return ('100','234','568','9233','119'). The problem is that I do not know in advance how many numbers there will be.

I tried to use this as a regex:

VALUE (?:(\d+)\s)+

This matches the line, but it only captures the last value, so I just get ('119',).
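
The behaviour described above can be reproduced with a short sketch. It assumes the line ends with a newline, as it would when read from a log file, so that the final `\s` has something to match. The variable names are illustrative only.

```python
import re

# Trailing newline mimics a line read from a log file.
line = "VALUE 100 234 568 9233 119\n"

m = re.match(r"VALUE (?:(\d+)\s)+", line)

# A group inside a repetition is overwritten on every iteration,
# so only the text from the last iteration is kept.
last_only = m.groups()
print(last_only)
```

Python's `re` module keeps a single slot per capturing group, which is why the earlier numbers are lost no matter how many times the group repeats.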

Accepted answer by Greg Hewgill

What you're looking for is a parser, instead of a regular expression match. In your case, I would consider using a very simple parser, split():

s = "VALUE 100 234 568 9233 119"
a = s.split()
if a[0] == "VALUE":
    print([int(x) for x in a[1:]])

You can use a regular expression to check whether your input line matches the expected format (using the regex in your question). Then you can run the above code without having to check for "VALUE", knowing that the int(x) conversion will always succeed, since you've already confirmed that the following character groups are all digits.
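
A minimal sketch of that validate-then-split idea follows. The helper name `parse_value_line` and the pattern are assumptions made for illustration: the pattern is a variant of the one in the question that tolerates a missing trailing newline, not the question's regex verbatim.

```python
import re

# Assumed validation pattern: "VALUE" followed by one or more
# whitespace-separated digit runs, and nothing else on the line.
VALUE_LINE = re.compile(r"VALUE(?:\s+\d+)+\s*$")

def parse_value_line(line):
    """Return the integers on a VALUE line, or None if the format is wrong."""
    if not VALUE_LINE.match(line):
        return None
    # Safe after validation: every field after "VALUE" is all digits.
    return [int(x) for x in line.split()[1:]]

print(parse_value_line("VALUE 100 234 568 9233 119"))
print(parse_value_line("VALUE 12 abc"))
```

Because the regex only validates and `split()` does the extraction, the number of fields never has to be known in advance.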

Answer by Ian Clelland

>>> import re
>>> reg = re.compile(r'\d+')
>>> reg.findall('VALUE 100 234 568 9233 119')
['100', '234', '568', '9233', '119']

That doesn't validate that the keyword 'VALUE' appears at the beginning of the string, and it doesn't validate that there is exactly one space between items, but if you can do that as a separate step (or if you don't need to do that at all), then it will find all digit sequences in any string.
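
One way to do the keyword check as a separate step is sketched below. The function name `first_value_numbers` and the sample log text are made up for illustration. It scans the log line by line and applies `findall` only to the first line that starts with `VALUE`, which is what the question asks for.

```python
import re

def first_value_numbers(log_text):
    """Return the digit strings from the first line starting with 'VALUE '."""
    for line in log_text.splitlines():
        if line.startswith("VALUE "):      # separate validation step
            return re.findall(r"\d+", line)
    return []                              # no VALUE line in the log

# Hypothetical log contents, modelled on the question.
log = "starting up\nVALUE 100 234 568 9233 119\nmore output\nVALUE 101 124 9223 4329 1559\n"
print(first_value_numbers(log))
```

This still does not check the spacing between items; it only guarantees that the digits come from a `VALUE` line.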

Answer by Scottmas

Another option not described here is to have a bunch of optional capturing groups.

VALUE *(\d+)? *(\d+)? *(\d+)? *(\d+)? *(\d+)? *$

This regex captures up to 5 digit groups separated by spaces. If you need more potential groups, just copy and paste more *(\d+)? blocks.
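
A quick sketch of how this pattern behaves in Python, using an example line with only three numbers. Optional groups that do not take part in the match are reported as `None`, so they need to be filtered out. The variable names are illustrative.

```python
import re

pattern = re.compile(r"VALUE *(\d+)? *(\d+)? *(\d+)? *(\d+)? *(\d+)? *$")

m = pattern.match("VALUE 100 234 568")

# Unused optional groups come back as None; drop them.
numbers = [g for g in m.groups() if g is not None]
print(numbers)

# A line with more numbers than groups does not match at all.
too_many = pattern.match("VALUE 1 2 3 4 5 6")
print(too_many)
```

The fixed upper bound is the main trade-off here: the pattern has to be written for the largest number of values you expect.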

Answer by Chris J

You could just run your main match regex, then run a secondary regex on those matches to get the numbers:

matches = Regex.Match(log)

foreach (Match match in matches)
{
    submatches = Regex2.Match(match)
}

This is, of course, also an option if you don't want to write a full parser.
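
The pseudocode above might look like this in Python. This is a sketch under assumptions: the two patterns and the sample log are chosen for illustration and are not part of the original answer.

```python
import re

# Sample log modelled on the question.
log = "...\nVALUE 100 234 568 9233 119\n...\nVALUE 101 124 9223 4329 1559\n...\n"

# Main regex: whole VALUE lines. re.MULTILINE makes ^ and $ work per line.
line_re = re.compile(r"^VALUE(?: \d+)+$", re.MULTILINE)
# Secondary regex: the individual numbers inside a matched line.
number_re = re.compile(r"\d+")

all_values = [number_re.findall(m.group(0)) for m in line_re.finditer(log)]
print(all_values[0])   # numbers from the first VALUE line
```

`finditer` yields one match object per `VALUE` line, so the first element of `all_values` corresponds to the first occurrence the question asks about.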

Answer by Christian

I had this same problem and my solution was to use two regular expressions: the first one to match the whole group I'm interested in and the second one to parse the sub groups. For example in this case, I'd start with this:

VALUE((\s\d+)+)

This should result in three groups: [0] the whole match, [1] everything after VALUE, and [2] the last space+value pair.

[0] and [2] can be ignored and then [1] can be used with the following:

\s(\d+)

Note: these regexps were not tested, I hope you get the idea though.
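
For what it's worth, the two patterns can be tried in Python as follows. This is a small sketch using the sample line from the question; the variable names are illustrative.

```python
import re

line = "VALUE 100 234 568 9233 119"

outer = re.match(r"VALUE((\s\d+)+)", line)
whole = outer.group(0)   # the whole match
rest = outer.group(1)    # everything after VALUE
last = outer.group(2)    # only the last space+value pair

# Second pass over group 1: findall returns the contents of the single group.
numbers = re.findall(r"\s(\d+)", rest)
print(numbers)
```

With one capturing group in the pattern, `re.findall` returns just that group's text for each match, so the leading whitespace is dropped automatically.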

Greg's answer doesn't work for me because the second part of my parsing is more complicated than a few numbers separated by spaces.

However, I would honestly go with Greg's solution for this question (it's probably way more efficient).

I'm just writing this answer in case someone is looking for a more sophisticated solution like I needed.

Answer by H. Chan

You can use re.match to check the format first, then call re.split with a regex as the separator to split the line.

>>> import re
>>> s = "VALUE 100 234 568 9233 119"
>>> sep = r"\s+"
>>> reg = re.compile(r"VALUE(%s\d+)+"%(sep)) # OR r"VALUE(\s+\d+)+"
>>> reg_sep = re.compile(sep)
>>> if reg.match(s): # OR re.match(r"VALUE(\s+\d+)+", s)
...     result = reg_sep.split(s)[1:] # OR re.split(r"\s+", s)[1:]
>>> result
['100', '234', '568', '9233', '119']

The separator "\s+" can be more complicated.
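
As an illustration of a more complicated separator, the sketch below assumes the values may be separated by commas, whitespace, or both. The input string and the separator pattern are invented for this example.

```python
import re

# Hypothetical input mixing commas and spaces between values.
s = "VALUE 100, 234 568 ,9233,119"

# Assumed separator: a comma with optional surrounding whitespace,
# or plain whitespace. Non-capturing so re.split omits the separators.
sep = r"(?:\s*,\s*|\s+)"

reg = re.compile(r"VALUE(?:%s\d+)+$" % sep)
reg_sep = re.compile(sep)

result = reg_sep.split(s)[1:] if reg.match(s) else []
print(result)
```

The separator group is non-capturing on purpose: `re.split` includes the text of any capturing groups in its result, which would mix separators in with the numbers.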
