Recursive patterns in Python regular expressions

Question

Asked by Andy Hayden

This is very much related to "Regular Expression to match outer brackets"; however, I specifically want to know how, or whether, it's possible to do this with regex's recursive pattern. I have yet to find a Python example using this strategy, so I think this ought to be a useful question!

I've seen some claims that recursive patterns can be used to match balanced parentheses, but no examples using Python's regex package. (Note: re does not support recursive patterns; you need to use regex.)

One claim is that the syntax is b(?:m|(?R))*e where:

b is what begins the construct, m is what can occur in the middle of the construct, and e is what can occur at the end of the construct.

I want to extract matches for the outer braces in the following:

"{1, {2, 3}} {4, 5}"
["1, {2, 3}", "4, 5"]  # desired

Note that it is easy to do the same for the inner braces:

re.findall(r"{([^{}]*)}", "{1, {2, 3}} {4, 5}")
['2, 3', '4, 5']

(In my example I was using finditer (over match objects), see here.)
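
For reference, the inner-brace case with match objects might look like the following minimal sketch. It uses only the standard-library re module; the helper name inner_braces is illustrative and does not come from the original post.

```python
import re

# Innermost braces only: the body may not contain '{' or '}'.
INNER = re.compile(r"\{([^{}]*)\}")

def inner_braces(text):
    """Return (start, end, body) for each innermost {...} group."""
    return [(m.start(), m.end(), m.group(1)) for m in INNER.finditer(text)]

print(inner_braces("{1, {2, 3}} {4, 5}"))
# [(4, 10, '2, 3'), (12, 18, '4, 5')]
```

The outer group "{1, {2, 3}}" is never reported here, because its body contains braces; that is the part that needs recursion.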

So I had hoped that the following, or some variation, would work:

regex.findall(r"{(:[^{}]*|?R)}", "{1, {2, 3}} {4, 5}")
regex.findall(r"({(:[^{}]*|?R)})", "{1, {2, 3}} {4, 5}")
regex.findall(r"({(:.*|(?R))*})", "{1, {2, 3}} {4, 5}")
regex.findall(r"({(:.*)|(?R)*})", "{1, {2, 3}} {4, 5}")
regex.findall(r"({(:[^{}])|(?R)})", "{1, {2, 3}} {4, 5}")

but I'm scuppered by either [] or error: too much backtracking.

Is it possible to extract match objects for the outer parenthesis using regex's recursion?

Obviously, I run the risk of being shot down with:

don't parse html with regex
do this with pyparse
write a proper lexer & parser e.g. using ply

I want to emphasise that this is about how to use the recursive pattern (which, if my understanding is correct, takes us outside of regular language parsing, so it may actually be possible!). If it can be done, this ought to be a cleaner solution.

Answer 1

Accepted answer by Casimir et Hippolyte

The pattern is:

{((?>[^{}]+|(?R))*)}

You can see this works for your example:

regex.findall("{((?>[^{}]+|(?R))*)}", "{1, {2, 3}} {4, 5}")
# ['1, {2, 3}', '4, 5']
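
Since the question asks for match objects rather than strings, the same pattern can be used with finditer. This is a minimal sketch assuming the third-party regex package is installed; the helper name outer_braces is illustrative only.

```python
import regex  # third-party package (pip install regex); stdlib re has no (?R)

# Pattern from the answer above; (?R) recurses into the whole pattern.
OUTER = regex.compile(r"{((?>[^{}]+|(?R))*)}")

def outer_braces(text):
    """Return (start, end, body) for each outermost balanced {...} group."""
    return [(m.start(), m.end(), m.group(1)) for m in OUTER.finditer(text)]

print(outer_braces("{1, {2, 3}} {4, 5}"))
# [(0, 11, '1, {2, 3}'), (12, 18, '4, 5')]
```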

Explanation:

The m part needs to exclude the brackets. An atomic group is needed if you want both to allow a quantifier on [^{}] and to repeat the group without catastrophic backtracking. To be clearer: if the last closing curly bracket is missing, the regex engine will backtrack atomic group by atomic group instead of character by character. To drive the point home, you can make the quantifier possessive, like this: {((?>[^{}]+|(?R))*+)} (or {((?:[^{}]+|(?R))*+)}, since the atomic group is then no longer needed).

The atomic group (?>....) and the possessive quantifiers ?+, *+, ++ are two sides of the same feature. This feature forbids the regex engine from backtracking inside the group of characters, which becomes an "atom" (something you can't divide into smaller parts).

The basic examples are the following two patterns, which always fail for the string aaaaaaaaaab:

(?>a+)ab
a++ab

that is:

regex.match("a++ab", "aaaaaaaaaab")
regex.match("(?>a+)ab", "aaaaaaaaaab")

When you use (?:a+) or a+, the regex engine (by default) records, in advance, all backtracking positions for all characters. But when you use an atomic group or a possessive quantifier, these backtracking positions are no longer recorded (except for the beginning of the group). So when backtracking occurs, the last "a" character can't be given back. Only the entire group can be given back.
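
The difference can be checked directly. The sketch below is an illustration, not part of the original answer: it prefers the regex package and, as an assumption about the reader's environment, falls back to the standard re module, which accepts atomic groups and possessive quantifiers only from Python 3.11 onward. The helper name matches is made up for this example.

```python
try:
    import regex as rx  # third-party package discussed in this thread
except ImportError:
    import re as rx     # fallback; needs Python 3.11+ for (?>...) and a++

TEXT = "aaaaaaaaaab"

def matches(pattern, text=TEXT):
    """True if the pattern matches at the start of text."""
    return rx.match(pattern, text) is not None

# Plain greedy a+: takes all ten a's, then gives one back so "ab" can match.
print(matches(r"a+ab"))      # True

# Atomic group and possessive quantifier: nothing is given back, so the
# literal "a" after the group has nothing left to match.
print(matches(r"(?>a+)ab"))  # False
print(matches(r"a++ab"))     # False
```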

[EDIT]: the pattern can be written in a more efficient way if you use an "unrolled" subpattern to describe the content between brackets:

{([^{}]*+(?:(?R)[^{}]*)*+)}
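
A quick way to gain confidence in the unrolled form is to compare it with the atomic-group version on a few inputs. This is only a sanity check under the assumption that the regex package is installed; the sample strings are made up for illustration, and it does not measure the efficiency difference.

```python
import regex  # third-party package (pip install regex)

ATOMIC = regex.compile(r"{((?>[^{}]+|(?R))*)}")
UNROLLED = regex.compile(r"{([^{}]*+(?:(?R)[^{}]*)*+)}")

# Illustrative balanced inputs, not taken from the original post.
samples = ["{1, {2, 3}} {4, 5}", "{}", "{a{b{c}}d}"]

# Both patterns should report the same outer groups on every sample.
same = all(ATOMIC.findall(s) == UNROLLED.findall(s) for s in samples)
print(same)

print(UNROLLED.findall("{1, {2, 3}} {4, 5}"))
# ['1, {2, 3}', '4, 5']
```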

Answer 2

Answer by Sam

I was able to do this without any problem using the b(?:m|(?R))*e syntax:

{((?:[^{}]|(?R))*)}

Demo

I think the key difference from what you were attempting is that the repetition doesn't go on m, but on the entire (?:m|(?R)) group. This is what allows the recursion with the (?R) reference.
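
To make the template concrete, here is a minimal sketch that builds b(?:m|(?R))*e from its three parts, with the repetition on the whole group and an extra capture around it. The helper name balanced and the parenthesis example are hypothetical additions for illustration; the regex package is assumed to be installed.

```python
import regex  # third-party package (pip install regex)

def balanced(b, m, e):
    """Build b((?:m|(?R))*)e; group 1 captures what lies between b and e.

    b, m and e are regex fragments, so the caller escapes them as needed.
    """
    return regex.compile(f"{b}((?:{m}|(?R))*){e}")

braces = balanced(r"\{", r"[^{}]", r"\}")
parens = balanced(r"\(", r"[^()]", r"\)")

print(braces.findall("{1, {2, 3}} {4, 5}"))  # ['1, {2, 3}', '4, 5']
print(parens.findall("(a (b) c) (d)"))       # ['a (b) c', 'd']
```

Here m matches a single character, so the star on the group is what consumes runs of ordinary characters as well as nested constructs.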
