在 Python 中用正则表达式匹配嵌套结构

Question

提问by Victor Yan

I seem to remember that Regular Expressions in DotNet have a special mechanism that allows for the correct matching of nested structures, like the grouping in "( (a ( ( c ) b ) ) ( d ) e )".

我似乎记得 DotNet 中的正则表达式有一种特殊的机制，可以正确匹配嵌套结构，例如“ ( (a ( ( c ) b ) ) ( d ) e )”中的分组。

What is the python equivalent of this feature? Can this be achieved using regular expressions with some workaround? (Though it seems to be the sort of problem that current implementations of regex aren't designed for)

这个特性的 Python 等价物是什么？这可以通过一些解决方法使用正则表达式来实现吗？（虽然这似乎是正则表达式的当前实现并非设计用于的那种问题）

Answer 1

回答by Svante

Regular expressions cannotparse nested structures. Nested structures are not regular, by definition. They cannot be constructed by a regular grammar, and they cannot be parsed by a finite state automaton (a regular expression can be seen as a shorthand notation for an FSA).

正则表达式无法解析嵌套结构。根据定义，嵌套结构是不规则的。它们不能由正则文法构造，也不能由有限状态自动机解析（正则表达式可以看作是 FSA 的速记符号）。

Today's "regex" engines sometimes support some limited "nesting" constructs, but from a technical standpoint, they shouldn't be called "regular" anymore.

今天的“正则表达式”引擎有时支持一些有限的“嵌套”结构，但从技术角度来看，它们不应再被称为“常规”。

Answer 2

回答by ars

You can't do this generally using Python regular expressions. (.NET regular expressions have been extended with "balancing groups" which is what allows nested matches.)

您通常无法使用 Python 正则表达式执行此操作。（.NET 正则表达式已扩展为“平衡组”，允许嵌套匹配。）

However, PyParsing is a very nice package for this type of thing:

然而，PyParsing 对于这种类型的东西来说是一个非常好的包：

from pyparsing import nestedExpr

data = "( (a ( ( c ) b ) ) ( d ) e )"
print nestedExpr().parseString(data).asList()

The output is:

输出是：

[[['a', [['c'], 'b']], ['d'], 'e']]

回答by unutbu

Edit:falsetru's nested parser, which I've slightly modified to accept arbitrary regex patterns to specify delimiters and item separators, is faster and simpler than my original re.Scannersolution:

编辑：falsetru 的嵌套解析器，我稍微修改了它以接受任意正则表达式模式来指定分隔符和项目分隔符，比我的原始re.Scanner解决方案更快更简单：

import re

def parse_nested(text, left=r'[(]', right=r'[)]', sep=r','):
    """ https://stackoverflow.com/a/17141899/190597 (falsetru) """
    pat = r'({}|{}|{})'.format(left, right, sep)
    tokens = re.split(pat, text)
    stack = [[]]
    for x in tokens:
        if not x or re.match(sep, x):
            continue
        if re.match(left, x):
            # Nest a new list inside the current list
            current = []
            stack[-1].append(current)
            stack.append(current)
        elif re.match(right, x):
            stack.pop()
            if not stack:
                raise ValueError('error: opening bracket is missing')
        else:
            stack[-1].append(x)
    if len(stack) > 1:
        print(stack)
        raise ValueError('error: closing bracket is missing')
    return stack.pop()

text = "a {{c1::group {{c2::containing::HINT}} a few}} {{c3::words}} or three"

print(parse_nested(text, r'\s*{{', r'}}\s*'))

yields

产量

['a', ['c1::group', ['c2::containing::HINT'], 'a few'], ['c3::words'], 'or three']

Nested structures can not be matched with Python regex alone, but it is remarkably easy to build a basic parser (which can handle nested structures) using re.Scanner:

嵌套结构不能与Python匹配正则表达式单独，但它是非常容易建立一个基本解析器使用（其可以处理嵌套结构）re.Scanner：

import re

class Node(list):
    def __init__(self, parent=None):
        self.parent = parent

class NestedParser(object):
    def __init__(self, left='\(', right='\)'):
        self.scanner = re.Scanner([
            (left, self.left),
            (right, self.right),
            (r"\s+", None),
            (".+?(?=(%s|%s|$))" % (right, left), self.other),
        ])
        self.result = Node()
        self.current = self.result

    def parse(self, content):
        self.scanner.scan(content)
        return self.result

    def left(self, scanner, token):
        new = Node(self.current)
        self.current.append(new)
        self.current = new

    def right(self, scanner, token):
        self.current = self.current.parent

    def other(self, scanner, token):
        self.current.append(token.strip())

It can be used like this:

它可以像这样使用：

p = NestedParser()
print(p.parse("((a+b)*(c-d))"))
# [[['a+b'], '*', ['c-d']]]

p = NestedParser()
print(p.parse("( (a ( ( c ) b ) ) ( d ) e )"))
# [[['a', [['c'], 'b']], ['d'], 'e']]

By default NestedParsermatches nested parentheses. You can pass other regex to match other nested patterns, such as brackets, []. For example,

默认NestedParser匹配嵌套括号。您可以传递其他正则表达式以匹配其他嵌套模式，例如方括号、[]. 例如，

p = NestedParser('\[', '\]')
result = (p.parse("Lorem ipsum dolor sit amet [@a xxx yyy [@b xxx yyy [@c xxx yyy]]] lorem ipsum sit amet"))
# ['Lorem ipsum dolor sit amet', ['@a xxx yyy', ['@b xxx yyy', ['@c xxx yyy']]],
# 'lorem ipsum sit amet']

p = NestedParser('<foo>', '</foo>')
print(p.parse("<foo>BAR<foo>BAZ</foo></foo>"))
# [['BAR', ['BAZ']]]

Of course, pyparsingcan do a whole lot more than the above code can. But for this single purpose, the above NestedParseris about 5x faster for small strings:

当然，pyparsing可以做的比上面的代码做的更多。但是对于这个单一的目的，NestedParser对于小字符串，上面的速度大约快 5 倍：

In [27]: import pyparsing as pp

In [28]: data = "( (a ( ( c ) b ) ) ( d ) e )"    

In [32]: %timeit pp.nestedExpr().parseString(data).asList()
1000 loops, best of 3: 1.09 ms per loop

In [33]: %timeit NestedParser().parse(data)
1000 loops, best of 3: 234 us per loop

and around 28x faster for larger strings:

对于较大的字符串，速度大约快 28 倍：

In [44]: %timeit pp.nestedExpr().parseString('({})'.format(data*10000)).asList()
1 loops, best of 3: 8.27 s per loop

In [45]: %timeit NestedParser().parse('({})'.format(data*10000))
1 loops, best of 3: 297 ms per loop

Answer 4

回答by Il-Bhima

Python doesn't support recursion in regular expressions. So equivalents to .NET's balancing groups or PCRE regex in Perl are not immediately possible in Python.

Python 不支持正则表达式中的递归。因此，在 Python 中无法立即实现 Perl 中的 .NET 平衡组或 PCRE 正则表达式。

Like you said yourself: this is NOT a problem you should really be solving with a single regex.

就像您自己说的：这不是您真正应该使用单个正则表达式解决的问题。

Answer 5

回答by edef

I'd recommend removing the nesting from the regex itself, looping through the results and performing regexes on that.

我建议从正则表达式本身中删除嵌套，遍历结果并对其执行正则表达式。

Answer 6

回答by Vinay Sajip

Are you talking about recursion? It's not clear from your question. An example:

你说的是递归吗？从你的问题看不清楚。一个例子：

ActivePython 2.6.1.1 (ActiveState Software Inc.) based on
Python 2.6.1 (r261:67515, Dec  5 2008, 13:58:38) [MSC v.1500 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> p = re.compile(r"((\w+((\d+)[.;]))(\s+)\d)")
>>> m = p.match("Fred99. \t9")
>>> m
<_sre.SRE_Match object at 0x00454F80>
>>> m.groups()
('Fred99. \t9', 'Fred99.', '9.', '9', ' \t')

This shows matching of nested groups. The numbering of groups depends on the order in which their opening parenthesis occurs in the pattern.

这显示了嵌套组的匹配。组的编号取决于它们的左括号在模式中出现的顺序。

在 Python 中用正则表达式匹配嵌套结构

提问by Victor Yan

回答by Svante

回答by ars

回答by unutbu

回答by Il-Bhima

回答by edef

回答by Vinay Sajip

相关推荐

最近更新

标签

在 Python 中用正则表达式匹配嵌套结构

提问by Victor Yan

回答by Svante

回答by ars

回答by unutbu

回答by Il-Bhima

回答by edef

回答by Vinay Sajip

相关推荐

python 可调用模块

python 具有系统时区设置的 Django 与用户的个人时区

python python十进制比较

Python NotImplemented 常量

相关推荐

最近更新

标签