在 Python 中用正则表达式匹配嵌套结构
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/1099178/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Matching Nested Structures With Regular Expressions in Python
提问by Victor Yan
I seem to remember that Regular Expressions in DotNet have a special mechanism that allows for the correct matching of nested structures, like the grouping in "( (a ( ( c ) b ) ) ( d ) e )
".
我似乎记得 DotNet 中的正则表达式有一种特殊的机制,可以正确匹配嵌套结构,例如“ ( (a ( ( c ) b ) ) ( d ) e )
”中的分组。
What is the python equivalent of this feature? Can this be achieved using regular expressions with some workaround? (Though it seems to be the sort of problem that current implementations of regex aren't designed for)
这个特性的 Python 等价物是什么?这可以通过一些解决方法使用正则表达式来实现吗?(虽然这似乎是正则表达式的当前实现并非设计用于的那种问题)
回答by Svante
Regular expressions cannotparse nested structures. Nested structures are not regular, by definition. They cannot be constructed by a regular grammar, and they cannot be parsed by a finite state automaton (a regular expression can be seen as a shorthand notation for an FSA).
正则表达式无法解析嵌套结构。根据定义,嵌套结构是不规则的。它们不能由正则文法构造,也不能由有限状态自动机解析(正则表达式可以看作是 FSA 的速记符号)。
Today's "regex" engines sometimes support some limited "nesting" constructs, but from a technical standpoint, they shouldn't be called "regular" anymore.
今天的“正则表达式”引擎有时支持一些有限的“嵌套”结构,但从技术角度来看,它们不应再被称为“常规”。
回答by ars
You can't do this generally using Python regular expressions. (.NET regular expressions have been extended with "balancing groups" which is what allows nested matches.)
您通常无法使用 Python 正则表达式执行此操作。(.NET 正则表达式已扩展为“平衡组”,允许嵌套匹配。)
However, PyParsing is a very nice package for this type of thing:
然而,PyParsing 对于这种类型的东西来说是一个非常好的包:
from pyparsing import nestedExpr
data = "( (a ( ( c ) b ) ) ( d ) e )"
print nestedExpr().parseString(data).asList()
The output is:
输出是:
[[['a', [['c'], 'b']], ['d'], 'e']]
More on PyParsing:
更多关于 PyParsing:
回答by unutbu
Edit:falsetru's nested parser, which I've slightly modified to accept arbitrary regex patterns to specify delimiters and item separators, is faster and simpler than my original re.Scanner
solution:
编辑:falsetru 的嵌套解析器,我稍微修改了它以接受任意正则表达式模式来指定分隔符和项目分隔符,比我的原始re.Scanner
解决方案更快更简单:
import re
def parse_nested(text, left=r'[(]', right=r'[)]', sep=r','):
""" https://stackoverflow.com/a/17141899/190597 (falsetru) """
pat = r'({}|{}|{})'.format(left, right, sep)
tokens = re.split(pat, text)
stack = [[]]
for x in tokens:
if not x or re.match(sep, x):
continue
if re.match(left, x):
# Nest a new list inside the current list
current = []
stack[-1].append(current)
stack.append(current)
elif re.match(right, x):
stack.pop()
if not stack:
raise ValueError('error: opening bracket is missing')
else:
stack[-1].append(x)
if len(stack) > 1:
print(stack)
raise ValueError('error: closing bracket is missing')
return stack.pop()
text = "a {{c1::group {{c2::containing::HINT}} a few}} {{c3::words}} or three"
print(parse_nested(text, r'\s*{{', r'}}\s*'))
yields
产量
['a', ['c1::group', ['c2::containing::HINT'], 'a few'], ['c3::words'], 'or three']
Nested structures can not be matched with Python regex alone, but it is remarkably easy to build a basic parser (which can handle nested structures) using re.Scanner:
嵌套结构不能与Python匹配正则表达式单独,但它是非常容易建立一个基本解析器使用(其可以处理嵌套结构)re.Scanner:
import re
class Node(list):
def __init__(self, parent=None):
self.parent = parent
class NestedParser(object):
def __init__(self, left='\(', right='\)'):
self.scanner = re.Scanner([
(left, self.left),
(right, self.right),
(r"\s+", None),
(".+?(?=(%s|%s|$))" % (right, left), self.other),
])
self.result = Node()
self.current = self.result
def parse(self, content):
self.scanner.scan(content)
return self.result
def left(self, scanner, token):
new = Node(self.current)
self.current.append(new)
self.current = new
def right(self, scanner, token):
self.current = self.current.parent
def other(self, scanner, token):
self.current.append(token.strip())
It can be used like this:
它可以像这样使用:
p = NestedParser()
print(p.parse("((a+b)*(c-d))"))
# [[['a+b'], '*', ['c-d']]]
p = NestedParser()
print(p.parse("( (a ( ( c ) b ) ) ( d ) e )"))
# [[['a', [['c'], 'b']], ['d'], 'e']]
By default NestedParser
matches nested parentheses. You can pass other regex to match other nested patterns, such as brackets, []
. For example,
默认NestedParser
匹配嵌套括号。您可以传递其他正则表达式以匹配其他嵌套模式,例如方括号、[]
. 例如,
p = NestedParser('\[', '\]')
result = (p.parse("Lorem ipsum dolor sit amet [@a xxx yyy [@b xxx yyy [@c xxx yyy]]] lorem ipsum sit amet"))
# ['Lorem ipsum dolor sit amet', ['@a xxx yyy', ['@b xxx yyy', ['@c xxx yyy']]],
# 'lorem ipsum sit amet']
p = NestedParser('<foo>', '</foo>')
print(p.parse("<foo>BAR<foo>BAZ</foo></foo>"))
# [['BAR', ['BAZ']]]
Of course, pyparsing
can do a whole lot more than the above code can. But for this single purpose, the above NestedParser
is about 5x faster for small strings:
当然,pyparsing
可以做的比上面的代码做的更多。但是对于这个单一的目的,NestedParser
对于小字符串,上面的速度大约快 5 倍:
In [27]: import pyparsing as pp
In [28]: data = "( (a ( ( c ) b ) ) ( d ) e )"
In [32]: %timeit pp.nestedExpr().parseString(data).asList()
1000 loops, best of 3: 1.09 ms per loop
In [33]: %timeit NestedParser().parse(data)
1000 loops, best of 3: 234 us per loop
and around 28x faster for larger strings:
对于较大的字符串,速度大约快 28 倍:
In [44]: %timeit pp.nestedExpr().parseString('({})'.format(data*10000)).asList()
1 loops, best of 3: 8.27 s per loop
In [45]: %timeit NestedParser().parse('({})'.format(data*10000))
1 loops, best of 3: 297 ms per loop
回答by Il-Bhima
Python doesn't support recursion in regular expressions. So equivalents to .NET's balancing groups or PCRE regex in Perl are not immediately possible in Python.
Python 不支持正则表达式中的递归。因此,在 Python 中无法立即实现 Perl 中的 .NET 平衡组或 PCRE 正则表达式。
Like you said yourself: this is NOT a problem you should really be solving with a single regex.
就像您自己说的:这不是您真正应该使用单个正则表达式解决的问题。
回答by edef
I'd recommend removing the nesting from the regex itself, looping through the results and performing regexes on that.
我建议从正则表达式本身中删除嵌套,遍历结果并对其执行正则表达式。
回答by Vinay Sajip
Are you talking about recursion? It's not clear from your question. An example:
你说的是递归吗?从你的问题看不清楚。一个例子:
ActivePython 2.6.1.1 (ActiveState Software Inc.) based on
Python 2.6.1 (r261:67515, Dec 5 2008, 13:58:38) [MSC v.1500 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> p = re.compile(r"((\w+((\d+)[.;]))(\s+)\d)")
>>> m = p.match("Fred99. \t9")
>>> m
<_sre.SRE_Match object at 0x00454F80>
>>> m.groups()
('Fred99. \t9', 'Fred99.', '9.', '9', ' \t')
This shows matching of nested groups. The numbering of groups depends on the order in which their opening parenthesis occurs in the pattern.
这显示了嵌套组的匹配。组的编号取决于它们的左括号在模式中出现的顺序。