Original question: http://stackoverflow.com/questions/492716/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Reversing a regular expression in Python
Asked by Rory
I want to reverse a regular expression. I.e. given a regular expression, I want to produce any string that will match that regex.
I know how to do this from a theoretical computer science background using a finite state machine, but I just want to know if someone has already written a library to do this. :)
I'm using Python, so I'd like a Python library.
To reiterate, I only want one string that will match the regex. Things like "." or ".*" would make an infinite number of strings match the regex, but I don't care about all options.
I'm willing for this library to only work on a certain subset of regex.
Answered by bjmc
Somebody else had a similar (duplicate?) question here, and I'd like to offer a little helper library for generating random strings with Python that I've been working on.
It includes a method, xeger(), that allows you to create a string from a regex:
>>> import rstr
>>> rstr.xeger(r'[A-Z]\d[A-Z] \d[A-Z]\d')
u'M5R 2W4'
Right now, it works with most basic regular expressions, but I'm sure it could be improved.
Answered by bjmc
Although I don't see much sense in this, here goes:
    # Python 2 code: it relies on the old sre_parse, whose opcodes are
    # plain strings such as 'literal' and 'max_repeat'.
    import re
    import string

    def traverse(tree):
        retval = ''
        for node in tree:
            if node[0] == 'any':
                retval += 'x'
            elif node[0] == 'at':
                pass  # anchors consume no characters
            elif node[0] in ['min_repeat', 'max_repeat']:
                # repeat the sub-pattern the minimum number of times
                retval += traverse(node[1][2]) * node[1][0]
            elif node[0] == 'in':
                if node[1][0][0] == 'negate':
                    # negated class: pick the first letter that is not excluded
                    letters = list(string.ascii_letters)
                    for part in node[1][1:]:
                        if part[0] == 'literal':
                            letters.remove(chr(part[1]))
                        else:
                            for letter in range(part[1][0], part[1][1]+1):
                                letters.remove(chr(letter))
                    retval += letters[0]
                else:
                    # plain class: use its first member
                    if node[1][0][0] == 'range':
                        retval += chr(node[1][0][1][0])
                    else:
                        retval += chr(node[1][0][1])
            elif node[0] == 'not_literal':
                if node[1] == 120:  # 120 is ord('x')
                    retval += 'y'
                else:
                    retval += 'x'
            elif node[0] == 'branch':
                retval += traverse(node[1][1][0])  # first alternative only
            elif node[0] == 'subpattern':
                retval += traverse(node[1][1])
            elif node[0] == 'literal':
                retval += chr(node[1])
        return retval

    # `regex` is the pattern string to reverse; define it before this line.
    print traverse(re.sre_parse.parse(regex).data)
I took everything from the Regular Expression Syntax up to groups -- this seems like a reasonable subset -- and I ignored some details, like line endings. Error handling, etc. is left as an exercise to the reader.
Of the 12 special characters in a regex, we can ignore 6 completely (2 even with the atom they apply to), 4.5 lead to a trivial replacement and 1.5 make us actually think.
What comes out of this is not too terribly interesting, I think.
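The traversal above compares opcodes against strings, which matches the Python 2 parser. Below is a minimal sketch of the same idea for Python 3, where the opcodes are named integer constants. It assumes the undocumented parser module is importable as `re._parser` (Python 3.11+) or `sre_parse` (earlier releases). The names `CANDIDATES`, `CATEGORY_TESTS`, `pick_from_class`, `sample`, and `one_match` are hypothetical helpers, not part of any library, and only a subset of regex syntax is covered.

```python
import string

try:  # Python 3.11+ ships the parser as re._parser (undocumented)
    from re import _parser as sre_parse
except ImportError:  # earlier Python 3 releases
    import sre_parse

# Characters tried, in order, when a character class needs a sample.
CANDIDATES = string.ascii_letters + string.digits + string.punctuation + " "

# Rough predicates for the common categories. This is an assumption of the
# sketch: the real engine also takes flags such as re.ASCII into account.
CATEGORY_TESTS = {
    sre_parse.CATEGORY_DIGIT: str.isdigit,
    sre_parse.CATEGORY_NOT_DIGIT: lambda c: not c.isdigit(),
    sre_parse.CATEGORY_SPACE: str.isspace,
    sre_parse.CATEGORY_NOT_SPACE: lambda c: not c.isspace(),
    sre_parse.CATEGORY_WORD: lambda c: c.isalnum() or c == "_",
    sre_parse.CATEGORY_NOT_WORD: lambda c: not (c.isalnum() or c == "_"),
}


def pick_from_class(items):
    """Return the first candidate character accepted by a [...] class."""
    negate = any(op == sre_parse.NEGATE for op, _ in items)
    for ch in CANDIDATES:
        hit = False
        for op, av in items:
            if op == sre_parse.LITERAL:
                hit = hit or ord(ch) == av
            elif op == sre_parse.RANGE:
                hit = hit or av[0] <= ord(ch) <= av[1]
            elif op == sre_parse.CATEGORY:
                hit = hit or CATEGORY_TESTS[av](ch)
        if hit != negate:
            return ch
    raise ValueError("no candidate character fits this class")


def sample(tree):
    """Build one string for a parsed pattern, using minimal repeat counts."""
    out = []
    for op, av in tree:
        if op == sre_parse.LITERAL:
            out.append(chr(av))
        elif op == sre_parse.ANY:
            out.append("x")
        elif op == sre_parse.NOT_LITERAL:
            out.append("y" if av == ord("x") else "x")
        elif op == sre_parse.AT:
            pass  # anchors consume no characters
        elif op in (sre_parse.MAX_REPEAT, sre_parse.MIN_REPEAT):
            low, _high, sub = av
            out.append(sample(sub) * low)
        elif op == sre_parse.BRANCH:
            out.append(sample(av[1][0]))  # first alternative only
        elif op == sre_parse.SUBPATTERN:
            out.append(sample(av[-1]))  # the group body is the last field
        elif op == sre_parse.IN:
            out.append(pick_from_class(av))
        else:
            raise NotImplementedError("unsupported construct: %r" % (op,))
    return "".join(out)


def one_match(pattern):
    """Parse `pattern` and return a single string that should match it."""
    return sample(sre_parse.parse(pattern))
```

With these assumptions, `one_match(r'[A-Z]\d[A-Z] \d[A-Z]\d')` gives `'A0A 0A0'`. Checking the result with `re.fullmatch` is a cheap safeguard, since constructs such as lookarounds and word boundaries are not handled.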
Answered by Hans Nowak
I don't know of any module to do this. If you don't find anything like this in the Cookbook or PyPI, you could try rolling your own, using the (undocumented) re.sre_parse module. This might help get you started:
In [1]: import re
In [2]: a = re.sre_parse.parse("[abc]+[def]*\d?z")
In [3]: a
Out[3]: [('max_repeat', (1, 65535, [('in', [('literal', 97), ('literal', 98), ('literal', 99)])])), ('max_repeat', (0, 65535, [('in', [('literal', 100), ('literal', 101), ('literal', 102)])])), ('max_repeat', (0, 1, [('in', [('category', 'category_digit')])])), ('literal', 122)]
In [4]: eval(str(a))
Out[4]:
[('max_repeat',
  (1, 65535, [('in', [('literal', 97), ('literal', 98), ('literal', 99)])])),
 ('max_repeat',
  (0,
   65535,
   [('in', [('literal', 100), ('literal', 101), ('literal', 102)])])),
 ('max_repeat', (0, 1, [('in', [('category', 'category_digit')])])),
 ('literal', 122)]
In [5]: a.dump()
max_repeat 1 65535
  in
    literal 97
    literal 98
    literal 99
max_repeat 0 65535
  in
    literal 100
    literal 101
    literal 102
max_repeat 0 1
  in
    category category_digit
literal 122
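The session above is from Python 2. As a rough sketch for Python 3, the same parse tree can be inspected as follows, assuming the undocumented parser is importable as `re._parser` (Python 3.11+) or `sre_parse` (earlier releases). The helper name `top_level_ops` is hypothetical.

```python
try:  # Python 3.11+ ships the parser as re._parser (undocumented)
    from re import _parser as sre_parse
except ImportError:  # earlier Python 3 releases
    import sre_parse


def top_level_ops(pattern):
    """Return the opcode names of the top-level nodes of a parsed pattern."""
    return [op.name for op, _ in sre_parse.parse(pattern)]


print(top_level_ops(r"[abc]+[def]*\d?z"))
# ['MAX_REPEAT', 'MAX_REPEAT', 'MAX_REPEAT', 'LITERAL']
```

In Python 3 the opcodes are named integer constants rather than strings, so code that walks the tree would compare against constants such as `sre_parse.LITERAL` instead of `'literal'`.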
Answered by Andrew Cox
While the other answers use the re engine to parse out the elements, I have whipped up my own parser that reads the regex and returns a minimal string that would match it. (Note: it doesn't handle [^ads], fancy grouping constructs, or start/end-of-line special characters.) I can supply the unit tests if you really like :)
    import re

    class REParser(object):
        """Parses an RE and gives the least greedy value that would match it"""

        def parse(self, parseInput):
            re.compile(parseInput)  # try to parse to see if it is a valid RE
            retval = ""
            stack = list(parseInput)
            lastelement = ""
            while stack:
                element = stack.pop(0)  # read from front
                if element == "\\":
                    # escape sequence: map character-class shorthands to a sample char
                    element = stack.pop(0)
                    element = element.replace("d", "0").replace("D", "a").replace("w", "a").replace("W", " ")
                elif element in ["?", "*"]:
                    # zero repetitions are allowed, so drop the previous element
                    lastelement = ""
                    element = ""
                elif element == ".":
                    element = "a"
                elif element == "+":
                    # one repetition is enough, so keep the previous element once
                    element = ""
                elif element == "{":
                    arg = self._consumeTo(stack, "}")
                    arg = arg[:-1]  # dump the }
                    arg = arg.split(",")[0]  # dump the possible ,
                    lastelement = lastelement * int(arg)
                    element = ""
                elif element == "[":
                    element = self._consumeTo(stack, "]")[0]  # just use the first char in set
                    if element == "]":  # this is the odd case of []<something>]
                        self._consumeTo(stack, "]")  # throw rest away and use ] as first element
                elif element == "|":
                    break  # once you reach a | you have all you need to match
                elif element == "(":
                    arg = self._consumeTo(stack, ")")
                    element = self.parse(arg[:-1])
                retval += lastelement
                lastelement = element
            retval += lastelement  # complete the string with the last char
            return retval

        def _consumeTo(self, stackToConsume, endElement):
            """Pop characters from the front until endElement has been consumed."""
            retval = ""
            while not retval.endswith(endElement):
                retval += stackToConsume.pop(0)
            return retval
Answered by Adam Rosenfield
Unless your regex is extremely simple (i.e. no stars or pluses), there will be infinitely many strings which match it. If your regex only involves concatenation and alternation, then you can expand each alternation into all of its possibilities, e.g. (foo|bar)(baz|quux) can be expanded into the list ['foobaz', 'fooquux', 'barbaz', 'barquux'].
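The expansion described above is a Cartesian product of the alternatives in each group. A minimal sketch, assuming the alternations have already been split into lists of literal strings by hand (the function name `expand` is hypothetical):

```python
from itertools import product


def expand(groups):
    """Concatenate one alternative from each group, in every combination."""
    return ["".join(parts) for parts in product(*groups)]


# (foo|bar)(baz|quux) written as a list of alternative lists
print(expand([["foo", "bar"], ["baz", "quux"]]))
# ['foobaz', 'fooquux', 'barbaz', 'barquux']
```

The number of results is the product of the group sizes, so this only stays practical for patterns with a few small alternations.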
Answered by Greg Hewgill
I haven't seen a Python module to do this, but I did see a (partial) implementation in Perl: Regexp::Genex. From the module description, it sounds like the implementation relies on internal details of Perl's regular expression engine, so it may not be useful even from a theoretical point of view (I haven't investigated the implementation, I'm just going by the comments in the documentation).
I think doing what you propose in general is a hard problem and may require the use of nondeterministic programming techniques. A start would be to parse the regular expression and build a parse tree, then traverse the tree and build sample string(s) as you go. Challenging bits will probably be things like backreferences and avoiding infinite loops in your implementation.
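One way to approach the backreference problem mentioned above is to remember the text generated for each capturing group and reuse it when a backreference node appears. The sketch below assumes Python 3 and the undocumented parser module (`re._parser` on 3.11+, `sre_parse` before that). It handles only literals, groups, alternation, repeats, and backreferences, and the names `sample_with_groups` and `one_match_with_backrefs` are hypothetical.

```python
try:  # Python 3.11+ ships the parser as re._parser (undocumented)
    from re import _parser as sre_parse
except ImportError:  # earlier Python 3 releases
    import sre_parse


def sample_with_groups(tree, groups):
    """Build one sample string, recording each capturing group's text."""
    out = []
    for op, av in tree:
        if op == sre_parse.LITERAL:
            text = chr(av)
        elif op == sre_parse.SUBPATTERN:
            group_id = av[0]  # None for a non-capturing group
            text = sample_with_groups(av[-1], groups)
            if group_id is not None:
                groups[group_id] = text
        elif op == sre_parse.GROUPREF:
            text = groups[av]  # reuse exactly what the group produced
        elif op == sre_parse.BRANCH:
            text = sample_with_groups(av[1][0], groups)  # first alternative
        elif op in (sre_parse.MAX_REPEAT, sre_parse.MIN_REPEAT):
            low, _high, sub = av
            # The minimum count is always finite, so unbounded repeats
            # such as * and + cannot cause an endless loop here.
            text = "".join(sample_with_groups(sub, groups) for _ in range(low))
        else:
            raise NotImplementedError("unsupported construct: %r" % (op,))
        out.append(text)
    return "".join(out)


def one_match_with_backrefs(pattern):
    return sample_with_groups(sre_parse.parse(pattern), {})


print(one_match_with_backrefs(r"(ab|cd)-\1"))  # ab-ab
```

Always taking the first alternative and the minimum repeat count keeps the traversal deterministic, at the cost of never exploring other matching strings.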
Answered by Sjoerd
Exrex can create strings from regexes.
Exrex is a command line tool and python module that generates all - or random - matching strings to a given regular expression and more.
Example:
>>> import exrex
>>> exrex.getone(r'\d{4}-\d{4}-\d{4}-[0-9]{4}')
'3096-7886-2834-5671'