Python 为正则表达式编写解析器

Question

提问by Chinmay Kanchi

Even after years of programming, I'm ashamed to say that I've never really fully grasped regular expressions. In general, when a problem calls for a regex, I can usually (after a bunch of referring to syntax) come up with an appropriate one, but it's a technique that I find myself using increasingly often.

即使经过多年的编程，我也很惭愧地说我从未真正完全掌握正则表达式。一般来说，当一个问题需要一个正则表达式时，我通常可以（在一堆参考语法之后）想出一个合适的，但我发现自己越来越频繁地使用这种技术。

So, to teach myself and understand regular expressions properly, I've decided to do what I always do when trying to learn something; i.e., try to write something ambitious that I'll probably abandon as soon as I feel I've learnt enough.

所以，为了自学和正确理解正则表达式，我决定做我在尝试学习时经常做的事情；即，尝试写一些雄心勃勃的东西，一旦我觉得我已经学到了足够多的东西，我可能会放弃。

To this end, I want to write a regular expression parser in Python. In this case, "learn enough" means that I want to implement a parser that can understand Perl's extended regex syntax completely. However, it doesn't have to be the most efficient parser or even necessarily usable in the real-world. It merely has to correctly match or fail to match a pattern in a string.

为此，我想用Python编写一个正则表达式解析器。在这种情况下，“足够学习”意味着我要实现一个可以完全理解 Perl 扩展正则表达式语法的解析器。然而，它不一定是最有效的解析器，甚至不一定在现实世界中可用。它只需要正确匹配或无法匹配字符串中的模式。

The question is, where do I start? I know almost nothing about how regexes are parsed and interpreted apart from the fact that it involves a finite state automaton in some way. Any suggestions for how to approach this rather daunting problem would be much appreciated.

问题是，我从哪里开始？除了正则表达式以某种方式涉及有限状态自动机这一事实之外，我对正则表达式的解析和解释几乎一无所知。关于如何解决这个相当令人生畏的问题的任何建议将不胜感激。

EDIT:I should clarify that while I'm going to implementthe regex parser in Python, I'm not overly fussed about what programming language the examples or articles are written in. As long as it's not in Brainfwor, I will probably understand enough of it to make it worth my while.

编辑：我应该澄清一下，虽然我将在 Python 中实现正则表达式解析器，但我并没有对示例或文章所用的编程语言过分大惊小怪。只要它不在 Brainfwor 中，我可能会理解得足够多让它值得我花时间。

Answer 1

采纳答案by Mark Byers

Writing an implementation of a regular expression engine is indeed a quite complex task.

编写正则表达式引擎的实现确实是一项相当复杂的任务。

But if you are interested in how to do it, even if you can't understand enough of the details to actually implement it, I would recommend that you at least look at this article:

但是如果你对如何去做感兴趣，即使你不能理解足够的细节来实际实现它，我建议你至少看看这篇文章：

Regular Expression Matching Can Be Simple And Fast(but is slow in Java, Perl, PHP, Python, Ruby, ...)

正则表达式匹配既简单又快速（但在 Java、Perl、PHP、Python、Ruby 中速度较慢……）

It explains how many of the popular programming languages implement regular expressions in a way that can be very slow for some regular expressions, and explains a slightly different method that is faster. The article includes some details of how the proposed implementation works, including some source code in C. It may be a bit heavy reading if you are just starting to learn regular expressions, but I think it is well worth knowing about the difference between the two approaches.

它解释了有多少流行的编程语言以对某些正则表达式来说可能非常慢的方式实现正则表达式，并解释了一种稍微不同的更快的方法。这篇文章包含了一些建议的实现如何工作的细节，包括一些 C 语言的源代码。如果你刚开始学习正则表达式，阅读可能有点繁重，但我认为了解两者之间的区别是非常值得的方法。

Answer 2

回答by Richard Fearn

There's an interesting (if slightly short) chapter in Beautiful Codeby Brian Kernighan, appropriately called "A Regular Expression Matcher". In it he discusses a simple matcher that can match literal characters, and the .^$*symbols.

Brian Kernighan在Beautiful Code 中有一个有趣的（如果略短）一章，恰当地称为“正则表达式匹配器”。在其中，他讨论了一个可以匹配文字字符和.^$*符号的简单匹配器。

Answer 3

回答by dhaffey

This papertakes an interesting approach. The implementation is given in Haskell, but it's been reimplemented in Pythonat least once.

本文采用了一种有趣的方法。实现是在 Haskell 中给出的，但它至少在 Python中被重新实现过一次。

Answer 4

回答by Steve314

I've already given a +1 to Mark Byers - but as far as I remember the paper doesn't really say that much about how regular expression matching works beyond explaining why one algorithm is bad and another much better. Maybe something in the links?

我已经给 Mark Byers 打了 +1 - 但据我所知，这篇论文并没有真正说明正则表达式匹配是如何工作的，只是解释了为什么一种算法不好而另一种算法更好。也许链接中有什么？

I'll focus on the good approach - creating finite automata. If you limit yourself to deterministic automata with no minimisation, this isn't really too difficult.

我将专注于好的方法 - 创建有限自动机。如果您将自己限制在没有最小化的确定性自动机，这并不是太困难。

What I'll (very quickly) describe is the approach taken in Modern Compiler Design.

我将（非常快地）描述的是Modern Compiler Design中采用的方法。

Imagine you have the following regular expression...

假设您有以下正则表达式...

a (b c)* d

The letters represent literal characters to match. The * is the usual zero-or-more repetitions match. The basic idea is to derive states based on dotted rules. State zero we'll take as the state where nothing has been matched yet, so the dot goes at the front...

字母代表要匹配的文字字符。* 是通常的零次或多次重复匹配。基本思想是根据虚线规则导出状态。状态零我们将作为尚未匹配任何内容的状态，所以点在前面......

0 : .a (b c)* d

The only possible match is 'a', so the next state we derive is...

唯一可能的匹配是“a”，所以我们得出的下一个状态是……

1 : a.(b c)* d

We now have two possibilities - match the 'b' (if there's at least one repeat of 'b c') or match the 'd' otherwise. Note - we are basically doing a digraph search here (either depth first or breadth first or whatever) but we are discovering the digraph as we search it. Assuming a breadth-first strategy, we'll need to queue one of our cases for later consideration, but I'll ignore that issue from here on. Anyway, we've discovered two new states...

我们现在有两种可能性 - 匹配 'b'（如果至少有一个重复的 'b c'）或匹配 'd' 否则。注意 - 我们在这里基本上是在进行有向图搜索（深度优先或广度优先或其他），但我们在搜索时发现了有向图。假设采用广度优先策略，我们需要将我们的一个案例排队以供以后考虑，但从这里开始我将忽略该问题。无论如何，我们已经发现了两个新状态......

2 : a (b.c)* d
3 : a (b c)* d.

State 3 is an end state (there may be more than one). For state 2, we can only match the 'c', but we need to be careful with the dot position afterwards. We get "a.(b c)* d" - which is the same as state 1, so we don't need a new state.

状态 3 是结束状态（可能不止一个）。对于状态 2，我们只能匹配 'c'，但是后面需要注意点的位置。我们得到 "a.(bc)* d" - 这与状态 1 相同，所以我们不需要新状态。

IIRC, the approach in Modern Compiler Design is to translate a rule when you hit an operator, in order to simplify the handling of the dot. State 1 would be transformed to...

IIRC，现代编译器设计中的方法是在您点击运算符时转换规则，以简化点的处理。状态 1 将转换为...

1 : a.b c (b c)* d
    a.d

That is, your next option is either to match the first repetition or to skip the repetition. The next states from this are equivalent to states 2 and 3. An advantage of this approach is that you can discard all your past matches (everything before the '.') as you only care about future matches. This typically gives a smaller state model (but not necessarily a minimal one).

也就是说，您的下一个选择是匹配第一次重复或跳过重复。此后的下一个状态相当于状态 2 和 3。这种方法的一个优点是您可以丢弃所有过去的匹配项（'.' 之前的所有内容），因为您只关心未来的匹配项。这通常会给出一个较小的状态模型（但不一定是最小的）。

EDITIf you do discard already matched details, your state description is a representation of the set of strings that can occur from this point on.

编辑如果您确实丢弃了已匹配的详细信息，则您的状态描述是从此时起可能出现的一组字符串的表示。

In terms of abstract algebra, this is a kind of set closure. An algebra is basically a set with one (or more) operators. Our set is of state descriptions, and our operators are our transitions (character matches). A closed set is one where applying any operator to any members in the set always produces another member that is in the set. The closure of a set is the mimimal larger set that is closed. So basically, starting with the obvious start state, we are constructing the minimal set of states that is closed relative to our set of transition operators - the minimal set of reachable states.

就抽象代数而言，这是一种集合闭包。代数基本上是具有一个（或多个）运算符的集合。我们的集合是状态描述，我们的操作符是我们的转换（字符匹配）。闭集是将任何运算符应用于集合中的任何成员总是会产生另一个在集合中的成员。一个集合的闭包是最小的大集合。所以基本上，从明显的开始状态开始，我们正在构建相对于我们的转换运算符集是封闭的最小状态集 - 最小可达状态集。

Minimal here refers to the closure process - there may be a smaller equivalent automata which is normally referred to as minimal.

最小在这里指的是闭合过程——可能有一个较小的等效自动机，通常称为最小。

With this basic idea in mind, it's not too difficult to say "if I have two state machines representing two sets of strings, how to I derive a third representing the union" (or intersection, or set difference...). Instead of dotted rules, your state representations will a current state (or set of current states) from each input automaton and perhaps additional details.

考虑到这个基本思想，说“如果我有两个状态机代表两组字符串，我如何导出代表联合的第三个状态机”（或交集，或集差......）并不难。您的状态表示将不是虚线规则，而是来自每个输入自动机的当前状态（或一组当前状态）以及可能的其他详细信息。

If your regular grammars are getting complex, you can minimise. The basic idea here is relatively simple. You group all your states into one equivalence class or "block". Then you repeatedly test whether you need to split blocks (the states aren't really equivalent) with respect to a particular transition type. If all states in a particular block can accept a match of the same character and, in doing so, reach the same next-block, they are equivalent.

如果您的常规语法变得复杂，您可以最小化。这里的基本思想比较简单。您将所有状态分组为一个等价类或“块”。然后，您反复测试是否需要针对特定转换类型拆分块（状态并不真正等效）。如果特定块中的所有状态都可以接受相同字符的匹配，并且在这样做时到达相同的下一个块，则它们是等效的。

Hopcrofts algorithm is an efficient way to handle this basic idea.

Hopcrofts 算法是处理这个基本思想的有效方法。

A particularly interesting thing about minimisation is that every deterministic finite automaton has precisely one minimal form. Furthermore, Hopcrofts algorithm will produce the same representation of that minimal form, no matter what representation of what larger case it started from. That is, this is a "canonical" representation which can be used to derive a hash or for arbitrary-but-consistent orderings. What this means is that you can use minimal automata as keys into containers.

关于最小化的一个特别有趣的事情是每个确定性有限自动机都恰好有一个最小形式。此外，Hopcrofts 算法将生成该最小形式的相同表示，无论它从什么更大的案例开始。也就是说，这是一种“规范”表示，可用于导出散列或任意但一致的排序。这意味着您可以使用最少的自动机作为进入容器的键。

The above is probably a bit sloppy WRT definitions, so make sure you look up any terms yourself before using them yourself, but with a bit of luck this gives a fair quick introduction to the basic ideas.

上面的 WRT 定义可能有点草率，因此请确保在自己使用任何术语之前自己查找它们，但幸运的是，这可以快速介绍基本思想。

BTW - have a look around the rest of Dick Grunes site- he has a free PDF book on parsing techniques. The first edition of Modern Compiler Design is pretty good IMO, but as you'll see, there's a second edition imminent.

顺便说一句 - 看看Dick Grunes 网站的其余部分- 他有一本关于解析技术的免费 PDF 书。Modern Compiler Design 的第一版是相当不错的 IMO，但正如您将看到的，第二版即将推出。

Answer 5

回答by A_Var

I do agree that writing a regex engine will improve understanding but have you taken a look at ANTLR??. It generates the parsers automatically for any kind of language. So maybe you can try your hand by taking one of the language grammars listed at Grammar examplesand run through the AST and parser that it generates. It generates a really complicated code but you will have a good understanding on how a parser works.

我确实同意编写正则表达式引擎会提高理解力，但您是否看过 ANTLR ？？。它为任何类型的语言自动生成解析器。因此，也许您可以尝试使用语法示例中列出的语言语法之一，并运行它生成的 AST 和解析器。它生成了一个非常复杂的代码，但您将对解析器的工作原理有很好的理解。

Python 为正则表达式编写解析器

提问by Chinmay Kanchi

采纳答案by Mark Byers

回答by Richard Fearn

回答by dhaffey

回答by Steve314

回答by A_Var

相关推荐

最近更新

标签

Python 为正则表达式编写解析器

提问by Chinmay Kanchi

采纳答案by Mark Byers

回答by Richard Fearn

回答by dhaffey

回答by Steve314

回答by A_Var

相关推荐

Python UnicodeEncodeError: 'charmap' 编解码器无法编码 - 字符映射到 <undefined>，打印函数

Python 日志记录删除/检查/修改由 fileConfig() 配置的处理程序

Python 如何在 WTForms 中生成动态字段

使用 IPython 对 Python 进行断点引发的交互式调试

相关推荐

最近更新

标签