Java 使用标记列表构建抽象语法树

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/25049751/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-10 23:30:39  来源:igfitidea点击:

Constructing an Abstract Syntax Tree with a list of Tokens

javainterpreterabstract-syntax-tree

提问by metro-man

I want to construct an AST from a list of tokens. I'm making a scripting language and I've already done the lexical analysis part, but I have no idea how to create an AST. So the question is, how do I take something like this:

我想从令牌列表中构建一个 AST。我正在制作一种脚本语言,并且已经完成了词法分析部分,但我不知道如何创建 AST。所以问题是,我该如何处理这样的事情:

WORD, int
WORD, x
SYMBOL, =
NUMBER, 5
SYMBOL, ;

and convert it into an Abstract Syntax Tree? Preferably, I'd like to do so withouta library like ANTLR or whatever, I'd rather try and do it from scratch myself. However, if it's a really complex task, I don't mind using a library :) Thanks

并将其转换为抽象语法树?最好,我希望在没有像 ANTLR 之类的库的情况下这样做,我宁愿自己尝试从头开始。但是,如果这是一项非常复杂的任务,我不介意使用库:) 谢谢

采纳答案by Ira Baxter

The fundamental trick is to recognize that parsing, however accomplished, happens in incremental steps, including the reading of the tokens one by one.

基本的技巧是认识到解析,无论如何完成,都是以增量步骤发生的,包括一个一个地读取标记。

At each incremental step, there is an opportunity to build part of the AST by combining AST fragments built by other incremental steps. This is a recursive idea, and it bottoms out in building AST leaf nodes for tokens as they are scanned. This basic idea occurs in pretty much all AST-building parsers.

在每个增量步骤中,都有机会通过组合其他增量步骤构建的 AST 片段来构建 AST 的一部分。这是一个递归的想法,它在为被扫描的令牌构建 AST 叶节点时达到最低点。这个基本思想几乎出现在所有构建 AST 的解析器中。

If one builds a recursive descent parser, one in effect builds a cooperating system of recursive procedures, each one of which recognizes a nonterminal in whatever grammar is being implemented. For pure parsing, each procedure simply returns a boolean for "nonterminal (not) recognized".

如果构建一个递归下降解析器,那么实际上构建了一个递归过程的协作系统,每个递归过程都识别出正在实现的任何语法中的非终结符。对于纯解析,每个过程只返回一个布尔值来表示“非终端(未)识别”。

To build an AST with a recursive descent parser, one designs these procedures to return two values: the boolean "recognized", and, if recognized, an AST constructed (somehow) for the nonterminal. (A common hack is return a pointer, which is void for "not recognized", or points to the constructed AST if "recognized"). The way the resulting AST for a single procedure is built, is by combining the ASTs from the sub-procedures that it invokes. This is pretty trivial to do for leaf procedures, which read an input token and can immediately build a tree.

要使用递归下降解析器构建 AST,可以设计这些过程以返回两个值:“已识别”布尔值,以及(如果被识别)为非终结符(以某种方式)构造的 AST。(一个常见的技巧是返回一个指针,对于“未识别”是无效的,或者如果“已识别”则指向构造的 AST)。构建单个过程的结果 AST 的方式是组合来自它调用的子过程的 AST。这对于叶子过程来说是非常简单的,它读取输入标记并可以立即构建一棵树。

The downside to all this is one must manually code the recursive descent, and augment it with the tree building steps. In the grand scheme of things, this is actually pretty easy to code for small grammars.

所有这一切的缺点是必须手动编码递归下降,并通过树构建步骤对其进行扩充。从总体上看,这实际上很容易为小语法编写代码。

For OP's example, assume we have this grammar:

对于 OP 的示例,假设我们有以下语法:

GOAL = ASSIGNMENT 
ASSIGNMENT = LHS '=' RHS ';' 
LHS = IDENTIFIER 
RHS = IDENTIFIER | NUMBER

OK, our recursive descent parser:

好的,我们的递归下降解析器:

boolean parse_Goal()
{  if parse_Assignement()
   then return true
   else return false
}

boolean parse_Assignment()
{  if not Parse_LHS()
   then return false
   if not Parse_equalsign()
   then throw SyntaxError // because there are no viable alternatives from here
   if not Parse_RHS()
   then throw SyntaxError
   if not Parse_semicolon()
   the throw SyntaxError
   return true
}

boolean parse_LHS()
{  if parse_IDENTIFIER()
   then return true
   else return false
}

boolean parse_RHS()
{  if parse_IDENTIFIER()
   then return true
   if parse_NUMBER()
   then return true
   else return false
}

boolean parse_equalsign()
{  if TestInputAndAdvance("=")  // this can check for token instead
   then return true
   else return false
}

boolean parse_semicolon()
{  if TestInputAndAdvance(";")
   then return true
   else return false
}

boolean parse_IDENTIFIER()
{  if TestInputForIdentifier()
   then return true
   else return false
}

boolean parse_NUMBER()
{  if TestInputForNumber()
   then return true
   else return false
}

Now, let's revise it build a abstract syntax tree:

现在,让我们修改它构建一个抽象语法树:

AST* parse_Goal() // note: we choose to return a null pointer for "false"
{  node = parse_Assignment()
   if node != NULL
   then return node
   else return NULL
}

AST* parse_Assignment()
{  LHSnode = Parse_LHS()
   if LHSnode == NULL
   then return NULL
   EqualNode = Parse_equalsign()
   if EqualNode == NULL
   then throw SyntaxError // because there are no viable alternatives from here
   RHSnode = Parse_RHS()
   if RHSnode == NULL
   then throw SyntaxError
   SemicolonNode = Parse_semicolon()
   if SemicolonNode == NULL
   the throw SyntaxError
   return makeASTNode(ASSIGNMENT,LHSNode,RHSNode)
}

AST* parse_LHS()
{  IdentifierNode = parse_IDENTIFIER()
   if node != NULL
   then return IdentifierNode
   else return NULL
}

AST* parse_RHS()
{  RHSnode = parse_IDENTIFIER()
   if RHSnode != null
   then return RHSnode
   RHSnode = parse_NUMBER()
   if RHSnode != null
   then return RHSnode
   else return NULL
}

AST* parse_equalsign()
{  if TestInputAndAdvance("=")  // this can check for token instead
   then return makeASTNode("=")
   else return NULL
}

AST* parse_semicolon()
{  if TestInputAndAdvance(";")
   then return makeASTNode(";")
   else return NULL
}

AST* parse_IDENTIFIER()
{  text = TestInputForIdentifier()
   if text != NULL
   then return makeASTNode("IDENTIFIER",text)
   else return NULL
}

AST* parse_NUMBER()
{  text = TestInputForNumber()
   if text != NULL
   then return makeASTNode("NUMBER",text)
   else return NULL
}

I've obviously glossed over some details, but I assume the reader will have no trouble filling them in.

我显然掩盖了一些细节,但我认为读者填写它们不会有困难。

Parser generator tools like JavaCC and ANTLR basically generate recursive descent parsers, and have facilities for constructing trees that work very much like this.

解析器生成器工具如 JavaCC 和 ANTLR 基本上生成递归下降解析器,并且具有构建树的工具,其工作方式与此非常相似。

Parser generator tools that build bottom-up parsers (YACC, Bison, GLR, ...) also build AST nodes in the same style. However, there is no set of recursive functions; instead, a stack of tokens seen and reduced-to nonterminals is managed by these tools. The AST nodes are constructed on a parallel stack; when a reduction occurs, the AST nodes on the part of the stack covered by the reduction are combined to produce a nonterminal AST node to replace them. This happens with "zero-size" stack segments for grammar rules which are empty too causing AST nodes (typically for 'empty list' or 'missing option') to seemingly appear from nowhere.

构建自底向上解析器(YACC、Bison、GLR 等)的解析器生成器工具也以相同的方式构建 AST 节点。但是,没有一组递归函数;取而代之的是,这些工具可以管理一堆可见的和简化为非终结符的令牌。AST 节点构建在并行堆栈上;当发生归约时,被归约覆盖的堆栈部分上的 AST 节点被组合以产生一个非终端 AST 节点来替换它们。这种情况发生在语法规则的“零大小”堆栈段中,这些堆栈段也是空的,导致 AST 节点(通常用于“空列表”或“缺失选项”)似乎无处可见。

With bitty languages, writing recursive-descent parsers that build trees is pretty practical.

使用 bitty 语言,编写构建树的递归下降解析器非常实用。

A problem with real languages (whether old and hoary like COBOL or hot and shiny like Scala) is that the number of grammar rules is pretty large, complicated by the sophistication of the language and the insistence on whatever language committee is in charge of it to perpetually add new goodies offered by other languages ("language envy", see the evolutionary race between Java, C# and C++). Now writing a recursive descent parser gets way out of hand and one tends to use parser generators. But even with a parser generator, writing all the custom code to build AST nodes is also a big battle (and we haven't discussed what it takes to design a good "abstract" syntax vs. the first thing that comes to mind). Maintaining grammar rules and AST building goo gets progressively harder with scale and ongoing evolution. (If your language is successful, within a year you'll want to change it). So even writing the AST building rules gets awkward.

真正的语言(无论是像 COBOL 那样陈旧陈旧,还是像 Scala 那样炙手可热)的一个问题是语法规则的数量非常多,而且由于语言的复杂性以及对负责它的任何语言委员会的坚持而变得复杂不断添加其他语言提供的新东西(“语言嫉妒”,请参阅 Java、C# 和 C++ 之间的进化竞赛)。现在编写递归下降解析器变得一发不可收拾,人们倾向于使用解析器生成器。但即使使用解析器生成器,编写所有自定义代码来构建 AST 节点也是一场大战(我们还没有讨论设计一个好的“抽象”语法与首先想到的东西相比需要什么)。随着规模和持续发展,维护语法规则和 AST 构建 goo 变得越来越困难。(如果你的语言成功了,一年之内你会想要改变它)。因此,即使编写 AST 构建规则也会变得尴尬。

Ideally, one would just like to write a grammar, and get a parser and tree. You can do this with some recent parser generators: Our DMS Software Reengineering Toolkit accepts full context free grammars, and automatically constructs an AST, no work on the grammar engineer's part; its been doing this since 1995. The ANTLR guys finally figured this out in 2014, and ANTLR4 now offers an option like this.

理想情况下,人们只想写一个语法,然后得到一个解析器和树。您可以使用一些最近的解析器生成器来做到这一点:我们的 DMS 软件再造工具包接受完整的上下文无关语法,并自动构建一个 AST,无需语法工程师的工作;它自 1995 年以来一直在这样做。 ANTLR 人员终于在 2014 年弄清楚了这一点,现在 ANTLR4 提供了这样的选项。

Last point: having a parser (even with an AST) is hardly a solution to the actual problem you set out to solve, whatever it was. Its just a foundation piece, and much to the shock for most parser-newbies, it is the smallestpart to a tool that manipulates code. Google my essay on Life After Parsing (or check my bio) for more detail.

最后一点:拥有解析器(即使使用 AST)也很难解决您要解决的实际问题,无论它是什么。它只是一个基础部分,对于大多数解析器新手来说非常震惊,它是操作代码的工具的最小部分。谷歌我关于解析后的生活的文章(或查看我的简历)了解更多细节。

回答by jv110

It's not hard at all; in fact, it's one of the easiest things I've done. The general idea is that each structure (aka parser rules) is just a list of other structures, and when a parse() function is called, they just loop through their children, and tell them to parse. This isn't an infinite loop; tokens are structures, and when their parse() is called, they scan the lexer output. They should also have a name for identification, but this is not required. parse() would normally return a parse tree. Parse trees are just like structures - lists of children. It's also good to have a "text" field, and its parent structure, for identification. Here's an example (you'd want to organize it better and handle the null for real projects):

这一点都不难;事实上,这是我做过的最简单的事情之一。一般的想法是每个结构(又名解析器规则)只是其他结构的列表,当调用 parse() 函数时,它们只是循环遍历它们的子代,并告诉它们进行解析。这不是一个无限循环;标记是结构,当它们的 parse() 被调用时,它们会扫描词法分析器的输出。他们还应该有一个用于识别的名称,但这不是必需的。parse() 通常会返回一个解析树。解析树就像结构 - 孩子的列表。有一个“文本”字段及其父结构用于识别也很好。这是一个示例(您希望更好地组织它并处理实际项目的空值):

public void push(ParseTree tree) { // ParseTree
    children.add(tree);
    text += tree.text;
}

public ParseTree parse() { // Structure
    ParseTree tree = new ParseTree(this);
    for(Structure st: children) {
        tree.push(st.parse());
    }
    return tree;
}

public ParseTree parse() { // Token
    if(!lexer.nextToken() || !matches(lexer.token))
        return null;
    ParseTree tree = new ParseTree(this);
    tree.text = lexer.token;
    return tree;
}

There. Call the main structure's parse(), and you got an AST. Of course, this is a very barebones example, and won't work out of the box. It's also useful to have "modifiers"; e.g. match child 3 one or more times, child 2 is optional. That's also easy to do; store them in an array the same size as your child count, and when parsing, check for it:

那里。调用主结构的 parse(),你会得到一个 AST。当然,这是一个非常准系统的例子,不能开箱即用。拥有“修饰符”也很有用;例如匹配孩子 3 一次或多次,孩子 2 是可选的。这也很容易做到;将它们存储在一个与您的孩子数量相同的数组中,并在解析时检查它:

public void setModifier(int id, int mod) {
    mods[id] = mod;
}

public ParseTree parse() {
    ...
    ParseTree t;
    switch(mods[i]) {
        case 1: // Optional
            if((t = st.parse()) != null) tree.push(t);
        case 2: // Zero or more times
            while((t = st.parse()) != null) tree.push(t);
        ...
        default:
            tree.push(st.parse());
    }
    ...
}