Java 中的递归下降解析器
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/14913804/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Recursive Descent Parser in Java
提问by user1972748
I would like to preface this by saying this is a homework assignment for my third Year Programming Languages Class, and I'm looking for some help with it. My assignment reads:
我想先说这是我三年级编程语言课的家庭作业,我正在寻找一些帮助。我的作业是这样写的:
Deadline: February 22, 2013 at 11:55pm
Submission: Please upload the following to the CMS.
1. Source code
2. A screen shot of the execution of your program including the input file you usedUse any programming language you prefer to write a recursive-descent parser that parses the language generated by the following EBNF descriptions. Your parser should detect whether or not the input program has any syntax errors. It does not have to specify what and where the error is.
截止日期:2013 年 2 月 22 日晚上 11:55
提交:请将以下内容上传到 CMS。
1. 源代码
2. 程序执行的屏幕截图,包括您使用的输入文件使用您喜欢的任何编程语言编写递归下降解析器,该解析器解析由以下 EBNF 描述生成的语言。您的解析器应该检测输入程序是否有任何语法错误。它不必指定错误的内容和位置。
<program> ? begin <stmt_list> end
<stmt_list> ? <stmt> {;<stmt_list>}
<stmt> ? <assign_stmt> | <while_stmt>
<assign_stmt> ? <var> = <expr>
<var> ? identifier (An identifier is a string that begins with a letter followed by 0 or more letters and digits)
<expr> ? <var> { (+|-) <var>}
<while_stmt> ? while (<logic_expr>) <stmt>
<logic_expr> ? <var> (< | >) <var> (Assume that logic expressions have only less than or greater than operators)
The symbols that look funny are just arrows pointing to the right.
看起来很有趣的符号只是指向右侧的箭头。
My problem at the moment is more logical then it is programming: in my first attempt, I read the entire input program in, saved it to a string, then parsed that string and converted every symbol to either a terminal, expr, or what have you.
我目前的问题比编程更合乎逻辑:在我的第一次尝试中,我读取了整个输入程序,将其保存为一个字符串,然后解析该字符串并将每个符号转换为终端、expr 或具有你。
I eventually found that this way would not work because, A: I don't think that it is RDP, B: many of the non terminals are made of more then 1 statement.
我最终发现这种方式行不通,因为,A:我不认为它是 RDP,B:许多非终端都是由 1 个以上的语句组成的。
I gave up on that approach, and decided that before I waste more time programming, I would Pseudo everything out. My new idea was to make 1 method, for each non terminal symbol, and just parse the input string symbol by symbol, hoping between those methods. This approach seemed logical, but as I started writing the pseudocode I got very lost and confused as to what I needed to do. How would I finish this code?
我放弃了这种方法,并决定在我浪费更多时间编程之前,我将 Pseudo 一切都解决掉。我的新想法是为每个非终结符创建 1 个方法,然后逐个符号解析输入字符串,希望在这些方法之间进行。这种方法似乎合乎逻辑,但是当我开始编写伪代码时,我对我需要做什么感到非常迷茫和困惑。 我将如何完成此代码?
Here is some pseudocode for RDP:
下面是一些 RDP 的伪代码:
intputString;
public void parseProgram (Symbol.typeIsProgram) {
if getNextSymbol == "begin" {
if (intputString.substring (inputString.length()-3,
inputString.length()) == "end") {
Symbol stmt_lsit = new Symbol (intputString)
parseStmt_list(stmt_list);
} else {
Out "error, prog must end with end"
}
} else {
Out "error, prog must begin with begin"
}
}
public void parseStmt_list (Stmbol.typeIsStmt_list) {
symbol = getNextSymbol;
if (Symbol.typeIsVar) {
parseVar(symbol)
} else if (Symbol.typeIsWhile) {
// weve only capture if the first word return is a while, we dont have the whole while statement yet
ParseWhile_stmt(symbol)
} else { }
}
public void parseStmt () { }
public void parseAssign_stmt () { }
public void parseVar () { }
public void parseExpr () { }
public void parseWhile_stmt () { }
public void parseLogic_expr () { }
public Symbol getNextSymbol() {
//returns the next symbol in input string and removes it from the input string
}
Just an FYI a sample input program for my parser would be.
仅供参考,我的解析器的示例输入程序将是。
begin
total = var1 + var2;
while (var1 < var2)
while ( var3 > var4)
var2 = var2 - var1
end
回答by Dmitry Tikhonov
According to this particular assignment it is alright to use string processing in a functional way.
根据这个特定的任务,以功能方式使用字符串处理是可以的。
Try this algorithm:
试试这个算法:
- check, that line begins with
begin
and ends withend
- if yes - split remaining string with
;
and do the following steps for each part (statement) - check if statement consists a
=
orwhile
substring - for assignment check you can split an input with
+
or-
and for each part check the variable condition (alphanumeric content) - for while - check for
(
and)
and recursively process two strings between brackets and after it - finally, for logical expression split by substring
<
or>
, and check that parts are variables (alphanumeric)
- 检查,该行
begin
以end
- 如果是 - 拆分剩余的字符串
;
并为每个部分(语句)执行以下步骤 - 检查语句是否包含一个
=
或while
子字符串 - 对于赋值检查,您可以使用
+
或拆分输入,-
并为每个部分检查变量条件(字母数字内容) - for while - 检查
(
和)
并递归处理括号之间和后面的两个字符串 - 最后,对于由子字符串
<
或分割的逻辑表达式>
,并检查部分是变量(字母数字)
Maybe, it is not very flexible solution, but as I see it is acceptable for your task.
也许,这不是很灵活的解决方案,但我认为它可以满足您的任务。
EDIT:
编辑:
I found the task interesting and tried to write a complete solution.
我发现这个任务很有趣,并试图编写一个完整的解决方案。
public enum Parser {
PROGRAM {
void parse(String s) throws ParseException {
s = s.trim();
if (s.startsWith("begin") && s.endsWith("end")) {
STATEMENT_LIST.parse(s.substring(5, s.length() - 3));
} else {
throw new ParseException("Illegal begin/end");
}
}
},
STATEMENT_LIST {
void parse(String s) throws ParseException {
String[] parts = s.trim().split(";");
for (String part : parts) {
STATEMENT.parse(part.trim());
}
}
},
STATEMENT {
void parse(String s) throws ParseException {
if (s.startsWith("while")) {
WHILE.parse(s.substring(5));
} else if (s.contains("=")) {
ASSIGNMENT.parse(s);
} else {
throw new ParseException("Illegal statement: " + s);
}
}
},
WHILE {
void parse(String s) throws ParseException {
int i = s.indexOf("(");
int j = s.indexOf(")");
if (i != -1 && j != -1) {
LOGICAL.parse(s.substring(i + 1, j).trim());
STATEMENT.parse(s.substring(j + 1).trim());
} else {
throw new ParseException("Illegal while: " + s);
}
}
},
ASSIGNMENT {
void parse(String s) throws ParseException {
int i = s.indexOf("=");
if (i != -1) {
VARIABLE.parse(s.substring(0, i).trim());
EXPRESSION.parse(s.substring(i + 1).trim());
}
}
},
EXPRESSION {
void parse(String s) throws ParseException {
String[] parts = s.split("\+|-");
for (String part : parts) {
VARIABLE.parse(part.trim());
}
}
},
LOGICAL {
void parse(String s) throws ParseException {
int i;
if (s.contains("<")) {
i = s.indexOf("<");
} else if (s.contains(">")) {
i = s.indexOf(">");
} else {
throw new ParseException("Illegal logical: " + s);
}
VARIABLE.parse(s.substring(0, i).trim());
VARIABLE.parse(s.substring(i + 1).trim());
}
},
VARIABLE {
void parse(String s) throws ParseException {
if (!s.matches("^[a-zA-Z][a-zA-Z0-9]*$")) {
throw new ParseException("Illegal variable: " + s);
}
}
};
abstract void parse(String s) throws ParseException;
public static void main(String[] args) {
try {
PROGRAM.parse("begin \n" +
"total = var1 + var2; \n" +
"while (var1 < var2) \n" +
"while ( var3 > var4)\n" +
"var2 = var2 - var1 \n" +
"end");
System.out.println("OK");
} catch (ParseException e) {
System.out.println(e.getMessage());
}
}
}
class ParseException extends Exception {
public ParseException(String message) {
super(message);
}
}
回答by user1972748
1) Tokenization
1) 代币化
First break the input into tokens. In this case, every identifier and operator and literal. Make a big list of all the input tokens. Once you have tokens then you can start parsing. Make the tokens a linked list, so you can just say Token.Next to read the next token or Token.Next.Next to read 2 tokens ahead. Put a bunch of EOF tokens at the end so you will never skip past it.
首先将输入分解为标记。在这种情况下,每个标识符和运算符和文字。列出所有输入标记的大列表。一旦你有了令牌,你就可以开始解析了。使令牌成为一个链表,所以你可以说 Token.Next 读取下一个令牌或 Token.Next.Next 读取前面的 2 个令牌。在最后放一堆 EOF 令牌,这样你就永远不会跳过它。
2) Parsing
2)解析
Parsing is like what you have already. So instead of thinking Symbol think Token. The Parse Statements list is a while loop that keeps parsing statements until the end.
解析就像你已经拥有的一样。因此,与其思考 Symbol,不如思考 Token。Parse Statements 列表是一个 while 循环,它会一直解析语句直到结束。
For Parse Statement
对于解析语句
public void parseStmt ()
{
if (Token.kind == KEYWORD && Token.id == kw_while) {
return ParseWhileStatement();
}
else {
return ParseAssignStatement();
}
}
Parsing the while statement will loop back to parse statements and thus it will "recursively descend" back into Parse statement, yielding nested while loops etc...
解析 while 语句将循环回 parse 语句,因此它将“递归下降”回 Parse 语句,产生嵌套的 while 循环等......
Parsing the assignment statement is very similar. PArse the Left side and then the right side. You need a bunch of functions for that....
解析赋值语句非常相似。解析左侧,然后解析右侧。你需要一堆功能......
The Node here is an Ast node. Abstract Syntax Tree.
这里的 Node 是一个 Ast 节点。抽象语法树。
Stuff like...
之类的...
class Node {
}
class OpNode {
OpNode Lhs;
OpNode Rhs;
}
class MultiplyNode : OpNode {
MultiplyNode(byref Token tok, OpNode Left, OpNode right) {
tok = tok.Next;
Lhs = left;
Rhs = right;
}
}
Node ParseSimpleExp() {
if (Token.kind == Identifier) {
Node n = new IdentNode;
NextToken();
return n;
}
if (Token.kind == Number) {
Node n = new NumberNode;
NextToken();
return n;
}
}
// In these examples move the token to next token when you create the new nodes
Node ParseMulExp() {
Node n = ParseSimpleExp();
while (1) {
if (Token.Kind == Multiply) {
n = new MultiplyNode(Token,n,ParseSimpleExp());
continue;
}
if (Token.Kind == Divide) {
n = new DivideNode(Token,n,ParseSimpleExp());
continue;
}
return n;
}
}
Node ParseAddExp() {
Node n = ParseMulExp();
while (1) {
if (Token.Kind == Add) {
n = new AddNode(Token,n,ParseMulExp());
continue;
}
if (Token.Kind == Subtract) {
n = new SubtractNode(Token,n,ParseMulExp());
continue;
}
return n;
}
}
Node ParseAssignStatement() {
Node n = ParseAddExp();
if (Token.kind == ASSIGN) {
return new AssignStatement(Token,n,ParseAddExp());
}
}
If you follow the logic you can see how the precedence rules are followed by recursively arriving at each target. Parsing expressions and starting from the assign is not a loop. It is like shown here. Obviously this is to simple, but it shows the technique.
如果您遵循逻辑,您可以看到如何遵循优先规则以递归方式到达每个目标。解析表达式并从赋值开始不是循环。就像这里显示的一样。显然这很简单,但它显示了技术。
So the RDP is caused by looking at the current token and then jumping to some function to handle token. Naturally this can come back to the same function hence is recursive. If you look at the ParseSimpleExp function then you can see that is a good place to maybe handle a parenthesized expression. A parens expression will cause recursion back to simple exp and possibly all the others like mul and add.
所以RDP是通过查看当前token然后跳转到某个函数来处理token引起的。自然这可以回到相同的函数,因此是递归的。如果您查看 ParseSimpleExp 函数,您会发现这是处理括号表达式的好地方。parens 表达式将导致递归返回到简单的 exp 以及可能的所有其他表达式,例如 mul 和 add。
回答by Wolfgang Fahl
The structure of your parser code should resemble the structure of the language syntax. E.g.
解析器代码的结构应该类似于语言语法的结构。例如
<program> ::= begin <stmt_list> end
would translate into something like
会翻译成类似的东西
function parse_program() {
parse_begin();
repeat parse_stmt();
parse_end();
}
You might want not to confuse the token handling (scanner) with the parsing of the structure (parser).
您可能不希望将令牌处理(扫描器)与结构解析(解析器)混淆。
I'd go with exception handling instead of if / else structures for error handling. You might want to keep track of where you are in the source (scanner) part for displaying a proper error message. Just ask the scanner for its state.
我会使用异常处理而不是 if / else 结构进行错误处理。您可能希望跟踪您在源(扫描仪)部分中的位置以显示正确的错误消息。只需询问扫描仪的状态即可。
The assignment fortunately seems to need no conflict resolution so your recursive decent should work well. The interesting part is where in the parsing of
幸运的是,该任务似乎不需要解决冲突,因此您的递归体面应该可以很好地工作。有趣的部分是在解析
<while_stmt> ::= while (<logic_expr>) <stmt>
you'll end up to call parse_stmt()
recursively. This is what the whole idea of recursive descent parsing is about.
你最终会parse_stmt()
递归调用。这就是递归下降解析的整个思想。