为什么不能用 LR(1) 解析器解析 C++?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/243383/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Why can't C++ be parsed with a LR(1) parser?
提问by Cheery
I was reading about parsers and parser generators and found this statement in wikipedia's LR parsing -page:
我正在阅读有关解析器和解析器生成器的内容,并在维基百科的 LR parsing -page 中找到了此语句:
Many programming languages can be parsed using some variation of an LR parser. One notable exception is C++.
可以使用 LR 解析器的某些变体来解析许多编程语言。一个值得注意的例外是 C++。
Why is it so? What particular property of C++ causes it to be impossible to parse with LR parsers?
为什么会这样?C++ 的什么特殊属性导致它无法使用 LR 解析器进行解析?
Using google, I only found that C can be perfectly parsed with LR(1) but C++ requires LR(∞).
使用google,我只发现C可以用LR(1)完美解析,但C++需要LR(∞)。
采纳答案by Rob Walker
There is an interesting thread on Lambda the Ultimatethat discusses the LALR grammar for C++.
Lambda the Ultimate上有一个有趣的线程,讨论了C++的LALR 语法。
It includes a link to a PhD thesisthat includes a discussion of C++ parsing, which states that:
它包含指向博士论文的链接,其中包含对 C++ 解析的讨论,其中指出:
"C++ grammar is ambiguous, context-dependent and potentially requires infinite lookahead to resolve some ambiguities".
“C++ 语法是模棱两可的,依赖于上下文,并且可能需要无限前瞻来解决一些歧义”。
It goes on to give a number of examples (see page 147 of the pdf).
它继续给出了一些例子(见pdf的第147页)。
The example is:
例子是:
int(x), y, *const z;
meaning
意义
int x;
int y;
int *const z;
Compare to:
相比于:
int(x), y, new int;
meaning
意义
(int(x)), (y), (new int));
(a comma-separated expression).
(逗号分隔的表达式)。
The two token sequences have the same initial subsequence but different parse trees, which depend on the last element. There can be arbitrarily many tokens before the disambiguating one.
两个标记序列具有相同的初始子序列但不同的解析树,这取决于最后一个元素。在消除歧义的标记之前可以有任意多个标记。
回答by Ira Baxter
LR parsers can't handle ambiguous grammar rules, by design. (Made the theory easier back in the 1970s when the ideas were being worked out).
根据设计,LR 解析器无法处理歧义的语法规则。(在 1970 年代,当想法被制定出来时,使理论更容易)。
C and C++ both allow the following statement:
C 和 C++ 都允许以下语句:
x * y ;
It has two different parses:
它有两种不同的解析:
- It can be the declaration of y, as pointer to type x
- It can be a multiply of x and y, throwing away the answer.
- 它可以是 y 的声明,作为指向类型 x 的指针
- 它可以是 x 和 y 的乘积,丢弃答案。
Now, you might think the latter is stupid and should be ignored. Most would agree with you; however, there are cases where it might have a side effect (e.g., if multiply is overloaded). but that isn't the point. The point is there aretwo different parses, and therefore a program can mean different things depending on how this shouldhave been parsed.
现在,您可能认为后者很愚蠢,应该被忽略。大多数人会同意你的看法;但是,在某些情况下它可能会产生副作用(例如,如果乘法过载)。但这不是重点。问题的关键是有有两种不同的解析,因此程序可以根据如何表达不同的意思应该已经被解析。
The compiler must accept the appropriate one under the appropriate circumstances, and in the absence of any other information (e.g., knowledge of the type of x) must collect both in order to decide later what to do. Thus a grammar must allow this. And that makes the grammar ambiguous.
编译器必须在适当的情况下接受适当的一个,并且在没有任何其他信息(例如,x 的类型的知识)的情况下必须收集两者以便以后决定做什么。因此,语法必须允许这样做。这使得语法模棱两可。
Thus pure LR parsing can't handle this. Nor can many other widely available parser generators, such as Antlr, JavaCC, YACC, or traditional Bison, or even PEG-style parsers, used in a "pure" way.
因此,纯 LR 解析无法处理这个问题。许多其他广泛使用的解析器生成器,例如 Antlr、JavaCC、YACC 或传统的 Bison,甚至 PEG 样式的解析器,也不能以“纯粹”的方式使用。
There are lots of more complicated cases (parsing template syntax requires arbitrary lookahead, whereas LALR(k) can look ahead at most k tokens), but only it only takes one counterexample to shoot down pureLR (or the others) parsing.
有很多更复杂的情况(解析模板语法需要任意的前瞻,而 LALR(k) 最多可以前瞻 k 个标记),但只需要一个反例就可以击倒纯LR(或其他)解析。
Most real C/C++ parsers handle this example by using some kind of deterministic parser with an extra hack: they intertwine parsing with symbol table collection... so that by the time "x" is encountered, the parser knows if x is a type or not, and can thus choose between the two potential parses. But a parser that does this isn't context free, and LR parsers (the pure ones, etc.) are (at best) context free.
大多数真正的 C/C++ 解析器通过使用某种具有额外技巧的确定性解析器来处理这个示例:它们将解析与符号表集合交织在一起......这样当遇到“x”时,解析器就知道 x 是否是一个类型与否,因此可以在两个潜在的解析之间进行选择。但是执行此操作的解析器不是上下文无关的,而 LR 解析器(纯解析器等)(充其量)是上下文无关的。
One can cheat, and add per-rule reduction-time semantic checks in the to LR parsers to do this disambiguation. (This code often isn't simple). Most of the other parser types have some means to add semantic checks at various points in the parsing, that can be used to do this.
可以作弊,并在 LR 解析器中添加每个规则减少时间的语义检查来消除歧义。(这段代码通常并不简单)。大多数其他解析器类型都有一些方法可以在解析的各个点添加语义检查,可用于执行此操作。
And if you cheat enough, you can make LR parsers work for C and C++. The GCC guys did for awhile, but gave it up for hand-coded parsing, I think because they wanted better error diagnostics.
如果你足够作弊,你可以让 LR 解析器为 C 和 C++ 工作。GCC 人员做了一段时间,但放弃了手工编码解析,我想是因为他们想要更好的错误诊断。
There's another approach, though, which is nice and clean and parses C and C++ just fine without any symbol table hackery: GLR parsers. These are full context free parsers (having effectively infinite lookahead). GLR parsers simply accept bothparses, producing a "tree" (actually a directed acyclic graph that is mostly tree like) that represents the ambiguous parse. A post-parsing pass can resolve the ambiguities.
不过,还有另一种方法,它既漂亮又干净,可以很好地解析 C 和 C++,无需任何符号表技巧:GLR 解析器。这些是完全上下文无关的解析器(具有有效的无限前瞻)。GLR 解析器简单地接受这两种解析,生成一个“树”(实际上是一个主要是树状的有向无环图)来表示歧义解析。解析后传递可以解决歧义。
We use this technique in the C and C++ front ends for our DMS Software Reengineering Tookit (as of June 2017 these handle full C++17 in MS and GNU dialects). They have been used to process millions of lines of large C and C++ systems, with complete, precise parses producing ASTs with complete details of the source code. (See the AST for C++'s most vexing parse.)
我们在 DMS 软件再造工具包的 C 和 C++ 前端中使用了这种技术(截至 2017 年 6 月,这些技术在 MS 和 GNU 方言中处理完整的 C++17)。它们已被用于处理数百万行大型 C 和 C++ 系统,通过完整、精确的解析生成具有完整源代码细节的 AST。(请参阅AST 以了解 C++ 最令人头疼的解析。)
回答by reuns
The problem is never defined like this, whereas it should be interesting :
这个问题从来没有像这样定义过,但它应该很有趣:
what is the smallest set of modifications to C++ grammar that would be necessary so that this new grammar could be perfectly parsed by a "non-context-free" yacc parser ? (making use only of one 'hack' : the typename/identifier disambiguation, the parser informing the lexer of every typedef/class/struct)
为了让“非上下文无关”yacc 解析器可以完美解析这个新语法,对 C++ 语法进行的最小修改是什么?(仅使用一个“hack”:类型名/标识符消歧,解析器通知每个类型定义/类/结构的词法分析器)
I see a few ones:
我看到几个:
Type Type;
is forbidden. An identifier declared as a typename cannot become a non-typename identifier (note thatstruct Type Type
is not ambiguous and could be still allowed).There are 3 types of
names tokens
:types
: builtin-type or because of a typedef/class/struct- template-functions
- identifiers : functions/methods and variables/objects
Considering template-functions as different tokens solves the
func<
ambiguity. Iffunc
is a template-function name, then<
must be the beginning of a template parameter list, otherwisefunc
is a function pointer and<
is the comparison operator.Type a(2);
is an object instantiation.Type a();
andType a(int)
are function prototypes.int (k);
is completely forbidden, should be writtenint k;
typedef int func_type();
andtypedef int (func_type)();
are forbidden.A function typedef must be a function pointer typedef :
typedef int (*func_ptr_type)();
template recursion is limited to 1024, otherwise an increased maximum could be passed as an option to the compiler.
int a,b,c[9],*d,(*f)(), (*g)()[9], h(char);
could be forbidden too, replaced byint a,b,c[9],*d;
int (*f)();
int (*g)()[9];
int h(char);
one line per function prototype or function pointer declaration.
An highly preferred alternative would be to change the awful function pointer syntax,
int (MyClass::*MethodPtr)(char*);
being resyntaxed as:
int (MyClass::*)(char*) MethodPtr;
this being coherent with the cast operator
(int (MyClass::*)(char*))
typedef int type, *type_ptr;
could be forbidden too : one line per typedef. Thus it would becometypedef int type;
typedef int *type_ptr;
sizeof int
,sizeof char
,sizeof long long
and co. could be declared in each source file. Thus, each source file making use of the typeint
should begin with#type int : signed_integer(4)
and
unsigned_integer(4)
would be forbidden outside of that#type
directive this would be a big step into the stupidsizeof int
ambiguity present in so many C++ headers
Type Type;
是禁止的。声明为 typename 的标识符不能成为非 typename 标识符(请注意,这struct Type Type
不是模棱两可的,仍然可以被允许)。有 3 种类型
names tokens
:types
: 内置类型或因为 typedef/class/struct- 模板函数
- 标识符:函数/方法和变量/对象
将模板函数视为不同的标记可以解决
func<
歧义。如果func
是模板函数名,则<
必须是模板参数列表的开头,否则func
是函数指针并且<
是比较运算符。Type a(2);
是一个对象实例化。Type a();
并且Type a(int)
是函数原型。int (k);
完全禁止,应该写int k;
typedef int func_type();
并且typedef int (func_type)();
被禁止。函数 typedef 必须是函数指针 typedef :
typedef int (*func_ptr_type)();
模板递归限制为 1024,否则可以将增加的最大值作为选项传递给编译器。
int a,b,c[9],*d,(*f)(), (*g)()[9], h(char);
也可以被禁止,替换为int a,b,c[9],*d;
int (*f)();
int (*g)()[9];
int h(char);
每个函数原型或函数指针声明一行。
一个非常受欢迎的替代方法是更改糟糕的函数指针语法,
int (MyClass::*MethodPtr)(char*);
被重新语法化为:
int (MyClass::*)(char*) MethodPtr;
这与强制转换运算符一致
(int (MyClass::*)(char*))
typedef int type, *type_ptr;
也可能被禁止:每个 typedef 一行。这样就会变成typedef int type;
typedef int *type_ptr;
sizeof int
,sizeof char
,sizeof long long
和 co。可以在每个源文件中声明。因此,每个使用该类型的源文件int
都应以#type int : signed_integer(4)
并且
unsigned_integer(4)
会在该#type
指令之外被禁止,这将是进入sizeof int
如此多 C++ 头文件中存在的愚蠢歧义的一大步
The compiler implementing the resyntaxed C++ would, if encountering a C++ source making use of ambiguous syntax, move source.cpp
too an ambiguous_syntax
folder, and would create automatically an unambiguous translated source.cpp
before compiling it.
实现重新语法的 C++ 的编译器,如果遇到使用歧义语法的 C++ 源代码,source.cpp
也会移动一个ambiguous_syntax
文件夹,并source.cpp
在编译之前自动创建一个明确的翻译。
Please add your ambiguous C++ syntaxes if you know some!
如果您知道一些不明确的 C++ 语法,请添加!
回答by Sam Harwell
As you can see in my answer here, C++ contains syntax that cannot be deterministically parsed by an LL or LR parser due to the type resolution stage (typically post-parsing) changing the order of operations, and therefore the fundamental shape of the AST (typically expected to be provided by a first-stage parse).
正如您在我的回答中看到的那样,C++ 包含的语法无法由 LL 或 LR 解析器确定性地解析,因为类型解析阶段(通常是解析后)改变了操作顺序,因此改变了 AST 的基本形状(通常预期由第一阶段解析提供)。
回答by casademora
I think you are pretty close to the answer.
我认为你非常接近答案。
LR(1) means that parsing from left to right needs only one token to look-ahead for the context, whereas LR(∞) means an infinite look-ahead. That is, the parser would have to know everything that was coming in order to figure out where it is now.
LR(1) 意味着从左到右的解析只需要一个标记来向前看上下文,而 LR(∞) 意味着无限向前看。也就是说,解析器必须知道即将到来的一切,才能弄清楚它现在在哪里。
回答by casademora
The "typedef" problem in C++ can be parsed with an LALR(1) parser that builds a symbol table while parsing (not a pure LALR parser). The "template" problem probably cannot be solved with this method. The advantage of this kind of LALR(1) parser is that the grammar (shown below) is an LALR(1) grammar (no ambiguity).
C++ 中的“typedef”问题可以用 LALR(1) 解析器解析,该解析器在解析时构建符号表(不是纯 LALR 解析器)。“模板”问题可能无法用这种方法解决。这种 LALR(1) 解析器的优点是文法(如下所示)是 LALR(1) 文法(没有歧义)。
/* C Typedef Solution. */
/* Terminal Declarations. */
<identifier> => lookup(); /* Symbol table lookup. */
/* Rules. */
Goal -> [Declaration]... <eof> +> goal_
Declaration -> Type... VarList ';' +> decl_
-> typedef Type... TypeVarList ';' +> typedecl_
VarList -> Var /','...
TypeVarList -> TypeVar /','...
Var -> [Ptr]... Identifier
TypeVar -> [Ptr]... TypeIdentifier
Identifier -> <identifier> +> identifier_(1)
TypeIdentifier -> <identifier> =+> typedefidentifier_(1,{typedef})
// The above line will assign {typedef} to the <identifier>,
// because {typedef} is the second argument of the action typeidentifier_().
// This handles the context-sensitive feature of the C++ language.
Ptr -> '*' +> ptr_
Type -> char +> type_(1)
-> int +> type_(1)
-> short +> type_(1)
-> unsigned +> type_(1)
-> {typedef} +> type_(1)
/* End Of Grammar. */
The following input can be parsed without a problem:
可以毫无问题地解析以下输入:
typedef int x;
x * y;
typedef unsigned int uint, *uintptr;
uint a, b, c;
uintptr p, q, r;
The LRSTAR parser generatorreads the above grammar notation and generates a parser that handles the "typedef" problem without ambiguity in the parse tree or AST. (Disclosure: I am the guy who created LRSTAR.)
所述LRSTAR解析器生成器读取上述语法表示法,并产生一个解析器手柄而不在分析树或AST歧义“的typedef”的问题。(披露:我是创建 LRSTAR 的人。)