C语言 GCC 和 Clang 解析器真的是手写的吗?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/6319086/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Are GCC and Clang parsers really handwritten?
提问by JCLL
It seems that GCC and LLVM-Clang are using handwritten recursive descent parsers, and notmachine generated, Bison-Flex based, bottom up parsing.
似乎 GCC 和 LLVM-Clang 正在使用手写递归下降解析器,而不是机器生成的、基于 Bison-Flex 的自底向上解析。
Could someone here please confirm that this is the case? And if so, why do mainstream compiler frameworks use handwritten parsers?
这里有人可以确认是这种情况吗?如果是这样,为什么主流编译器框架使用手写解析器?
采纳答案by Matthew Slattery
Yes:
是的:
GCC used a yacc (bison) parser once upon a time, but it was replaced with a hand-written recursive descent parser at some point in the 3.x series: see http://gcc.gnu.org/wiki/New_C_Parserfor links to relevant patch submissions.
Clang also uses a hand-written recursive descent parser: see the section "A single unified parser for C, Objective C, C++ and Objective C++" near the end of http://clang.llvm.org/features.html.
GCC使用YACC(野牛)解析器曾几何时,但它与在3.x系列的一些点手写递归下降解析器代替:看http://gcc.gnu.org/wiki/New_C_Parser为相关补丁提交的链接。
Clang 还使用手写的递归下降解析器:请参阅http://clang.llvm.org/features.html末尾附近的“C、Objective C、C++ 和 Objective C++ 的单一统一解析器”部分。
回答by Ira Baxter
There's a folk-theorem that says C is hard to parse, and C++ essentially impossible.
有一个民间定理说 C 很难解析,而 C++ 基本上是不可能的。
It isn't true.
这不是真的。
What is true is that C and C++ are pretty hard to parse using LALR(1) parsers without hacking the parsing machinery and tangling in symbol table data. GCC in fact used to parse them, using YACC and additional hackery like this, and yes it was ugly. Now GCC uses handwritten parsers, but still with the symbol table hackery. The Clang folks never tried to use automated parser generators; AFAIK the Clang parser has always been hand-coded recursive descent.
事实是,使用 LALR(1) 解析器很难解析 C 和 C++,而无需破解解析机制并在符号表数据中纠缠不清。事实上,GCC 曾经使用 YACC 和像这样的其他黑客来解析它们,是的,它很丑陋。 现在 GCC 使用手写解析器,但仍然使用符号表hacky。Clang 人从未尝试过使用自动解析器生成器;AFAIK Clang 解析器一直是手工编码的递归下降。
What is true, is that C and C++ are relatively easy to parse with stronger automatically generated parsers, e.g., GLR parsers, and you don't need any hacks. The ElsaC++ parser is one example of this. Our C++ Front Endis another (as are all our "compiler" front ends, GLR is pretty wonderful parsing technology).
事实是,使用更强大的自动生成解析器(例如GLR 解析器)解析 C 和 C++ 相对容易,而且您不需要任何技巧。所述埃尔莎C ++解析器是这样的一个例子。我们的C++ 前端是另一个(就像我们所有的“编译器”前端一样,GLR 是非常棒的解析技术)。
Our C++ front end isn't as fast as GCC's, and certainly slower than Elsa; we've put little energy into tuning it carefully because we have other more pressing issues (nontheless it has been used on millions of lines of C++ code). Elsa is likely slower than GCC simply because it is more general. Given processor speeds these days, these differences might not matter a lot in practice.
我们的 C++ 前端不如 GCC 快,当然也比 Elsa 慢;我们几乎没有投入精力仔细调整它,因为我们还有其他更紧迫的问题(尽管它已被用于数百万行 C++ 代码)。Elsa 可能比 GCC 慢,因为它更通用。鉴于如今的处理器速度,这些差异在实践中可能无关紧要。
But the "real compilers" that are widely distributed today have their roots in compilers of 10 or 20 years ago or more. Inefficiencies then mattered much more, and nobody had heard of GLR parsers, so people did what they knew how to do. Clang is certainly more recent, but then folk theorems retain their "persuasiveness" for a long time.
但是今天广泛分布的“真正的编译器”都源于 10 或 20 年前或更早的编译器。效率低下变得更加重要,没有人听说过 GLR 解析器,所以人们做他们知道如何做的事情。Clang 肯定是最近的,但是民间定理在很长一段时间内都保持着它们的“说服力”。
You don't have to do it that way anymore. You can very reasonably use GLR and other such parsers as front ends, with an improvement in compiler maintainability.
你不必再这样做了。您可以非常合理地使用 GLR 和其他此类解析器作为前端,从而提高编译器的可维护性。
What istrue, is that getting a grammar that matches your friendly neighborhood compiler's behavior is hard. While virtually all C++ compilers implement (most) of the original standard, they also tend have lots of dark corner extensions, e.g., DLL specifications in MS compilers, etc. If you have a strong parsing engine, you can spend your time trying to get the final grammar to match reality, rather than trying to bend your grammar to match the limitations of your parser generator.
什么是真实的,是让符合您的睦邻友好编译器的行为语法是很难的。虽然几乎所有 C++ 编译器都实现了(大部分)原始标准,但它们也往往有很多暗角扩展,例如 MS 编译器中的 DLL 规范等。如果您有强大的解析引擎,您可以花时间尝试获得匹配现实的最终语法,而不是试图改变语法以匹配解析器生成器的限制。
EDIT November 2012: Since writing this answer, we've improved our C++ front end to handle full C++11, including ANSI, GNU, and MS variant dialects. While there was lots of extra stuff, we don't have to change our parsing engine; we just revised the grammar rules. We didhave to change the semantic analysis; C++11 is semantically very complicated, and this work swamps the effort to get the parser to run.
2012 年 11 月编辑:自撰写此答案以来,我们改进了 C++ 前端以处理完整的 C++11,包括 ANSI、GNU 和 MS 变体方言。虽然有很多额外的东西,但我们不必改变我们的解析引擎;我们刚刚修改了语法规则。我们确实必须改变语义分析;C++11 在语义上非常复杂,这项工作淹没了让解析器运行的努力。
EDIT February 2015: ... now handles full C++14. (See get human readable AST from c++ codefor GLR parses of a simple bit of code, and C++'s infamous "most vexing parse").
编辑 2015 年 2 月:...现在处理完整的 C++14。(参见get human readable AST from c++ codefor GLR parses of a simple bit of code,以及 C++ 臭名昭著的“最令人烦恼的解析”)。
EDIT April 2017: Now handles (draft) C++17.
编辑 2017 年 4 月:现在处理(草案)C++17。
回答by Doug
Clang's parser is a hand-written recursive-descent parser, as are several other open-source and commercial C and C++ front ends.
Clang 的解析器是一个手写的递归下降解析器,其他几个开源和商业 C 和 C++ 前端也是如此。
Clang uses a recursive-descent parser for several reasons:
Clang 使用递归下降解析器有以下几个原因:
- Performance: a hand-written parser allows us to write a fast parser, optimizing the hot paths as needed, and we're always in control of that performance. Having a fast parser has allowed Clang to be used in other development tools where "real" parsers are typically not used, e.g., syntax highlighting and code completion in an IDE.
- Diagnostics and error recovery: because you're in full control with a hand-written recursive-descent parser, it's easy to add special cases that detect common problems and provide great diagnostics and error recovery (e.g., see http://clang.llvm.org/features.html#expressivediags) With automatically generated parsers, you're limited to the capabilities of the generator.
- Simplicity: recursive-descent parsers are easy to write, understand, and debug. You don't need to be a parsing expert or learn a new tool to extend/improve the parser (which is especially important for an open-source project), yet you can still get great results.
- 性能:手写解析器允许我们编写快速解析器,根据需要优化热路径,并且我们始终可以控制该性能。拥有快速解析器允许 Clang 用于其他通常不使用“真正”解析器的开发工具,例如,IDE 中的语法突出显示和代码完成。
- 诊断和错误恢复:因为您可以完全控制手写的递归下降解析器,所以很容易添加特殊情况来检测常见问题并提供出色的诊断和错误恢复(例如,请参阅http://clang.llvm .org/features.html#expressivediags) 使用自动生成的解析器,您只能使用生成器的功能。
- 简单性:递归下降解析器易于编写、理解和调试。您不需要成为解析专家或学习新工具来扩展/改进解析器(这对于开源项目尤其重要),但您仍然可以获得很好的结果。
Overall, for a C++ compiler, it just doesn't matter much: the parsing part of C++ is non-trivial, but it's still one of the easier parts, so it pays to keep it simple. Semantic analysis---particularly name lookup, initialization, overload resolution, and template instantiation---is orders of magnitude more complicated than parsing. If you want proof, go check out the distribution of code and commits in Clang's "Sema" component (for semantic analysis) vs. its "Parse" component (for parsing).
总体而言,对于 C++ 编译器来说,这并不重要:C++ 的解析部分并不重要,但它仍然是更容易的部分之一,因此保持简单是值得的。语义分析——尤其是名称查找、初始化、重载解析和模板实例化——比解析复杂几个数量级。如果您需要证据,请查看 Clang 的“Sema”组件(用于语义分析)与它的“Parse”组件(用于解析)中的代码分布和提交。
回答by Rafe Kettler
gcc's parser is handwritten.. I suspect the same for clang. This is probably for a few reasons:
gcc 的解析器是手写的。. 我怀疑 clang 也是如此。这大概有以下几个原因:
- Performance: something that you've hand-optimized for your particular task will almost always perform better than a general solution. Abstraction usually has a performance hit
- Timing: at least in the case of GCC, GCC predates a lot of free developer tools (came out in 1987). There was no free version of yacc, etc. at the time, which I'd imagine would've been a priority to the people at the FSF.
- 性能:您为特定任务手动优化的东西几乎总是比一般解决方案表现得更好。抽象通常会影响性能
- 时间:至少在 GCC 的情况下,GCC 早于许多免费的开发人员工具(1987 年问世)。当时还没有 yacc 等的免费版本,我想这将是 FSF 人员的优先考虑事项。
This is probably not a case of "not invented here" syndrome, but more along the lines of "there was nothing optimized specifically for what we needed, so we wrote our own".
这可能不是“这里没有发明”综合症的情况,而是“没有任何专门针对我们需要的优化,所以我们自己编写”的情况。
回答by reuns
Weird answers there!
那里有奇怪的答案!
C/C++ grammars aren't context free. They are context sensitive because of the Foo * bar; ambiguity. We have to build a list of typedefs to know if Foo is a type or not.
C/C++ 语法不是上下文无关的。由于 Foo * bar,它们是上下文敏感的;歧义。我们必须建立一个 typedef 列表来知道 Foo 是否是一个类型。
Ira Baxter: I don't see the point with your GLR thing. Why build a parse tree which comprises ambiguities. Parsing means solving ambiguities, building the syntax tree. You resolve these ambiguities in a second pass, so this isn't less ugly. For me it is far more ugly ...
艾拉·巴克斯特:我不认为你的 GLR 事情有什么意义。为什么要构建包含歧义的解析树。解析意味着解决歧义,构建语法树。您在第二遍中解决了这些歧义,所以这并不难看。对我来说,它丑得多......
Yacc is a LR(1) parser generator (or LALR(1)), but it can be easily modified to be context sensitive. And there is nothing ugly in it. Yacc/Bison has been created to help in parsing C language, so probably it isn't the ugliest tool to generate a C parser ...
Yacc 是一个 LR(1) 解析器生成器(或 LALR(1)),但它可以很容易地修改为上下文敏感。它没有什么丑陋的东西。Yacc/Bison 是为了帮助解析 C 语言而创建的,所以它可能不是生成 C 解析器的最丑陋的工具......
Until GCC 3.x the C parser is generated by yacc/bison, with typedefs table built during parsing. With "in parse" typedefs table building, C grammar becomes locally context free and furthermore "locally LR(1)".
在 GCC 3.x 之前,C 解析器由 yacc/bison 生成,并在解析过程中构建了 typedefs 表。通过“in parse”typedefs 表构建,C 语法变得与本地上下文无关,而且“本地 LR(1)”。
Now, in Gcc 4.x, it is a recursive descent parser. It is exactly the same parser as in Gcc 3.x, it is still LR(1), and has the same grammar rules. The difference is that the yacc parser has been hand rewritten, the shift/reduce are now hidden in the call stack, and there is no "state454 : if (nextsym == '(') goto state398" as in gcc 3.x yacc's parser, so it is easier to patch, handle errors and print nicer messages, and to perform some of the next compiling steps during parsing. At the price of much less "easy to read" code for a gcc noob.
现在,在 Gcc 4.x 中,它是一个递归下降解析器。它与 Gcc 3.x 中的解析器完全相同,仍然是 LR(1),并且具有相同的语法规则。不同之处在于 yacc 解析器已被手动重写,shift/reduce 现在隐藏在调用堆栈中,并且没有 gcc 3.x yacc 中的“state454 : if (nextsym == '(') goto state398”)解析器,因此更容易修补、处理错误和打印更好的消息,并在解析过程中执行一些下一个编译步骤。代价是 gcc 菜鸟的“易于阅读”代码少得多。
Why did they switched from yacc to recursive descent? Because it is quite necessary to avoid yacc to parse C++, and because GCC dreams to be multi language compiler, i.e. sharing maximum of code between the different languages it can compile. This is why the C++ and the C parser are written in the same way.
为什么他们从 yacc 切换到递归下降?因为避免yacc解析C++是很有必要的,而且因为GCC梦想成为多语言编译器,即在它可以编译的不同语言之间共享最多的代码。这就是 C++ 和 C 解析器以相同方式编写的原因。
C++ is harder to parse than C because it isn't "locally" LR(1) as C, it is not even LR(k).
Look at func<4 > 2>which is a template function instantiated with 4 > 2, i.e. func<4 > 2>has to be read as func<1>. This is definitely not LR(1). Now consider, func<4 > 2 > 1 > 3 > 3 > 8 > 9 > 8 > 7 > 8>. This is where a recursive descent can easily solve ambiguity, at the price of a few more function calls (parse_template_parameter is the ambiguous parser function. If parse_template_parameter(17tokens) failed, try again parse_template_parameter(15tokens), parse_template_parameter(13tokens)
... until it works).
C++ 比 C 更难解析,因为它不是“本地”LR(1) 作为 C,它甚至不是 LR(k)。看func<4 > 2>哪个是用4>2实例化的模板函数,即func<4 > 2>必须读作func<1>。这绝对不是LR(1)。现在考虑,func<4 > 2 > 1 > 3 > 3 > 8 > 9 > 8 > 7 > 8>。这是递归下降可以轻松解决歧义的地方,代价是多调用几个函数(parse_template_parameter 是歧义解析器函数。如果 parse_template_parameter(17tokens) 失败,请重试 parse_template_parameter(15tokens), parse_template_parameter(13tokens) ... 直到有用)。
I don't know why it wouldn't be possible to add into yacc/bison recursive sub grammars, maybe this will be the next step in gcc/GNU parser development?
我不知道为什么不能添加到 yacc/bison 递归子语法中,也许这将是 gcc/GNU 解析器开发的下一步?
回答by Vanessa McHale
It seems that GCC and LLVM-Clang are using handwritten recursive descent parsers, and not machine generated, Bison-Flex based, bottom up parsing.
似乎 GCC 和 LLVM-Clang 正在使用手写递归下降解析器,而不是机器生成的、基于 Bison-Flex 的自底向上解析。
Bison in particular I don't think can handle the grammar without parsing some things ambiguously and doing a second pass later.
特别是 Bison,我不认为可以在不模糊地解析某些内容并稍后再做第二遍的情况下处理语法。
I know Haskell's Happy allows for monadic (i.e. state-dependent) parsers that can resolve the particular issue with C syntax, but I know of no C parser generators that allow a user-supplied state monad.
我知道 Haskell 的 Happy 允许使用 monadic(即状态相关)解析器来解决 C 语法的特定问题,但我知道没有允许用户提供状态 monad 的 C 解析器生成器。
In theory, error recovery would be a point in favor of a handwritten parser, but my experience with GCC/Clang has been that the error messages are not particularly good.
理论上,错误恢复将是一个支持手写解析器的点,但我对 GCC/Clang 的经验是错误消息不是特别好。
As for performance - some of the claims seem unsubstantiated. Generating a big state machine using a parser generator should result in something that's O(n)and I doubt parsing is the bottleneck in much tooling.
至于性能——有些说法似乎没有根据。使用解析器生成器生成一个大状态机应该会导致一些东西,O(n)我怀疑解析是许多工具的瓶颈。

