C++ 是上下文无关的还是上下文敏感的?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/14589346/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-27 18:29:00  来源:igfitidea点击:

Is C++ context-free or context-sensitive?

c++syntaxgrammarcontext-free-grammarcontext-sensitive-grammar

提问by fredoverflow

I often hear claims that C++ is a context-sensitive language. Take the following example:

我经常听到声称 C++ 是一种上下文敏感的语言。以下面的例子为例:

a b(c);

Is this a variable definition or a function declaration? That depends on the meaning of the symbol c. If cis a variable, then a b(c);defines a variable named bof type a. It is directly initialized with c. But if cis a type, then a b(c);declares a function named bthat takes a cand returns an a.

这是变量定义还是函数声明?这取决于符号的含义c。如果c变量,则a b(c);定义一个名为btype的变量a。它直接用 初始化c。但是如果c是一个类型,则a b(c);声明一个名为的函数b,它接受 ac并返回一个a

If you look up the definition of context-free languages, it will basically tell you that all grammar rules must have left-hand sides that consist of exactly one non-terminal symbol. Context-sensitive grammars, on the other hand, allow arbitrary strings of terminal and non-terminal symbols on the left-hand side.

如果您查看上下文无关语言的定义,它基本上会告诉您所有语法规则的左侧都必须由一个非终结符组成。另一方面,上下文相关文法允许在左侧使用任意的终结符和非终结符字符串。

Browsing through Appendix A of "The C++ Programming Language", I couldn't find a single grammar rule that had anything else besides a single non-terminal symbol on its left-hand side. That would imply that C++ is context-free. (Of course, every context-free language is also context-sensitive in the sense that the context-free languages form a subset of the context-sensitive languages, but that is not the point.)

浏览“C++ 编程语言”的附录 A,我找不到一条语法规则除了左侧的单个非终结符之外还有其他任何内容。这意味着 C++ 是上下文无关的。(当然,每个上下文无关语言也是上下文敏感的,因为上下文无关语言构成上下文敏感语言的子集,但这不是重点。)

So, is C++ context-free or context-sensitive?

那么,C++ 是上下文无关的还是上下文敏感的?

采纳答案by rici

Below is my (current) favorite demonstration of why parsing C++ is (probably) Turing-complete, since it shows a program which is syntactically correct if and only if a given integer is prime.

下面是我(当前)最喜欢的为什么解析 C++ 是(可能)图灵完备的演示,因为它显示了一个程序,当且仅当给定整数是素数时,该程序在语法上是正确的。

So I assert that C++ is neither context-free nor context-sensitive.

所以我断言C++ 既不是 context-free 也不是 context-sensitive

If you allow arbitrary symbol sequences on both sides of any production, you produce an Type-0 grammar ("unrestricted") in the Chomsky hierarchy, which is more powerful than a context-sensitive grammar; unrestricted grammars are Turing-complete. A context-sensitive (Type-1) grammar allows multiple symbols of context on the left hand side of a production, but the same context must appear on the right hand side of the production (hence the name "context-sensitive"). [1] Context-sensitive grammars are equivalent to linear-bounded Turing machines.

如果在任何产生式的两侧都允许任意符号序列,则在Chomsky 层次结构中产生了一个 Type-0 文法(“无限制”),它比上下文敏感文法更强大;无限制语法是图灵完备的。上下文敏感(Type-1)语法允许在产生式的左侧出现多个上下文符号,但相同的上下文必须出现在产生式的右侧(因此名称为“上下文敏感”)。[1] 上下文相关文法等价于线性有界图灵机

In the example program, the prime computation could be performed by a linear-bounded Turing machine, so it does not quite prove Turing equivalence, but the important part is that the parser needs to perform the computation in order to perform syntactic analysis. It could have been any computation expressible as a template instantiation and there is every reason to believe that C++ template instantiation is Turing-complete. See, for example, Todd L. Veldhuizen's 2003 paper.

在示例程序中,素数计算可以由线性有界图灵机进行,因此并不能完全证明图灵等价,但重要的是解析器需要进行计算才能进行句法分析。它可以是任何可表示为模板实例化的计算,并且有充分的理由相信 C++ 模板实例化是图灵完备的。例如,参见Todd L. Veldhuizen 2003 年的论文

Regardless, C++ can be parsed by a computer, so it could certainly be parsed by a Turing machine. Consequently, an unrestricted grammar could recognize it. Actually writing such a grammar would be impractical, which is why the standard doesn't try to do so. (See below.)

无论如何,C++可以被计算机解析,所以它当然可以被图灵机解析。因此,不受限制的语法可以识别它。实际上编写这样的语法是不切实际的,这就是标准不尝试这样做的原因。(见下文。)

The issue with "ambiguity" of certain expressions is mostly a red herring. To start with, ambiguity is a feature of a particular grammar, not a language. Even if a language can be proven to have no unambiguous grammars, if it can be recognized by a context-free grammar, it's context-free. Similarly, if it cannot be recognized by a context-free grammar but it can be recognized by a context-sensitive grammar, it's context-sensitive. Ambiguity is not relevant.

某些表达的“歧义”问题主要是一个红鲱鱼。首先,歧义是特定语法的特征,而不是语言。即使一种语言可以被证明没有明确的语法,如果它可以被上下文无关的语法识别,它就是上下文无关的。同样,如果它不能被上下文无关文法识别但可以被上下文敏感文法识别,则它是上下文敏感的。歧义是不相关的。

But in any event, like line 21 (i.e. auto b = foo<IsPrime<234799>>::typen<1>();) in the program below, the expressions are not ambiguous at all; they are simply parsed differently depending on context. In the simplest expression of the issue, the syntactic category of certain identifiers is dependent on how they have been declared (types and functions, for example), which means that the formal language would have to recognize the fact that two arbitrary-length strings in the same program are identical (declaration and use). This can be modelled by the "copy" grammar, which is the grammar which recognizes two consecutive exact copies of the same word. It's easy to prove with the pumping lemmathat this language is not context-free. A context-sensitive grammar for this language is possible, and a Type-0 grammar is provided in the answer to this question: https://math.stackexchange.com/questions/163830/context-sensitive-grammar-for-the-copy-language.

但无论如何,就像auto b = foo<IsPrime<234799>>::typen<1>();下面程序中的第 21 行(即),表达式完全没有歧义;它们只是根据上下文进行不同的解析。在这个问题的最简单表达中,某些标识符的句法类别取决于它们的声明方式(例如类型和函数),这意味着形式语言必须认识到以下事实:两个任意长度的字符串在相同的程序是相同的(声明和使用)。这可以通过“复制”语法建模,该语法识别同一个单词的两个连续的精确副本。用抽水引理很容易证明这种语言不是上下文无关的。该语言的上下文敏感语法是可能的,并且在此问题的答案中提供了 Type-0 语法:https: //math.stackexchange.com/questions/163830/context-sensitive-grammar-for-the-复制语言

If one were to attempt to write a context-sensitive (or unrestricted) grammar to parse C++, it would quite possibly fill the universe with scribblings. Writing a Turing machine to parse C++ would be an equally impossible undertaking. Even writing a C++ program is difficult, and as far as I know none have been proven correct. This is why the standard does not attempt to provide a complete formal grammar, and why it chooses to write some of the parsing rules in technical English.

如果有人试图编写一种上下文敏感(或不受限制)的语法来解析 C++,它很可能会用涂鸦填满整个宇宙。编写一个图灵机来解析 C++ 将是一项同样不可能的任务。即使编写 C++ 程序也很困难,据我所知,没有一个被证明是正确的。这就是为什么该标准不尝试提供完整的形式语法,以及为什么它选择用技术英语编写一些解析规则的原因。

What looks like a formal grammar in the C++ standard is not the complete formal definition of the syntax of the C++ language. It's not even the complete formal definition of the language after preprocessing, which might be easier to formalize. (That wouldn't be the language, though: the C++ language as defined by the standard includes the preprocessor, and the operation of the preprocessor is described algorithmically since it would be extremely hard to describe in any grammatical formalism. It is in that section of the standard where lexical decomposition is described, including the rules where it must be applied more than once.)

在 C++ 标准中看起来像正式语法的东西并不是 C++ 语言语法的完整正式定义。它甚至不是预处理后语言的完整正式定义,这可能更容易形式化。(不过,这不是语言:标准定义的 C++ 语言包括预处理器,并且预处理器的操作是通过算法描述的,因为在任何语法形式主义中都很难描述。它在该部分描述词法分解的标准,包括必须多次应用的规则。)

The various grammars (two overlapping grammars for lexical analysis, one which takes place before preprocessing and the other, if necessary, afterwards, plus the "syntactic" grammar) are collected in Appendix A, with this important note (emphasis added):

附录 A 中收集了各种语法(用于词法分析的两种重叠语法,一种发生在预处理之前,另一种在必要时在处理之后加上“句法”语法),并附有以下重要说明(强调):

This summary of C++ syntax is intended to be an aid to comprehension. It is not an exact statement of the language. In particular, the grammar described here accepts a superset of valid C++ constructs. Disambiguation rules (6.8, 7.1, 10.2) must be applied to distinguish expressions from declarations. Further, access control, ambiguity, and type rules must be used to weed out syntactically valid but meaningless constructs.

此 C++ 语法摘要旨在帮助理解。它不是语言的精确陈述。特别是,这里描述的语法接受有效 C++ 结构超集。必须应用消歧规则(6.8、7.1、10.2)来区分表达式和声明。此外,必须使用访问控制、歧义和类型规则来清除语法上有效但无意义的结构。

Finally, here's the promised program. Line 21 is syntactically correct if and only if the N in IsPrime<N>is prime. Otherwise, typenis an integer, not a template, so typen<1>()is parsed as (typen<1)>()which is syntactically incorrect because ()is not a syntactically valid expression.

最后,这是承诺的程序。当且仅当 NIsPrime<N>是素数时,第 21 行在语法上是正确的。否则,typen是整数,而不是模板,因此typen<1>()被解析为(typen<1)>()语法错误,因为()它不是语法有效的表达式。

template<bool V> struct answer { answer(int) {} bool operator()(){return V;}};

template<bool no, bool yes, int f, int p> struct IsPrimeHelper
  : IsPrimeHelper<p % f == 0, f * f >= p, f + 2, p> {};
template<bool yes, int f, int p> struct IsPrimeHelper<true, yes, f, p> { using type = answer<false>; };
template<int f, int p> struct IsPrimeHelper<false, true, f, p> { using type = answer<true>; };

template<int I> using IsPrime = typename IsPrimeHelper<!(I&1), false, 3, I>::type;
template<int I>
struct X { static const int i = I; int a[i]; }; 

template<typename A> struct foo;
template<>struct foo<answer<true>>{
  template<int I> using typen = X<I>;
};
template<> struct foo<answer<false>>{
  static const int typen = 0;
};

int main() {
  auto b = foo<IsPrime<234799>>::typen<1>(); // Syntax error if not prime
  return 0;
}


[1] To put it more technically, every production in a context-sensitive grammar must be of the form:

[1] 更专业地说,上下文相关文法中的每个产生式都必须采用以下形式:

αAβ → αγβ

αAβ → αγβ

where Ais a non-terminal and α, βare possibly empty sequences of grammar symbols, and γis a non-empty sequence. (Grammar symbols may be either terminals or non-terminals).

其中A是非终结符and αβ可能是文法符号的空序列,γ是非空序列。(语法符号可以是终结符也可以是非终结符)。

This can be read as A → γonly in the context [α, β]. In a context-free (Type 2) grammar, αand βmust be empty.

A → γ只能在上下文中阅读[α, β]。在上下文无关(类型 2)语法中,α并且β必须为空。

It turns out that you can also restrict grammars with the "monotonic" restriction, where every production must be of the form:

事实证明,您还可以使用“单调”限制来限制语法,其中每个产生式都必须采用以下形式:

α → βwhere |α| ≥ |β| > 0  (|α|means "the length of α")

α → β其中|α| ≥ |β| > 0  (|α|表示“”的长度α)

It's possible to prove that the set of languages recognized by monotonic grammars is exactly the same as the set of languages recognized by context-sensitive grammars, and it's often the case that it's easier to base proofs on monotonic grammars. Consequently, it's pretty common to see "context-sensitive" used as though it meant "monotonic".

可以证明单调文法识别的语言集与上下文相关文法识别的语言集完全相同,并且通常情况下,更容易基于单调文法进行证明。因此,经常会看到“上下文敏感”被用作“单调”的意思。

回答by jpalecek

First, you rightly observed there are no context sensitive rules in the grammar at the end of the C++ standard, so that grammar iscontext-free.

首先,您正确地观察到 C++ 标准末尾的语法中没有上下文敏感规则,因此该语法上下文无关的。

However, that grammar doesn't precisely describe the C++ language, because it produces non-C++ programs such as

但是,该语法并不能准确地描述 C++ 语言,因为它生成了非 C++ 程序,例如

int m() { m++; }

or

或者

typedef static int int;

The C++ language defined as "the set of well-formed C++ programs" is not context-free (it's possible to show that merely demanding variables to be declared makes it so). Given you can theoretically write Turing-complete programs in templates and make a program ill-formed based on their result, it's not even context-sensitive.

定义为“一组结构良好的 C++ 程序”的 C++ 语言不是上下文无关的(可以证明仅要求声明变量就可以实现)。鉴于理论上您可以在模板中编写图灵完备的程序,并根据它们的结果使程序格式错误,它甚至不是上下文敏感的。

Now, (ignorant) people (usually not language theorists, but parser designers) typically use "not context-free" in some of the following meanings

现在,(无知的)人们(通常不是语言理论家,而是解析器设计者)通常在以下某些含义中使用“非上下文无关”

  • ambiguous
  • can't be parsed with Bison
  • not LL(k), LR(k), LALR(k) or whatever parser-defined language class they chose
  • 模糊的
  • 不能用 Bison 解析
  • 不是 LL(k)、LR(k)、LALR(k) 或他们选择的任何解析器定义的语言类

The grammar at the back of the standard doesn't satisfy these categories (i.e. it is ambiguous, not LL(k)...) so C++ grammar is "not context-free" for them. And in a sense, they're right it's damn well hard to produce a working C++ parser.

标准后面的语法不满足这些类别(即它是模棱两可的,而不是 LL(k)...)所以 C++ 语法对它们来说“不是上下文无关的”。从某种意义上说,他们是对的,产生一个有效的 C++ 解析器非常困难。

Note that the properties here used are only weakly connected to context-free languages - ambiguity doesn't have anything to do with context-sensitivity (in fact, context-sensitive rules typically help disambiguate productions), the other two are merely subsets of context-free languages. And parsing context-free languages is not a linear process (although parsing deterministic ones is).

请注意,这里使用的属性仅与上下文无关语言弱相关 - 歧义与上下文敏感无关(实际上,上下文敏感规则通常有助于消除产生式的歧义),另外两个只是上下文的子集- 自由语言。解析上下文无关语言不是一个线性过程(尽管解析确定性语言是)。

回答by Sam Harwell

Yes. The following expression has a different order of operationsdepending on type resolved context:

是的。根据类型解析的上下文,以下表达式具有不同的操作顺序

Edit: When the actual order of operation varies, it makes it incredibly difficult to use a "regular" compiler that parses to an undecorated AST before decorating it (propagating type information). Other context sensitive things mentioned are "rather easy" compared to this (not that template evaluation is at all easy).

编辑:当实际的操作顺序发生变化时,使用“常规”编译器在装饰它之前解析为未装饰的 AST(传播类型信息)变得非常困难。与此相比,提到的其他上下文敏感的东西“相当容易”(模板评估根本不简单)。

#if FIRST_MEANING
   template<bool B>
   class foo
   { };
#else
   static const int foo = 0;
   static const int bar = 15;
#endif

Followed by:

其次是:

static int foobar( foo < 2 ? 1 < 1 : 0 > & bar );

回答by Dan

To answer your question, you need to distinguish two different questions.

要回答您的问题,您需要区分两个不同的问题。

  1. The mere syntax of almost every programming language is context-free. Typically, it is given as an extended Backus-Naur form or context-free gramar.

  2. However, even if a program conforms with the context-free gramar defined by the programming language, it is not necessarily a validprogram. There are many non-context-free poperties that a program has to satisfy in order to be a valid program. E.g., the most simple such property is the scope of variables.

  1. 几乎每种编程语言的语法都是上下文无关的。通常,它以扩展的 Backus-Naur 形式或上下文无关文法形式给出。

  2. 但是,即使程序符合编程语言定义的上下文无关语法,也不一定是有效的程序。程序必须满足许多非上下文无关的属性才能成为有效程序。例如,最简单的此类属性是变量的作用域。

To conclude, whether or not C++ is context-free depends on the question you ask.

总而言之,C++ 是否是上下文无关的取决于您提出的问题。

回答by Dan

You might want to take a look at The Design & Evolution of C++, by Bjarne Stroustrup. In it he describes his problems trying to use yacc (or similar) to parse an early version of C++, and wishing he had used recursive descent instead.

您可能想看看Bjarne Stroustrup 的The Design & Evolution of C++。在其中,他描述了他在尝试使用 yacc(或类似的)来解析早期版本的 C++ 时遇到的问题,并希望他使用递归下降来代替。

回答by Calmarius

Yeah C++ is context sensitive, very context sensitive. You cannot build the syntax tree by simply parsing through the file using a context free parser because in some cases you need to know the symbol from previous knowledge to decide (ie. build a symbol table while parsing).

是的,C++ 是上下文敏感的,对上下文非常敏感。您不能通过使用上下文无关解析器简单地解析文件来构建语法树,因为在某些情况下,您需要从先前的知识中知道符号来决定(即在解析时构建符号表)。

First example:

第一个例子:

A*B;

Is this a multiplication expression?

这是乘法表达式吗?

OR

或者

Is this a declaration of Bvariable to be a pointer of type A?

这是B变量声明为类型的指针A吗?

If A is a variable, then it's an expression, if A is type, it's a pointer declaration.

如果 A 是变量,那么它就是一个表达式,如果 A 是类型,那么它就是一个指针声明。

Second example:

第二个例子:

A B(bar);

Is this a function prototype taking an argument of bartype?

这是一个带有bar类型参数的函数原型吗?

OR

或者

Is this declare variable Bof type Aand calls A's constructor with barconstant as an initializer?

这是声明B类型的变量A并使用bar常量作为初始值设定项调用 A 的构造函数吗?

You need to know again whether baris a variable or a type from symbol table.

您需要再次知道bar是符号表中的变量还是类型。

Third example:

第三个例子:

class Foo
{
public:
    void fn(){x*y;}
    int x, y;
};

This is the case when building symbol table while parsing does not help because the declaration of x and y comes after the function definition. So you need to scan through the class definition first, and look at the method definitions in a second pass, to tell x*y is an expression, and not a pointer declaration or whatever.

这是在解析时构建符号表无济于事的情况,因为 x 和 y 的声明在函数定义之后。所以你需要先浏览类定义,然后在第二遍中查看方法定义,告诉 x*y 是一个表达式,而不是一个指针声明或其他什么。

回答by AraK

C++ is parsed with GLR parser. That means during parsing the source code, the parser mayencounter ambiguity but it should continue and decide which grammar rule to use later.

C++ 使用 GLR 解析器进行解析。这意味着在解析源代码的过程中,解析器可能会遇到歧义,但它应该继续并决定稍后使用哪个语法规则。

look also,

也看看,

Why C++ cannot be parsed with a LR(1) parser?

为什么 C++ 不能用 LR(1) 解析器解析?



Remember that context-free grammar can notdescribe ALLthe rules of a programming language syntax. For example, Attribute grammar is used to check the validity of an expression type.

请记住,上下文无关语法不能描述编程语言语法的所有规则。例如,属性语法用于检查表达式类型的有效性。

int x;
x = 9 + 1.0;

You can notdescribe the following rule with context-free grammar : The Right Side of the assignment should be of the same type of the Left Hand side.

不能用上下文无关语法描述以下规则: 赋值的右侧应该与左侧的类型相同。

回答by Omri Barel

I have a feeling that there's some confusion between the formal definition of "context-sensitive" and the informal use of "context-sensitive". The former has a well-defined meaning. The latter is used for saying "you need context in order to parse the input".

我有一种感觉,“上下文敏感”的正式定义和“上下文敏感”的非正式使用之间存在一些混淆。前者有明确的含义。后者用于表示“您需要上下文才能解析输入”。

This is also asked here: Context-sensitivity vs Ambiguity.

这也在这里问: Context-sensitive vs Ambiguity

Here's a context-free grammar:

这是一个上下文无关的语法:

<a> ::= <b> | <c>
<b> ::= "x"
<c> ::= "x"

It's ambiguous, so in order to parse the input "x" you need some context (or live with the ambiguity, or emit "Warning: E8271 - Input is ambiguous in line 115"). But it's certainly not a context-sensitive grammar.

它是模棱两可的,所以为了解析输入“x”,你需要一些上下文(或者忍受歧义,或者发出“警告:E8271 - 输入在第 115 行不明确”)。但它肯定不是上下文敏感的语法。

回答by James Jones

No Algol-like language is context-free, because they have rules that constrain expressions and statements that identifiers can appear in based on their type, and because there's no limit on the number of statements that can occur between declaration and use.

没有任何类似 Algol 的语言是上下文无关的,因为它们具有根据类型限制标识符可以出现在其中的表达式和语句的规则,并且因为在声明和使用之间可以出现的语句数量没有限制。

The usual solution is to write a context-free parser that actually accepts a superset of valid programs and put the context-sensitive portions in ad hoc"semantic" code attached to rules.

通常的解决方案是编写一个上下文无关的解析器,它实际上接受有效程序的超集,并将上下文敏感的部分放在附加到规则的临时“语义”代码中。

C++ goes well beyond this, thanks to its Turing-complete template system. See Stack Overflow Question 794015.

由于其图灵完备的模板系统,C++ 远不止于此。请参阅堆栈溢出问题 794015

回答by Jerry Coffin

The productions in the C++ standard are written context-free, but as we all know don't really define the language precisely. Some of what most people see as ambiguity in the current language could (I believe) be resolved unambiguously with a context sensitive grammar.

C++ 标准中的产生式编写为上下文无关的,但我们都知道并没有真正准确地定义语言。大多数人认为当前语言中的一些歧义可以(我相信)用上下文敏感的语法明确解决。

For the most obvious example, let's consider the Most Vexing Parse: int f(X);. If Xis a value, then this defines fas a variable that will be initialized with X. If Xis a type, it defines fas a function taking a single parameter of type X.

对于最明显的例子,让我们考虑最烦人的解析:int f(X);。如果X是一个值,那么它定义f为一个将用 初始化的变量X。IfX是一个类型,它定义f为一个函数,它带有一个 type 的参数X

Looking at that from a grammatical viewpoint, we could view it like this:

从语法的角度来看,我们可以这样看:

A variable_decl ::= <type> <identifier> '(' initializer ')' ';'

B function_decl ::= <type> <identifier> '(' param_decl ')' ';'

A ::= [declaration of X as value]
B ::= [declaration of X as type]

Of course, to be entirely correct we'd need to add some extra "stuff" to account for the possibility of intervening declarations of other types (i.e., A and B should both really be "declarations including declaration of X as...", or something on that order).

当然,为了完全正确,我们需要添加一些额外的“东西”来解释干预其他类型声明的可能性(即,A 和 B 都应该是“声明,包括将 X 声明为……” ,或该订单上的其他东西)。

This is still rather different from a typical CSG though (or at least what I recall of them). This depends on a symbol table being constructed -- the part that specifically recognizes Xas a type or value, not just some type of statement preceding this, but the correct type of statement for the right symbol/identifier.

尽管如此(或者至少我记得它们),这仍然与典型的 CSG 有很大不同。这取决于正在构建的符号表——专门识别X类型或值的部分,不仅仅是在此之前的某种类型的语句,而是正确符号/标识符的正确语句类型。

As such, I'd have to do some looking to be sure, but my immediate guess is that this doesn't really qualify as a CSG, at least as the term is normally used.

因此,我必须做一些事情来确定,但我的直接猜测是,这并不真正符合 CSG 的要求,至少在通常使用的术语中是这样。