Poor man's "lexer" for C#

Question

Asked by Paul Hollingsworth

I'm trying to write a very simple parser in C#.

I need a lexer -- something that lets me associate regular expressions with tokens, so it reads in regexs and gives me back symbols.

It seems like I ought to be able to use Regex to do the actual heavy lifting, but I can't see an easy way to do it. For one thing, Regex only seems to work on strings, not streams (why is that!?!?).

Basically, I want an implementation of the following interface:

interface ILexer : IDisposable
{
    /// <summary>
    /// Return true if there are more tokens to read
    /// </summary>
    bool HasMoreTokens { get; }
    /// <summary>
    /// The actual contents that matched the token
    /// </summary>
    string TokenContents { get; }
    /// <summary>
    /// The particular token in "tokenDefinitions" that was matched (e.g. "STRING", "NUMBER", "OPEN PARENS", "CLOSE PARENS"
    /// </summary>
    object Token { get; }
    /// <summary>
    /// Move to the next token
    /// </summary>
    void Next();
}

interface ILexerFactory
{
    /// <summary>
    /// Create a Lexer for converting a stream of characters into tokens
    /// </summary>
    /// <param name="reader">TextReader that supplies the underlying stream</param>
    /// <param name="tokenDefinitions">A dictionary from regular expressions to their "token identifers"</param>
    /// <returns>The lexer</returns>
    ILexer CreateLexer(TextReader reader, IDictionary<string, object> tokenDefinitions);
}

So, pluz send the codz...
No, seriously, I am about to start writing an implementation of the above interface yet I find it hard to believe that there isn't some simple way of doing this in .NET (2.0) already.

So, any suggestions for a simple way to do the above? (Also, I don't want any "code generators". Performance is not important for this thing and I don't want to introduce any complexity into the build process.)

Answer 1

Accepted answer by Paul Hollingsworth

The original version I posted here as an answer had a problem in that it only worked while there was more than one "Regex" that matched the current expression. That is, as soon as only one Regex matched, it would return a token - whereas most people want the Regex to be "greedy". This was especially the case for things such as "quoted strings".
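
To illustrate why greediness matters, here is a minimal sketch of longest-match ("maximal munch") selection. This is a different rule from the code later in this answer, which takes the first definition that matches, so order matters there. The LongestMatch class, its Pick method, and the two-entry token table are hypothetical names made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LongestMatch
{
    // Hypothetical token table for illustration: INT is listed before FLOAT on purpose.
    // \G pins each match to the start position passed to Match().
    private static readonly (string Name, Regex Pattern)[] Defs =
    {
        ("INT",   new Regex(@"\G[-+]?\d+")),
        ("FLOAT", new Regex(@"\G[-+]?\d*\.\d+")),
    };

    // Returns the name and text of the longest match starting at 'pos',
    // or ("", "") when no definition matches there.
    public static (string Name, string Text) Pick(string input, int pos)
    {
        string bestName = "";
        string bestText = "";
        foreach (var (name, pattern) in Defs)
        {
            Match m = pattern.Match(input, pos);
            if (m.Success && m.Length > bestText.Length)
            {
                bestName = name;
                bestText = m.Value;
            }
        }
        return (bestName, bestText);
    }
}
```

Calling LongestMatch.Pick("-43.2", 0) yields FLOAT with the text -43.2 even though INT is listed first, because the FLOAT match is longer. A lexer that stops at the first partial match would instead report INT for -43.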

The only solution that sits on top of Regex is to read the input line-by-line (which means you cannot have tokens that span multiple lines). I can live with this - it is, after all, a poor man's lexer! Besides, it's usually useful to get line number information out of the Lexer in any case.

So, here's a new version that addresses these issues. Credit also goes to this

using System;
using System.IO;
using System.Text.RegularExpressions;

public interface IMatcher
{
    /// <summary>
    /// Return the number of characters that this "regex" or equivalent
    /// matches.
    /// </summary>
    /// <param name="text">The text to be matched</param>
    /// <returns>The number of characters that matched</returns>
    int Match(string text);
}

sealed class RegexMatcher : IMatcher
{
    private readonly Regex regex;
    // Wrap the pattern in a non-capturing group so the ^ anchor applies to every alternative.
    public RegexMatcher(string regex) => this.regex = new Regex("^(?:" + regex + ")");

    public int Match(string text)
    {
        var m = regex.Match(text);
        return m.Success ? m.Length : 0;
    }
    public override string ToString() => regex.ToString();
}

public sealed class TokenDefinition
{
    public readonly IMatcher Matcher;
    public readonly object Token;

    public TokenDefinition(string regex, object token)
    {
        this.Matcher = new RegexMatcher(regex);
        this.Token = token;
    }
}

public sealed class Lexer : IDisposable
{
    private readonly TextReader reader;
    private readonly TokenDefinition[] tokenDefinitions;

    private string lineRemaining;

    public Lexer(TextReader reader, TokenDefinition[] tokenDefinitions)
    {
        this.reader = reader;
        this.tokenDefinitions = tokenDefinitions;
        nextLine();
    }

    private void nextLine()
    {
        do
        {
            lineRemaining = reader.ReadLine();
            ++LineNumber;
            Position = 0;
        } while (lineRemaining != null && lineRemaining.Length == 0);
    }

    public bool Next()
    {
        if (lineRemaining == null)
            return false;
        foreach (var def in tokenDefinitions)
        {
            var matched = def.Matcher.Match(lineRemaining);
            if (matched > 0)
            {
                Position += matched;
                Token = def.Token;
                TokenContents = lineRemaining.Substring(0, matched);
                lineRemaining = lineRemaining.Substring(matched);
                if (lineRemaining.Length == 0)
                    nextLine();

                return true;
            }
        }
        throw new Exception(string.Format("Unable to match against any tokens at line {0} position {1} \"{2}\"",
                                          LineNumber, Position, lineRemaining));
    }

    public string TokenContents { get; private set; }
    public object Token   { get; private set; }
    public int LineNumber { get; private set; }
    public int Position   { get; private set; }

    public void Dispose() => reader.Dispose();
}

Example program:

string sample = @"( one (two 456 -43.2 "" \"" quoted"" ))";

var defs = new TokenDefinition[]
{
    // Thanks to Steven Levithan for this great quoted-string regex.
    // \1 refers back to the opening quote character, so the match runs to the matching close quote.
    new TokenDefinition(@"([""'])(?:\\\1|.)*?\1", "QUOTED-STRING"),
    // Thanks to http://www.regular-expressions.info/floatingpoint.html
    new TokenDefinition(@"[-+]?\d*\.\d+([eE][-+]?\d+)?", "FLOAT"),
    new TokenDefinition(@"[-+]?\d+", "INT"),
    new TokenDefinition(@"#t", "TRUE"),
    new TokenDefinition(@"#f", "FALSE"),
    new TokenDefinition(@"[*<>\?\-+/A-Za-z->!]+", "SYMBOL"),
    new TokenDefinition(@"\.", "DOT"),
    new TokenDefinition(@"\(", "LEFT"),
    new TokenDefinition(@"\)", "RIGHT"),
    new TokenDefinition(@"\s", "SPACE")
};

TextReader r = new StringReader(sample);
Lexer l = new Lexer(r, defs);
while (l.Next())
    Console.WriteLine("Token: {0} Contents: {1}", l.Token, l.TokenContents);

Output:

Token: LEFT Contents: (
Token: SPACE Contents:
Token: SYMBOL Contents: one
Token: SPACE Contents:
Token: LEFT Contents: (
Token: SYMBOL Contents: two
Token: SPACE Contents:
Token: INT Contents: 456
Token: SPACE Contents:
Token: FLOAT Contents: -43.2
Token: SPACE Contents:
Token: QUOTED-STRING Contents: " \" quoted"
Token: SPACE Contents:
Token: RIGHT Contents: )
Token: RIGHT Contents: )
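
A parser consuming this output usually wants to ignore the SPACE tokens and know where each token came from. The following is a minimal sketch of that idea, not part of the original answer: TokenStream, Read, and the skipToken parameter are hypothetical names. It keeps the first-match-wins rule of the Lexer above but uses the \G anchor with Regex.Match(string, int) so the line does not have to be re-sliced with Substring after every token.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public static class TokenStream
{
    // Yields (token, contents, line, column) for every token except 'skipToken'.
    // Earlier definitions win, so order the definitions from most to least specific.
    public static IEnumerable<(string Token, string Contents, int Line, int Column)> Read(
        TextReader reader, (string Pattern, string Token)[] defs, string skipToken)
    {
        // \G pins each match to the start position passed to Match().
        var compiled = new List<(Regex Pattern, string Token)>();
        foreach (var def in defs)
            compiled.Add((new Regex(@"\G(?:" + def.Pattern + ")"), def.Token));

        string line;
        int lineNumber = 0;
        while ((line = reader.ReadLine()) != null)
        {
            lineNumber++;
            int pos = 0;
            while (pos < line.Length)
            {
                Match hit = null;
                string token = null;
                foreach (var (pattern, name) in compiled)
                {
                    Match m = pattern.Match(line, pos);
                    if (m.Success && m.Length > 0) { hit = m; token = name; break; }
                }
                if (hit == null)
                    throw new FormatException(
                        $"No token matches at line {lineNumber}, column {pos + 1}");
                if (token != skipToken)
                    yield return (token, hit.Value, lineNumber, pos + 1); // columns are 1-based
                pos += hit.Length;
            }
        }
    }
}
```

Like the Lexer above, this sketch reads line by line, so a token cannot span a line break. Passing "SPACE" as skipToken drops the whitespace tokens before they reach the parser.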

Answer 2

Answer by Kent Boogaart

If you take a look at the ExpressionConverter in my WPF Converters library, it has basic lexing and parsing of C# expressions. No regex involved, from memory.

Answer 3

Answer by Juliet

Unless you have a very unconventional grammar, I'd strongly recommend not rolling your own lexer/parser.

I usually find lexer/parsers for C# are really lacking. However, F# comes with fslex and fsyacc, which you can learn how to use in this tutorial. I've written several lexer/parsers in F# and used them in C#, and it's very easy to do.

I suppose it's not really a poor man's lexer/parser, seeing that you have to learn an entirely new language to get started, but it's a start.

Answer 4

Answer by Chris S

Changing my original answer.

Take a look at SharpTemplate, which has parsers for different syntax types, e.g.

#foreach ($product in $Products)
   <tr><td>$product.Name</td>
   #if ($product.Stock > 0)
      <td>In stock</td>
   #else
     <td>Backordered</td>
   #end
  </tr>
#end

It uses regexes for each type of token:

public class Velocity : SharpTemplateConfig
{
    public Velocity()
    {
        AddToken(TemplateTokenType.ForEach, @"#(foreach|{foreach})\s+\(\s*(?<iterator>[a-z_][a-z0-9_]*)\s+in\s+(?<expr>.*?)\s*\)", true);
        AddToken(TemplateTokenType.EndBlock, @"#(end|{end})", true);
        AddToken(TemplateTokenType.If, @"#(if|{if})\s+\((?<expr>.*?)\s*\)", true);
        AddToken(TemplateTokenType.ElseIf, @"#(elseif|{elseif})\s+\((?<expr>.*?)\s*\)", true);
        AddToken(TemplateTokenType.Else, @"#(else|{else})", true);
        AddToken(TemplateTokenType.Expression, @"${(?<expr>.*?)}", false);
        AddToken(TemplateTokenType.Expression, @"$(?<expr>[a-zA-Z_][a-zA-Z0-9_\.@]*?)(?![a-zA-Z0-9_\.@])", false);
    }
}

It is used like this:

foreach (Match match in regex.Matches(inputString))
{
    ...

    switch (tokenMatch.TokenType)
    {
        case TemplateTokenType.Expression:
            {
                currentNode.Add(new ExpressionNode(tokenMatch));
            }
            break;

        case TemplateTokenType.ForEach:
            {
                nodeStack.Push(currentNode);

                currentNode = currentNode.Add(new ForEachNode(tokenMatch));
            }
            break;
        ....
    }

    ....
}

It pushes and pops from a Stack to keep state.
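
The push/pop idea can be sketched in isolation. The following is a minimal illustration under stated assumptions, not SharpTemplate's actual code: Node, BlockParser, and the simplified token regex are hypothetical. Block openers push the current node and descend, #end pops back to the parent, and $variables become leaf nodes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical node type for illustration; not SharpTemplate's actual classes.
public sealed class Node
{
    public readonly string Kind;
    public readonly string Text;
    public readonly List<Node> Children = new List<Node>();
    public Node(string kind, string text) { Kind = kind; Text = text; }
}

public static class BlockParser
{
    // One alternation with named groups: block openers, block closers, and $variables.
    private static readonly Regex TokenRegex = new Regex(
        @"(?<open>#(?:if|foreach)\b)|(?<close>#end\b)|(?<expr>\$[A-Za-z_][A-Za-z0-9_.]*)");

    public static Node Parse(string input)
    {
        var root = new Node("root", "");
        var current = root;
        var stack = new Stack<Node>();

        foreach (Match m in TokenRegex.Matches(input))
        {
            if (m.Groups["open"].Success)
            {
                // Entering a block: remember the parent, then descend into the new node.
                var block = new Node("block", m.Value);
                current.Children.Add(block);
                stack.Push(current);
                current = block;
            }
            else if (m.Groups["close"].Success)
            {
                if (stack.Count == 0)
                    throw new FormatException("#end without a matching block at index " + m.Index);
                current = stack.Pop(); // leaving the block: return to the parent
            }
            else
            {
                current.Children.Add(new Node("expr", m.Value));
            }
        }
        if (stack.Count != 0)
            throw new FormatException("Unclosed block: " + current.Text);
        return root;
    }
}
```

The stack depth at any point equals the nesting depth of the template, which is why an unbalanced #end or a missing one can be detected with a simple count check.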

Answer 5

Answer by the_e

It is possible to use Flex and Bison for C#.

A researcher at the University of Ireland has developed a partial implementation that can be found at the following link: Flex/Bison for C#

It could definitely be considered a 'poor man's lexer', as the implementation still seems to have some issues, such as no preprocessor and problems with the 'dangling else' case.

Answer 6

Answer by Kieron

Malcolm Crowe has a great LEX/YACC implementation for C# here. Works by creating regular expressions for the LEX...

Direct download

Answer 7

Answer by Andy Dent

It may be overkill, but have a look at Irony on CodePlex.

Irony is a development kit for implementing languages on .NET platform. It uses the flexibility and power of c# language and .NET Framework 3.5 to implement a completely new and streamlined technology of compiler construction. Unlike most existing yacc/lex-style solutions Irony does not employ any scanner or parser code generation from grammar specifications written in a specialized meta-language. In Irony the target language grammar is coded directly in c# using operator overloading to express grammar constructs. Irony's scanner and parser modules use the grammar encoded as c# class to control the parsing process. See the expression grammar sample for an example of grammar definition in c# class, and using it in a working parser.
