Note: this page reproduces a popular StackOverFlow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverFlow original: http://stackoverflow.com/questions/38343706/

Time: 2020-09-02 10:26:15  Source: igfitidea

Lexical Analyzer C program for identifying tokens

Tags: c, lexical-analysis

Asked by Manoj Kandala

I wrote a small C program for a lexical analyzer that will identify keywords, identifiers and constants. I am taking a string (C source code as a string) and then splitting it into words.

#include <stdio.h>
#include <conio.h>
#include <string.h>

char symTable[5][7] = { "int", "void", "float", "char", "string" };

int main() {
    int i, j, k = 0, flag = 0;
    char string[7];
    char str[] = "int main(){printf(\"Hello\");return 0;}";
    char *ptr;
    printf("Splitting string \"%s\" into tokens:\n", str);
    ptr = strtok(str, " (){};""");
    printf("\n\n");
    while (ptr != NULL) {
        printf ("%s\n", ptr);

        for (i = k; i < 5; i++) {
            memset(&string[0], 0, sizeof(string));
            for (j = 0; j < 7; j++) {
                string[j] = symTable[i][j];
            }

            if (strcmp(ptr, string) == 0) {
                printf("Keyword\n\n");
                break;
            } else
            if (string[j] == 0 || string[j] == 1 || string[j] == 2 ||
                string[j] == 3 || string[j] == 4 || string[j] == 5 ||
                string[j] == 6 || string[j] == 7 || string[j] == 8 ||
                string[j] == 9) {
                printf("Constant\n\n");
                break;
            } else {
                printf("Identifier\n\n");
                break;
            }
        }
        ptr = strtok(NULL, " (){};""");
        k++;
    }
    _getch();
    return 0;
}

With the above code, I am able to identify keywords and identifiers, but I couldn't obtain the result for numbers. I've tried using strspn() but to no avail. I even replaced 0, 1, 2, ..., 9 with '0', '1', ..., '9'.

Any help would be appreciated.

Answered by chqrlie

Here are some problems in your parser:

  • The test string[j] == 0 does not test if string[j] is the digit 0. The characters for digits are written '0' through '9'; their values are 48 to 57 in ASCII and UTF-8. Furthermore, you should be comparing *ptr instead of string[j] to test if the token starts with a digit, which indicates the start of a number.

  • Splitting the string with strtok() is not a good idea: it modifies the string and overwrites the first separator character with '\0'. This will prevent matching operators such as (, )...

  • The string " (){};""" is exactly the same as " (){};". In order to escape " inside strings, you must use \".
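To make the first and third points concrete, here is a minimal sketch. The helper names starts_with_digit, digit_value and delimiters_are_identical are made up for illustration and do not appear in the question's code.

```c
#include <ctype.h>
#include <string.h>

/* Returns nonzero if the token begins with a decimal digit.
   The cast to unsigned char avoids undefined behavior for negative char values. */
int starts_with_digit(const char *tok) {
    return isdigit((unsigned char)tok[0]) != 0;
}

/* Converts a digit character to its numeric value. The C standard guarantees
   that '0' through '9' are consecutive in every execution character set. */
int digit_value(char c) {
    return c - '0';
}

/* Adjacent string literals are concatenated by the compiler, so the
   trailing "" in the question's delimiter string adds nothing. */
int delimiters_are_identical(void) {
    return strcmp(" (){};""", " (){};") == 0;
}
```

Calling starts_with_digit(ptr) on each token would be one way to detect the start of a number without going through the keyword table at all.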

To write a lexer for C, you should switch on the first character and check the following characters depending on the value of the first character:

  • if you have white space, skip it.
  • if you have //, it is a line comment: skip all characters up to the newline.
  • if you have /*, it is a block comment: skip all characters until you get the pair */.
  • if you have a ', you have a character constant: parse the characters, handling escape sequences, until you get a closing '.
  • if you have a ", you have a string literal: do the same as for character constants.
  • if you have a digit, consume all subsequent digits: you have an integer. Parsing the full number syntax requires much more code; leave that for later.
  • if you have a letter or an underscore, consume all subsequent letters, digits and underscores, then compare the word with the set of predefined keywords. You have either a keyword or an identifier.
  • otherwise, you have an operator: check if the next characters are part of a 2 or 3 character operator, such as == and >>=.

That's about it for a simple C parser. The full syntax requires more work, but you will get there one step at a time.

Answered by Aleksandar Makragić

When you are writing a lexer, always create a dedicated function that finds your tokens (the name yylex is used by the Lex tool, which is why I used that name). Writing the lexer in main is not a smart idea, especially if you want to do syntax and semantic analysis later on.

From your question, it is not clear whether you just want to identify number tokens, or whether you also want to fetch the number's value along with the token. I will assume the former.

This example code finds whole numbers:

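#include <ctype.h>
#include <stdio.h>

int yylex(){

    /* We read one char from standard input; getchar() returns an int
       so that EOF can be told apart from every valid character */
    int c = getchar();

    /* If we read a new line or reach end of file, we return the end of input token */
    if(c == '\n' || c == EOF)
        return EOI;

    /* If we see a digit on input, we cannot return the number token yet:
       more digits may follow. For example, the input could be 123a,
       which is a lexical error */
    if(isdigit(c)){

        while(isdigit(c = getchar()))
            ;

        /* Push back the first character that is not a digit */
        ungetc(c,stdin);
        return NUM;
    }

    /* Additional code for keywords, identifiers, errors, etc. */
}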

Tokens such as EOI and NUM should be defined at the top of the file. Later on, when you write the syntax analysis, you use these tokens to check whether the code conforms to the language syntax. In lexical analysis, single-character tokens are usually not given names at all: your lexer function would simply return ')', for example. Because of that, named tokens should be defined with values above 255. For example:

#define EOI 256
#define NUM 257
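
To illustrate why the named tokens start above 255, here is a hedged sketch of a string-based variant of the same idea. The function lex_string and the ID token are hypothetical additions for this example; only EOI and NUM come from the answer.

```c
#include <ctype.h>

#define EOI 256   /* end of input, value taken from the answer */
#define NUM 257   /* whole number, value taken from the answer */
#define ID  258   /* identifier: hypothetical extra token, not in the answer */

/* Hypothetical string-based variant of yylex: reads one token from *src
   and advances *src past it. */
int lex_string(const char **src) {
    const char *p = *src;
    int token;

    while (*p == ' ' || *p == '\t')      /* skip blanks */
        p++;
    if (*p == '\0' || *p == '\n') {
        token = EOI;
    } else if (isdigit((unsigned char)*p)) {
        while (isdigit((unsigned char)*p)) p++;
        token = NUM;
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        while (isalnum((unsigned char)*p) || *p == '_') p++;
        token = ID;
    } else {
        token = (unsigned char)*p++;     /* single characters stand for themselves */
    }
    *src = p;
    return token;
}
```

Because every single character fits in the range 0 to 255, a return value of ')' can never be confused with NUM or EOI, so a parser can switch on the result directly.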

If you have any further questions, feel free to ask.

Answered by Basile Starynkevitch

string[j]==1

This test is wrong (1) (on all C implementations I have heard of), since string[j] is some char, e.g. using ASCII (or UTF-8, or even the old EBCDIC used on IBM mainframes) encoding, and the encoding of the char digit 1 is not the number 1. On my Linux/x86-64 machine (and on most machines using ASCII or UTF-8, i.e. almost all of them), the character 1 is encoded as the byte of code 49 (that is, (char)49 == '1').

You probably want

string[j]=='1'

and you should consider using the standard isdigit (and related) functions.

Be aware that UTF-8 is used practically everywhere but is a multi-byte encoding (of displayable characters). See this answer.

Note (1): the string[j]==1 test is probably misplaced too! Perhaps you might test isdigit(*ptr) at some better place.
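
As one possible reading of that note, the classification could be moved into a small helper that checks the whole keyword table first and only then looks at the first character of the token. This is a sketch, not the asker's code: the names classify and keyword_table are hypothetical, and the table simply copies the entries of symTable from the question.

```c
#include <ctype.h>
#include <string.h>

/* Same entries as symTable in the question ("string" is not a C keyword,
   it is kept only to mirror the original table). */
static const char *const keyword_table[] = { "int", "void", "float", "char", "string" };

/* Classifies one token: keyword if it is in the table, constant if it
   starts with a digit, identifier otherwise. */
const char *classify(const char *tok) {
    size_t i;

    for (i = 0; i < sizeof keyword_table / sizeof keyword_table[0]; i++) {
        if (strcmp(tok, keyword_table[i]) == 0)
            return "Keyword";
    }
    /* The cast avoids undefined behavior for negative char values */
    if (isdigit((unsigned char)tok[0]))
        return "Constant";
    return "Identifier";
}
```

In the question's loop, printf("%s\n\n", classify(ptr)) could then replace the inner for loop, which also removes the need for the k counter.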

PS. Please get into the habit of compiling with all warnings and debug info (e.g. with gcc -Wall -Wextra -g if using GCC...) and use the debugger (e.g. gdb). You would have found your bug in less time than it took you to get an answer here.
