C: Reading from a text file and parsing lines into words

Question

Asked by user2203774

I'm a beginner in C and system programming. For a homework assignment, I need to write a program that reads input from stdin, parses lines into words, and sends the words to sort sub-processes using System V message queues (e.g., to count words). I got stuck at the input part. I'm trying to process the input, remove non-alpha characters, put all alpha words in lower case and, lastly, split each line into separate words. So far I can print all alpha words in lower case, but there are blank lines between some words, which I believe isn't correct. Can someone take a look and give me some suggestions?

Example from a text file: The Project Gutenberg EBook of The Iliad of Homer, by Homer

I think the correct output should be:

the
project
gutenberg
ebook
of
the
iliad
of
homer
by
homer

But my output is the following:

project
gutenberg
ebook
of
the
iliad
of
homer
                         <------ there is a blank line here
by
homer

I think the empty line is caused by the space between "," and "by". I tried things like "if isspace(c) then do nothing", but it doesn't work. My code is below. Any help or suggestion is appreciated.

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>
#include <string.h>


//Main Function
int main (int argc, char **argv)
{
    int c;
    char *input = argv[1];
    FILE *input_file;

    input_file = fopen(input, "r");

    if (input_file == 0)
    {
        //fopen returns 0, the NULL pointer, on failure
        perror("Cannot open input file");
        exit(-1);
    }
    else
    {        
        while ((c =fgetc(input_file)) != EOF )
        {
            //if it's an alpha, convert it to lower case
            if (isalpha(c))
            {
                c = tolower(c);
                putchar(c);
            }
            else if (isspace(c))
            {
                ;   //do nothing
            }
            else
            {
                c = '\n';
                putchar(c);
            }
        }
    }

    fclose(input_file);

    printf("\n");

    return 0;
}

EDIT

I edited my code and finally got the correct output:

int main (int argc, char **argv)
{
    int c;
    char *input = argv[1];
    FILE *input_file;

    input_file = fopen(input, "r");

    if (input_file == 0)
    {
        //fopen returns 0, the NULL pointer, on failure
        perror("Cannot open input file");
        exit(-1);
    }
    else
    {
        int found_word = 0;

        while ((c =fgetc(input_file)) != EOF )
        {
            //if it's an alpha, convert it to lower case
            if (isalpha(c))
            {
                found_word = 1;
                c = tolower(c);
                putchar(c);
            }
            else {
                if (found_word) {
                    putchar('\n');
                    found_word=0;
                }
            }

        }
    }

    fclose(input_file);

    printf("\n");

    return 0;
}
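
Both versions above pass argv[1] straight to fopen, so running the program with no argument hands fopen a null pointer, which is undefined behaviour. Since the assignment describes reading from stdin, one possible arrangement is to fall back to stdin when no file name is given. The following is only a sketch under that assumption; open_input is a hypothetical helper name, not part of the original program.

```c
#include <stdio.h>

/*
 * Hypothetical helper: return the file named by argv[1] if an argument
 * was supplied, otherwise stdin. Returns NULL (with errno set by fopen)
 * when the named file cannot be opened.
 */
static FILE *open_input(int argc, char **argv)
{
    if (argc < 2)
        return stdin;               /* no file name: read standard input */
    return fopen(argv[1], "r");
}
```

The caller would check the result for NULL and call perror before exiting, and would only fclose the stream when it is not stdin.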

Answer 1

Accepted answer by Rob

I think you just need to ignore any non-alpha character (!isalpha(c)) and otherwise convert to lowercase. In that case you will also need to keep track of whether you are currently inside a word.

int found_word = 0;

while ((c =fgetc(input_file)) != EOF )
{
    if (!isalpha(c))
    {
        if (found_word) {
            putchar('\n');
            found_word = 0;
        }
    }
    else {
        found_word = 1;
        c = tolower(c);
        putchar(c);
    }
}

If you need to handle apostrophes within words such as "isn't" then this should do it -

int found_word = 0;
int found_apostrophe = 0;

while ((c = fgetc(input_file)) != EOF)
{
    if (!isalpha(c))
    {
        if (found_word) {
            if (!found_apostrophe && c == '\'') {
                //remember the apostrophe; print it only if a letter follows
                found_apostrophe = 1;
            }
            else {
                //any other separator ends the current word
                found_apostrophe = 0;
                putchar('\n');
                found_word = 0;
            }
        }
    }
    else {
        if (found_apostrophe) {
            putchar('\'');
            found_apostrophe = 0;
        }
        found_word = 1;
        c = tolower(c);
        putchar(c);
    }
}
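
The loops above print each word as it is found. Because the assignment eventually needs each word as a string (to send to a message queue), the same flag idea can be turned into a function that collects letters into a buffer. This is a minimal sketch, not the answerer's code: the name next_word, its signature, and the choice to silently drop letters that do not fit are all assumptions made for illustration.

```c
#include <ctype.h>
#include <stdio.h>

/*
 * Hypothetical helper: read the next word from `in` into `buf`.
 * Letters are lowercased and every non-letter acts as a separator.
 * At most size - 1 letters are stored (extra letters of an overlong
 * word are dropped) and the result is always NUL-terminated.
 * Returns 1 if a word was stored, 0 once the input is exhausted.
 */
static int next_word(FILE *in, char *buf, size_t size)
{
    size_t len = 0;
    int c;

    if (size == 0)
        return 0;

    while ((c = fgetc(in)) != EOF) {
        if (isalpha(c)) {
            if (len + 1 < size)
                buf[len++] = (char)tolower(c);
        } else if (len > 0) {
            break;      /* separator after at least one letter: word done */
        }
        /* separators seen before any letter are simply skipped */
    }
    buf[len] = '\0';
    return len > 0;
}
```

A caller could then write something like: char word[64]; while (next_word(stdin, word, sizeof word)) puts(word); and replace puts with whatever sends the word onward.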

Answer 2

Answer by abarnert

I suspect you really want to handle all non-alphabetical characters as separators, not just handle spaces as separators and ignore non-alphabetical characters. Otherwise, foo--bar would show up as a single word foobar, right? The good news is, that makes things easier. You can remove the isspace clause, and just use the else clause.

Meanwhile, whether you treat punctuation specially or not, you've got a problem: you print a newline for any space at all. So a line that ends with \r\n or \n, or even a sentence that ends with a period (.), will print a blank line. The obvious way around that is to keep track of the last character, or a flag, so you only print a newline if you've previously printed a letter.

For example:

int last_c = 0;

while ((c = fgetc(input_file)) != EOF)
{
    //if it's an alpha, convert it to lower case
    if (isalpha(c))
    {
        c = tolower(c);
        putchar(c);
    }
    //otherwise end the word, but only if the previous character was a letter
    else if (isalpha(last_c))
    {
        putchar('\n');
    }
    last_c = c;
}

But do you really want to treat all punctuation the same? The problem statement implies that you do, but in real life, that's a bit odd. For example, foo--bar should probably show up as separate words foo and bar, but should it's really show up as separate words it and s? For that matter, using isalpha as your rule for "word characters" also means that, say, 2nd will show up as nd.

So, if isalpha isn't the appropriate rule for your use case to distinguish word characters from separator characters, you'll have to write your own function that makes the right distinction. You can easily express such a rule in logic (e.g., isalnum(c) || c == '\'') or with a table (just an array of 128 ints, so the function is c >= 0 && c < 128 && word_char_table[c]). Doing things that way has the added benefit that you can later extend your code to deal with Latin-1 or Unicode, or to handle program text (which has different word characters than English language text), or …
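
As a rough illustration of the table approach, the sketch below fills a 128-entry table using the example rule from the paragraph above (letters, digits, and the apostrophe) and wraps the lookup in a function. The names init_word_char_table and is_word_char are hypothetical, and the rule itself is only one possible choice.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical lookup table: a nonzero entry means "part of a word". */
static unsigned char word_char_table[128];

/* Fill the table once, before the first lookup. */
static void init_word_char_table(void)
{
    for (int c = 0; c < 128; c++)
        word_char_table[c] = (isalnum(c) || c == '\'') ? 1 : 0;
}

/* Safe for any int, including EOF and values above 127. */
static int is_word_char(int c)
{
    return c >= 0 && c < 128 && word_char_table[c];
}
```

With this in place, the isalpha(c) tests in the reading loop could be replaced by is_word_char(c), and changing the rule later only means changing how the table is filled.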

Answer 3

Answer by P0W

It appears that you are separating words by spaces, so I think just

while ((c =fgetc(input_file)) != EOF )
{
    if (isalpha(c))
    {
        c = tolower(c);
        putchar(c);
    }
    else if (isspace(c))
    {
       putchar('\n');
    }
}

will work too, provided your input text never has more than one whitespace character in a row between words.
