Original question: http://stackoverflow.com/questions/18109458/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Read from a text file and parse lines into words in C
Asked by user2203774
I'm a beginner in C and system programming. For a homework assignment, I need to write a program that reads input from stdin, parses lines into words, and sends the words to sort sub-processes using System V message queues (e.g., to count words). I got stuck at the input part. I'm trying to process the input, remove non-alpha characters, put all alpha words in lower case, and lastly split each line into separate words. So far I can print all alpha words in lower case, but there are blank lines between some words, which I believe isn't correct. Can someone take a look and give me some suggestions?
Example from a text file: The Project Gutenberg EBook of The Iliad of Homer, by Homer
I think the correct output should be:
the
project
gutenberg
ebook
of
the
iliad
of
homer
by
homer
But my output is the following:
project
gutenberg
ebook
of
the
iliad
of
homer
<------ there is a blank line here
by
homer
I think the empty line is caused by the space between "," and "by". I tried things like "if isspace(c) then do nothing", but it doesn't work. My code is below. Any help or suggestion is appreciated.
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>
#include <string.h>
//Main Function
int main (int argc, char **argv)
{
    int c;
    char *input = argv[1];
    FILE *input_file;
    input_file = fopen(input, "r");
    if (input_file == 0)
    {
        //fopen returns 0, the NULL pointer, on failure
        perror("Cannot open input file");
        exit(-1);
    }
    else
    {
        while ((c = fgetc(input_file)) != EOF)
        {
            //if it's an alpha, convert it to lower case
            if (isalpha(c))
            {
                c = tolower(c);
                putchar(c);
            }
            else if (isspace(c))
            {
                ; //do nothing
            }
            else
            {
                c = '\n';
                putchar(c);
            }
        }
    }
    fclose(input_file);
    printf("\n");
    return 0;
}
EDIT:
I edited my code and finally got the correct output:
int main (int argc, char **argv)
{
    int c;
    char *input = argv[1];
    FILE *input_file;
    input_file = fopen(input, "r");
    if (input_file == 0)
    {
        //fopen returns 0, the NULL pointer, on failure
        perror("Cannot open input file");
        exit(-1);
    }
    else
    {
        //set while we are inside a word, cleared after its newline is printed
        int found_word = 0;
        while ((c = fgetc(input_file)) != EOF)
        {
            //if it's an alpha, convert it to lower case
            if (isalpha(c))
            {
                found_word = 1;
                c = tolower(c);
                putchar(c);
            }
            else {
                if (found_word) {
                    putchar('\n');
                    found_word = 0;
                }
            }
        }
    }
    fclose(input_file);
    printf("\n");
    return 0;
}
Accepted answer by Rob
I think you just need to ignore any non-alpha character (!isalpha(c)) and otherwise convert to lowercase. You will also need to keep track of whether you are currently inside a word, so that a newline is printed only once after each word.
int found_word = 0;
while ((c = fgetc(input_file)) != EOF)
{
    if (!isalpha(c))
    {
        if (found_word) {
            putchar('\n');
            found_word = 0;
        }
    }
    else {
        found_word = 1;
        c = tolower(c);
        putchar(c);
    }
}
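As a minimal sketch of the same flag technique, the loop can be wrapped in a function that works on a string instead of a file, which makes it easy to check against the sample line from the question. The name words_to_lines and its buffer-based interface are assumptions made for illustration, not part of the original answer. The cast to unsigned char is there because the <ctype.h> functions are only defined for values representable as unsigned char (or EOF).

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: copies the words of `text` into `out`, lowercased,
 * one word per line. Returns the number of words found. Output that does
 * not fit in `out_size` bytes is silently truncated. */
size_t words_to_lines(const char *text, char *out, size_t out_size)
{
    size_t used = 0;     /* bytes written so far, excluding the terminator */
    size_t count = 0;    /* number of words seen */
    int found_word = 0;  /* nonzero while we are inside a word */

    if (out_size == 0)
        return 0;

    for (; *text != '\0'; text++) {
        unsigned char c = (unsigned char)*text;

        if (isalpha(c)) {
            if (!found_word) {
                found_word = 1;
                count++;
            }
            if (used + 1 < out_size)
                out[used++] = (char)tolower(c);
        } else if (found_word) {
            /* First separator after a word: end the line exactly once. */
            if (used + 1 < out_size)
                out[used++] = '\n';
            found_word = 0;
        }
    }

    /* Terminate the last word if the text ended in the middle of one. */
    if (found_word && used + 1 < out_size)
        out[used++] = '\n';

    out[used] = '\0';
    return count;
}
```

Calling it on "The Iliad of Homer, by Homer" with a large enough buffer should give six lines with no blank line between "homer" and "by", because the comma ends the word and the following space is then ignored.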
If you need to handle apostrophes within words such as "isn't", then this should do it:
int found_word = 0;
int found_apostrophe = 0;
while ((c = fgetc(input_file)) != EOF)
{
    if (!isalpha(c))
    {
        if (found_word) {
            if (!found_apostrophe && c == '\'') {
                found_apostrophe = 1;
            }
            else {
                found_apostrophe = 0;
                putchar('\n');
                found_word = 0;
            }
        }
    }
    else {
        if (found_apostrophe) {
            putchar('\'');
            found_apostrophe = 0;
        }
        found_word = 1;
        c = tolower(c);
        putchar(c);
    }
}
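When the whole line is already in memory, the same rule can be written with a one-character lookahead instead of the delayed found_apostrophe flag. The following is only a sketch of that alternative, not the answer's method: the function name split_keep_apostrophes and the requirement that the caller supplies a buffer of at least strlen(text) + 2 bytes are assumptions made for illustration.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: lowercases the words of `text`, one per line, and
 * keeps an apostrophe only when it sits directly between two letters.
 * `out` must have room for strlen(text) + 2 bytes. */
void split_keep_apostrophes(const char *text, char *out)
{
    size_t used = 0;
    int in_word = 0;  /* nonzero while we are inside a word */

    for (size_t i = 0; text[i] != '\0'; i++) {
        unsigned char c = (unsigned char)text[i];
        /* Safe: text[i] is not the terminator, so text[i + 1] exists. */
        unsigned char next = (unsigned char)text[i + 1];

        if (isalpha(c)) {
            out[used++] = (char)tolower(c);
            in_word = 1;
        } else if (c == '\'' && in_word && isalpha(next)) {
            /* Apostrophe inside a word, as in "isn't": keep it. */
            out[used++] = '\'';
        } else if (in_word) {
            /* Any other character ends the current word once. */
            out[used++] = '\n';
            in_word = 0;
        }
    }

    if (in_word)
        out[used++] = '\n';
    out[used] = '\0';
}
```

With this rule, quotation marks around a word (as in 'odd') are dropped, while the apostrophe in "isn't" survives. The streaming version above cannot look ahead without reading another character, which is why it remembers the apostrophe in a flag instead.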
Answer by abarnert
I suspect you really want to handle all non-alphabetical characters as separators, not just handle spaces as separators and ignore other non-alphabetical characters. Otherwise, foo--bar would show up as a single word foobar, right? The good news is that this makes things easier: you can remove the isspace clause and just use the else clause.
Meanwhile, whether you treat punctuation specially or not, you've got a problem: you print a newline for every separator, no matter what came before it. So a line that ends with \r\n or \n, or even a sentence that ends with '.', will print a blank line. The obvious way around that is to keep track of the last character, or a flag, so you only print a newline if you've just printed a letter.
For example:
int last_c = 0;
while ((c = fgetc(input_file)) != EOF)
{
    //if it's an alpha, convert it to lower case
    if (isalpha(c))
    {
        c = tolower(c);
        putchar(c);
    }
    //otherwise end the word, but only if the previous character was a letter
    else if (isalpha(last_c))
    {
        putchar('\n');
    }
    last_c = c;
}
But do you really want to treat all punctuation the same? The problem statement implies that you do, but in real life, that's a bit odd. For example, foo--bar should probably show up as separate words foo and bar, but should it's really show up as separate words it and s? For that matter, using isalpha as your rule for "word characters" also means that, say, 2nd will show up as nd.
So, if isalpha isn't the appropriate rule for your use case to distinguish word characters from separator characters, you'll have to write your own function that makes the right distinction. You can easily express such a rule in logic (e.g., isalnum(c) || c == '\'') or with a table (just an array of 128 ints, so the function is c >= 0 && c < 128 && word_char_table[c]). Doing things that way has the added benefit that you can later extend your code to deal with Latin-1 or Unicode, or to handle program text (which has different word characters than English language text), or …
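A minimal sketch of the table-based variant, assuming the rule "letters, digits, and the apostrophe are word characters". The names word_char_table, init_word_char_table, and is_word_char are illustrative; only the range check and table lookup expression come from the answer. The table is filled at run time here, and a smaller element type than int is used, which does not change the lookup.

```c
#include <ctype.h>

/* Hypothetical lookup table: nonzero entries mark ASCII word characters. */
static unsigned char word_char_table[128];

/* Fill the table once before use: letters, digits, and the apostrophe. */
void init_word_char_table(void)
{
    for (int c = 0; c < 128; c++)
        word_char_table[c] = (unsigned char)(isalnum(c) != 0);
    word_char_table['\''] = 1;
}

/* Returns 1 for word characters; 0 for separators, EOF, and any value
 * outside the 7-bit ASCII range. */
int is_word_char(int c)
{
    return c >= 0 && c < 128 && word_char_table[c];
}
```

To use it, call init_word_char_table() once at startup and replace isalpha(c) with is_word_char(c) in the reading loop. Because the range check comes first, it is safe to pass the int returned by fgetc directly, including EOF. Changing the rule later then only means changing how the table is filled.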
Answer by P0W
It appears that you are separating words by spaces, so I think just
while ((c = fgetc(input_file)) != EOF)
{
    if (isalpha(c))
    {
        c = tolower(c);
        putchar(c);
    }
    else if (isspace(c))
    {
        putchar('\n');
    }
}
will work too, provided your input text never has more than one whitespace character between words.

