How to Read/Write UTF8 text files in C?
Original question: http://stackoverflow.com/questions/21737906/
Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the original address and author information, and attribute it to the original authors on StackOverflow.
Asked by user2768374
I am trying to read UTF8 text from a text file, and then print some of it to another file. I am using Linux and the gcc compiler. This is the code I am using:
#include <stdio.h>
#include <stdlib.h>
int main(){
    FILE *fin;
    FILE *fout;
    int character;
    fin=fopen("in.txt", "r");
    fout=fopen("out.txt","w");
    while((character=fgetc(fin))!=EOF){
        putchar(character); // It displays the right character (UTF8) in the terminal
        fprintf(fout,"%c ",character); // It displays weird characters in the file
    }
    fclose(fin);
    fclose(fout);
    printf("\nFile has been created...\n");
    return 0;
}
For now it works only for English characters.
Accepted answer by user2768374
This code worked for me:
/* fgetwc example */
#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <locale.h>
int main ()
{
    setlocale(LC_ALL, "en_US.UTF-8");
    FILE * fin;
    FILE * fout;
    wint_t wc;
    fin=fopen ("in.txt","r");
    fout=fopen("out.txt","w");
    while((wc=fgetwc(fin))!=WEOF){
        // work with: "wc"
    }
    fclose(fin);
    fclose(fout);
    printf("File has been created...\n");
    return 0;
}
Answer by Josh Durham
Instead of
fprintf(fout,"%c ",character);
use
fprintf(fout,"%c",character);
The second fprintf() does not contain a space after %c, and that space is what was causing out.txt to display weird characters. The reason is that fgetc() retrieves a single byte (the same size as an ASCII character), not a UTF-8 character. A non-ASCII UTF-8 character is encoded as two to four bytes, so writing a space after every byte splits those sequences apart and leaves invalid UTF-8 in the file. Since UTF-8 is ASCII compatible, English characters are single bytes and are written to the file just fine.
putchar(character) outputs the bytes sequentially without the extra space between every byte, so the original UTF-8 sequence remains intact. To see what I'm talking about, try
while((character=fgetc(fin))!=EOF){
    putchar(character);
    printf(" "); // This mimics what you are doing when you write to out.txt
    fprintf(fout,"%c ",character);
}
If you want to write UTF-8 characters to out.txt with a space between them, you need to handle the variable-length encoding of a UTF-8 character.
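#include <stdio.h>
#include <stdlib.h>
/* The first byte of a UTF-8 character
 * indicates how many bytes are in
 * the character, so only check that
 */
int numberOfBytesInChar(unsigned char val) {
    if (val < 128) {
        return 1;
    } else if (val < 224) {
        return 2;
    } else if (val < 240) {
        return 3;
    } else {
        return 4;
    }
}
int main(){
    FILE *fin;
    FILE *fout;
    int character;
    fin = fopen("in.txt", "r");
    fout = fopen("out.txt","w");
    if (fin == NULL || fout == NULL) {
        perror("fopen");
        return EXIT_FAILURE;
    }
    while( (character = fgetc(fin)) != EOF) {
        /* Compute the length once, from the first byte only; the
         * following bytes must not be passed to numberOfBytesInChar() */
        int length = numberOfBytesInChar((unsigned char)character);
        /* Write every byte of the character except the last one */
        for (int i = 0; i < length - 1; i++) {
            putchar(character);
            fprintf(fout, "%c", character);
            character = fgetc(fin);
            if (character == EOF) {
                break; /* the file ended in the middle of a character */
            }
        }
        if (character == EOF) {
            break;
        }
        /* Write the last byte, followed by the separating space */
        putchar(character);
        printf(" ");
        fprintf(fout, "%c ", character);
    }
    fclose(fin);
    fclose(fout);
    printf("\nFile has been created...\n");
    return 0;
}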
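The length check in numberOfBytesInChar() above accepts any byte of 128 or more as the start of a character, including continuation bytes (0x80 to 0xBF). As a rough sketch of a stricter variant, the two helpers below reject bytes that cannot start a sequence and work on an in-memory string instead of a file. The names utf8_seq_len and utf8_space_out are made up for this illustration and are not part of any library; the sketch checks lead and continuation bytes only, and does not catch every malformed sequence (for example, some overlong encodings).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of bytes in the UTF-8 sequence that
 * starts with lead byte b, or 0 if b cannot start a sequence. */
int utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;                 /* plain ASCII */
    if (b >= 0xC2 && b <= 0xDF) return 2;
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;
    return 0;   /* continuation byte (0x80-0xBF) or invalid lead byte */
}

/* Hypothetical helper: copy the NUL-terminated string src into dst,
 * writing one space after every character (not after every byte).
 * Returns the number of characters copied, or -1 if the input is
 * malformed or dst (of size cap) is too small. */
long utf8_space_out(const char *src, char *dst, size_t cap)
{
    size_t n = strlen(src);
    size_t i = 0;   /* read position in src */
    size_t o = 0;   /* write position in dst */
    long count = 0;

    while (i < n) {
        int len = utf8_seq_len((unsigned char)src[i]);
        if (len == 0 || i + (size_t)len > n)
            return -1;                      /* bad lead byte or truncated */
        for (int k = 1; k < len; k++) {
            /* every following byte must look like 10xxxxxx */
            if (((unsigned char)src[i + k] & 0xC0) != 0x80)
                return -1;
        }
        if (o + (size_t)len + 2 > cap)      /* bytes + space + final NUL */
            return -1;
        memcpy(dst + o, src + i, (size_t)len);
        o += (size_t)len;
        dst[o++] = ' ';
        i += (size_t)len;
        count++;
    }
    if (o >= cap)
        return -1;                          /* no room for the NUL */
    dst[o] = '\0';
    return count;
}
```

Returning 0 for a continuation byte lets the caller decide what to do with damaged input, such as skipping the byte or stopping, instead of silently treating it as the start of a two-byte character.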
Answer by Kev Youren
If you do not wish to use the wide options, experiment with the following:
Read and write bytes, not characters. In other words, use binary mode, not text mode.
fgetc returns one byte from the file as an int (0 to 255, or EOF), so store the result in an int rather than a char: where char is signed, a byte greater than 127 becomes negative, and 0xFF can be mistaken for EOF. fputc likewise takes an int and writes it as an unsigned char, so pass it the int you got from fgetc.
Also, in the open mode, try using binary, so try rb & wb rather than r & w. On Linux the b flag makes no difference, but it keeps the code portable to systems that translate line endings in text mode.
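The byte-oriented approach can be sketched as a small copy routine. This is only an illustration under the assumption that the bytes should pass through unchanged; copy_bytes is a hypothetical name, not a standard function.

```c
#include <stdio.h>

/* Hypothetical helper: copy every byte from in to out unchanged.
 * Returns the number of bytes copied, or -1 on a read or write error. */
long copy_bytes(FILE *in, FILE *out)
{
    long copied = 0;
    int c;  /* int, not char: it must hold 0..255 and also EOF */

    while ((c = fgetc(in)) != EOF) {
        if (fputc(c, out) == EOF)
            return -1;          /* write error */
        copied++;
    }
    /* fgetc returns EOF both at end of file and on error */
    return ferror(in) ? -1 : copied;
}
```

It would typically be called with streams opened as fopen("in.txt", "rb") and fopen("out.txt", "wb"). Because no byte is added or dropped between the bytes of a multi-byte character, any UTF-8 text in the input stays valid in the output.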
Answer by Renra
The C-style solution is very insightful, but if you would consider using C++, the task becomes much more high-level and does not require so much knowledge of the UTF-8 encoding. Consider the following:
#include <iostream>
#include <fstream>
#include <locale>
int main(){
    std::wifstream input { "in.txt" };
    std::wofstream output { "out.txt" };
    // Look out - this part is not portable to windows
    std::locale utf8 {"en_US.UTF-8"};
    input.imbue(utf8);
    output.imbue(utf8);
    std::wcout.imbue(utf8);
    wchar_t c;
    // noskipws keeps whitespace characters instead of skipping them
    while(input >> std::noskipws >> c) {
        std::wcout << c;
        output << c;
    }
    return 0;
}