Linux: How to read/write UTF8 text files in C?

Question

Asked by user2768374

I am trying to read UTF-8 text from a text file and then print some of it to another file. I am using Linux and the gcc compiler. This is the code I am using:

#include <stdio.h>
#include <stdlib.h>

int main(){
    FILE *fin;
    FILE *fout;
    int character;
    fin=fopen("in.txt", "r");
    fout=fopen("out.txt","w");
    while((character=fgetc(fin))!=EOF){
        putchar(character); // It displays the right character (UTF8) in the terminal
        fprintf(fout,"%c ",character); // It displays weird characters in the file
    }
    fclose(fin);
    fclose(fout);
    printf("\nFile has been created...\n");
    return 0;
}

So far it only works for English characters.

Answer 1

Accepted answer by user2768374

This code worked for me:

/* fgetwc example */
#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <locale.h>
int main ()
{
  setlocale(LC_ALL, "en_US.UTF-8");
  FILE * fin;
  FILE * fout;
  wint_t wc;
  fin=fopen ("in.txt","r");
  fout=fopen("out.txt","w");
  while((wc=fgetwc(fin))!=WEOF){
        // work with: "wc"
  }
  fclose(fin);
  fclose(fout);
  printf("File has been created...\n");
  return 0;
}

Answer 2

Answered by Josh Durham

Instead of

fprintf(fout,"%c ",character);

use

fprintf(fout,"%c",character);

The second fprintf() does not contain a space after %c, which is what was causing out.txt to display weird characters. The reason is that fgetc() retrieves a single byte (the same size as an ASCII character), not a whole UTF-8 character, so the space was being inserted between the bytes of every multi-byte character. Since UTF-8 is ASCII compatible, English characters are single bytes and are written to the file just fine.
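
To make the effect concrete, here is a small sketch that applies the same "one space after every byte" rule to a string in memory. The helper names space_after_each_byte and print_hex are made up for this illustration.

```c
#include <stdio.h>

/* Hypothetical helper: copies `in` to `out`, adding a space after every
 * byte, the same way fprintf(fout, "%c ", character) does in the question.
 * `out` must have room for 2 * strlen(in) + 1 bytes.
 * Returns the number of bytes written, not counting the terminator. */
size_t space_after_each_byte(const char *in, char *out)
{
    size_t n = 0;
    for (size_t i = 0; in[i] != '\0'; i++) {
        out[n++] = in[i];
        out[n++] = ' ';
    }
    out[n] = '\0';
    return n;
}

/* Hypothetical helper: prints each byte of `s` in hex so that a split
 * sequence is easy to see. */
void print_hex(const char *s)
{
    for (size_t i = 0; s[i] != '\0'; i++)
        printf("%02X ", (unsigned char)s[i]);
    printf("\n");
}
```

For example, "é" is the two bytes C3 A9 in UTF-8. After space_after_each_byte() the buffer holds C3 20 A9 20, which is no longer a valid UTF-8 sequence, while an ASCII string such as "ab" simply becomes "a b ".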

putchar(character) outputs the bytes sequentially, without an extra space between them, so the original UTF-8 sequence remains intact. To see what I'm talking about, try

while((character=fgetc(fin))!=EOF){
    putchar(character);
    printf(" "); // This mimics what you are doing when you write to out.txt
    fprintf(fout,"%c ",character);
}

If you want to write UTF-8 characters to out.txt with a space between them, you need to handle the variable-length encoding of UTF-8 characters.

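#include <stdio.h>
#include <stdlib.h>

/* The first byte of a UTF-8 character
 * indicates how many bytes are in
 * the character, so only check that.
 * Assumes val is a valid first byte (not a continuation byte).
 */
int numberOfBytesInChar(unsigned char val) {
    if (val < 128) {
        return 1;
    } else if (val < 224) {
        return 2;
    } else if (val < 240) {
        return 3;
    } else {
        return 4;
    }
}

int main(){
    FILE *fin;
    FILE *fout;
    int character;
    fin = fopen("in.txt", "r");
    fout = fopen("out.txt","w");
    if (fin == NULL || fout == NULL) {
        perror("fopen");
        return EXIT_FAILURE;
    }
    while( (character = fgetc(fin)) != EOF) {
        /* Compute the length once, from the first byte only. Re-evaluating
         * it in the loop condition would apply it to continuation bytes
         * and split characters of 3 or 4 bytes. */
        int length = numberOfBytesInChar((unsigned char)character);
        for (int i = 0; i < length - 1 && character != EOF; i++) {
            putchar(character);
            fprintf(fout, "%c", character);
            character = fgetc(fin);
        }
        if (character == EOF) {
            break; /* input ended in the middle of a character */
        }
        putchar(character);
        printf(" ");
        fprintf(fout, "%c ", character);
    }
    fclose(fin);
    fclose(fout);
    printf("\nFile has been created...\n");
    return 0;
}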
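
The length check above trusts the input. As a hedged sketch of a stricter variant, the functions below reject bytes that cannot start a UTF-8 sequence and stop copying a character as soon as a byte is not a continuation byte. Both names (utf8_sequence_length and space_after_each_char) are hypothetical, and the code works on an in-memory string rather than on files to keep it short.

```c
#include <stddef.h>

/* Length in bytes of the UTF-8 sequence that starts with `lead`, or 0 if
 * `lead` cannot start a sequence: continuation bytes 0x80-0xBF, the
 * overlong lead bytes 0xC0 and 0xC1, and anything above 0xF4. */
int utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if (lead < 0xC2) return 0;
    if (lead < 0xE0) return 2;
    if (lead < 0xF0) return 3;
    if (lead < 0xF5) return 4;
    return 0;
}

/* Copies `in` to `out`, adding one space after each character.
 * A byte that cannot start a sequence is copied as a unit of its own so
 * that no input is lost. `out` needs room for 2 * strlen(in) + 1 bytes.
 * Returns the number of bytes written, not counting the terminator. */
size_t space_after_each_char(const char *in, char *out)
{
    size_t n = 0, i = 0;

    while (in[i] != '\0') {
        int len = utf8_sequence_length((unsigned char)in[i]);
        if (len == 0)
            len = 1;
        out[n++] = in[i++];
        /* Copy the continuation bytes (10xxxxxx); this also stops at the
         * terminator or at a truncated sequence */
        for (int k = 1; k < len && ((unsigned char)in[i] & 0xC0) == 0x80; k++)
            out[n++] = in[i++];
        out[n++] = ' ';
    }
    out[n] = '\0';
    return n;
}
```

This sketch only checks the lead byte and the continuation-bit pattern; it does not reject every invalid encoding (for example, surrogate code points), so treat it as a starting point rather than a full validator.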

Answer 3

Answered by Kev Youren

If you do not wish to use the wide options, experiment with the following:

Read and write bytes, not characters. In other words, use binary mode, not text mode.

fgetc reads one byte from the file and returns it as an int, so store the result in an int rather than a char. With a char, a byte greater than 127 can turn negative, and the byte 0xFF can compare equal to EOF, which corrupts the data or ends the loop early. fputc likewise takes an int and writes it as an unsigned char, so bytes greater than 127 are written correctly when you pass it the int you got from fgetc.

Also, open the files in binary mode: use rb and wb rather than r and w. On Linux the b has no effect, but it keeps the code correct on platforms that distinguish text files from binary files.
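
A minimal sketch of the byte-oriented approach described above, assuming a plain byte-for-byte copy is all that is needed. The function name copy_bytes is hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: copies the file at src_path to dst_path one byte
 * at a time, without interpreting the bytes as characters.
 * Returns the number of bytes copied, or -1 on error. */
long copy_bytes(const char *src_path, const char *dst_path)
{
    FILE *fin = fopen(src_path, "rb");
    if (fin == NULL)
        return -1;

    FILE *fout = fopen(dst_path, "wb");
    if (fout == NULL) {
        fclose(fin);
        return -1;
    }

    long count = 0;
    int byte; /* int, not char, so that the byte 0xFF stays distinct from EOF */
    while ((byte = fgetc(fin)) != EOF) {
        if (fputc(byte, fout) == EOF) {
            count = -1;
            break;
        }
        count++;
    }
    if (ferror(fin))
        count = -1;

    fclose(fin);
    if (fclose(fout) == EOF)
        count = -1;
    return count;
}
```

Because no byte is added or dropped, any UTF-8 sequence in the input arrives in the output unchanged. The trade-off is that the code never knows where one character ends and the next begins.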

Answer 4

Answered by Renra

The C-style solution is very insightful, but if you'd consider using C++, the task becomes much higher level and does not require you to know as much about the UTF-8 encoding. Consider the following:

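#include <iostream>
#include <fstream>
#include <locale>

using namespace std;

int main(){
  wifstream input { "in.txt" };
  wofstream output { "out.txt" };

  // Look out - this part is not portable to windows
  // (the locale constructor throws if the locale is not installed)
  locale utf8 {"en_US.UTF-8"};

  input.imbue(utf8);
  output.imbue(utf8);
  wcout.imbue(utf8);

  wchar_t c;

  // noskipws keeps whitespace characters instead of skipping them
  while(input >> noskipws >> c) {
    wcout << c;
    output << c;
  }

  return 0;
}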
