How to read a Unicode (UTF-8) / binary file line by line in C

Question

Asked by Freeseif

Hi programmers,

I want to read, line by line, a Unicode (UTF-8) text file created by Notepad. I don't want to display the Unicode strings on screen; I just want to read and compare them.

This code reads an ANSI file line by line and compares the strings.

What I want

Read test_ansi.txt line by line

if the line = "b" print "YES!"

else print "NO!"

read_ansi_line_by_line.c

#include <stdio.h>
#include <string.h> /* strcmp */

int main()
{
    char *inname = "test_ansi.txt";
    FILE *infile;
    char line_buffer[BUFSIZ]; /* BUFSIZ is defined if you include stdio.h */
    int line_number; /* int, so the count does not overflow after 127 lines */

    infile = fopen(inname, "r");
    if (!infile) {
        printf("\nfile '%s' not found\n", inname);
        return 0;
    }
    printf("\n%s\n\n", inname);

    line_number = 0;
    while (fgets(line_buffer, sizeof(line_buffer), infile)) {
        ++line_number;
        /* note that the newline is in the buffer */
        if (strcmp("b\n", line_buffer) == 0 ){
            printf("%d: YES!\n", line_number);
        }else{
            printf("%d: NO!\n", line_number);
        }
    }
    printf("\n\nTotal: %d\n", line_number);
    fclose(infile);
    return 0;
}

test_ansi.txt

a
b
c

Compiling

gcc -o read_ansi_line_by_line read_ansi_line_by_line.c

Output

test_ansi.txt

1: NO!
2: YES!
3: NO!


Total: 3

Now I need to read a Unicode (UTF-8) file created by Notepad. After more than six months I have not found any good C code or library that can read a UTF-8 encoded file. I don't know exactly why, but I think standard C doesn't support Unicode!

Reading a Unicode binary file works, but the problem is that the file must already have been created in binary mode. That means that if we want to read a Unicode (UTF-8) file created by Notepad, we first need to convert it from a UTF-8 file to a binary file!

This code writes a Unicode string to a binary file. Note that the C source file is encoded in UTF-8 and compiled with GCC.

What I want

Write the Unicode char "?" to test_bin.dat

create_bin.c

#define UNICODE
#ifdef UNICODE
#define _UNICODE
#else
#define _MBCS
#endif

#include <stdio.h>
#include <wchar.h>

int main()
{
     /*Data to be stored in file*/
     wchar_t line_buffer[BUFSIZ]=L"?";
     /*Opening file for writing in binary mode*/
     FILE *infile=fopen("test_bin.dat","wb");
     if (!infile) {
         return 1; /* could not create the file */
     }
     /*Writing the first 13 bytes of the buffer: the character plus zero padding*/
     fwrite(line_buffer, 1, 13, infile);
     /*Closing File*/
     fclose(infile);

    return 0;
}

Compiling

gcc -o create_bin create_bin.c

Output

Creates test_bin.dat

Now I want to read the binary file line by line and compare!

What I want

Read test_bin.dat line by line; if the line = "?" print "YES!", else print "NO!"

read_bin_line_by_line.c

#define UNICODE
#ifdef UNICODE
#define _UNICODE
#else
#define _MBCS
#endif

#include <stdio.h>
#include <wchar.h>

int main()
{
    wchar_t *inname = L"test_bin.dat";
    FILE *infile;
    wchar_t line_buffer[BUFSIZ] = {0}; /* zeroed so the data read is always terminated */

    infile = _wfopen(inname,L"rb"); /* _wfopen is specific to Microsoft's C runtime */
    if (!infile) {
        wprintf(L"\nfile '%ls' not found\n", inname);
        return 0;
    }
    wprintf(L"\n%ls\n\n", inname);

    /*Reading data from file into temporary buffer*/
    while (fread(line_buffer,1,13,infile)) {
        /* the buffer now holds the 13 bytes written by create_bin.c */
        if ( wcscmp ( L"?" , line_buffer ) == 0 ){
             wprintf(L"YES!\n");
        }else{
             wprintf(L"NO!\n");
        }
    }
    /*Closing File*/
    fclose(infile);
    return 0;
}

Output

test_bin.dat

YES!

THE PROBLEM

This method is VERY LONG and NOT POWERFUL (I'm a beginner in software engineering).

Does anyone know how to read a Unicode file? (I know it's not easy!)
Does anyone know how to convert a Unicode file to a binary file? (a simple method)
Does anyone know how to read a Unicode file in binary mode? (I'm not sure)

Thank you.

Answer 1

Accepted answer by Freeseif

I found a solution to my problem, and I would like to share it with anyone interested in reading a UTF-8 file in C99.

#include <stdio.h>
#include <string.h>

void ReadUTF8(FILE* fp)
{
    unsigned char iobuf[255] = {0};
    while( fgets((char*)iobuf, sizeof(iobuf), fp) )
    {
            size_t len = strlen((char *)iobuf);
            if(len > 1 &&  iobuf[len-1] == '\n')
                iobuf[len-1] = 0;
            len = strlen((char *)iobuf);
            printf("(%zu) \"%s\"  ", len, iobuf);
            // a line holding only a newline is left untouched above
            if( iobuf[0] == '\n' )
                printf("Yes\n");
            else
                printf("No\n");
    }
}

void ReadUTF16BE(FILE* fp)
{
}

void ReadUTF16LE(FILE* fp)
{
}

int main()
{
    FILE* fp = fopen("test_utf8.txt", "r");
    if( fp != NULL)
    {
        // see http://en.wikipedia.org/wiki/Byte-order_mark for an explanation of the BOM
        // encoding
        unsigned char b[3] = {0};
        fread(b,1,2, fp);
        if( b[0] == 0xEF && b[1] == 0xBB)
        {
            fread(b,1,1,fp); // 0xBF
            ReadUTF8(fp);
        }
        else if( b[0] == 0xFE && b[1] == 0xFF)
        {
            ReadUTF16BE(fp);
        }
        else if( b[0] == 0xFF && b[1] == 0xFE)
        {
            // the UTF-16 little-endian BOM is FF FE
            ReadUTF16LE(fp);
        }
        else
        {
            // we don't know what kind of file it is, so assume it's standard
            // ASCII with no BOM encoding
            rewind(fp);
            ReadUTF8(fp);
        }

        fclose(fp); // only close a file that was actually opened
    }

    return 0;
}

Answer 2

Answer by robinr

A nice property of UTF-8 is that you do not need to decode it in order to compare it. The order returned from strcmp will be the same whether you decode it first or not. So just read it as raw bytes and run strcmp.
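A minimal sketch of that property, assuming the strings are written out as explicit UTF-8 byte escapes so the result does not depend on the source file's encoding. The names `E_ACUTE`, `EURO` and `utf8_order` are made up for this illustration. It relies on the fact that strcmp compares bytes as unsigned char.

```c
#include <string.h>

/* "é" is U+00E9, encoded in UTF-8 as the two bytes C3 A9.
   "€" is U+20AC, encoded in UTF-8 as the three bytes E2 82 AC. */
static const char E_ACUTE[] = "\xC3\xA9";
static const char EURO[]    = "\xE2\x82\xAC";

/* Hypothetical helper: reduce strcmp's result to -1, 0 or 1.
   No decoding takes place; the raw UTF-8 bytes are compared. */
int utf8_order(const char *a, const char *b)
{
    int r = strcmp(a, b);
    return (r > 0) - (r < 0);
}
```

Here `utf8_order("z", E_ACUTE)` and `utf8_order(E_ACUTE, EURO)` are both negative, which matches the code point order U+007A < U+00E9 < U+20AC, and comparing `EURO` with itself gives 0.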

Answer 3

Answer by Hans Passant

fgets() can decode UTF-8 encoded files if you use Visual Studio 2005 and up. Change your code like this:

infile = fopen(inname, "r, ccs=UTF-8");

Answer 4

Answer by Thorsten S.

This article presents an encoding and decoding routine and explains how Unicode is encoded:

http://www.codeguru.com/cpp/misc/misc/multi-lingualsupport/article.php/c10451/

It can easily be adapted to C. Simply encode your ANSI string or decode the UTF-8 string and do a byte comparison.

EDIT: After the OP said that it is too hard to rewrite the function from C++, here is a template:

What is needed:
+ Free the allocated memory (or wait till the process ends or ignore it)
+ Add the 4 byte functions
+ Tell me that short and int are not guaranteed to be 2 and 4 bytes long (I know, but C is really stupid!) and finally
+ Find some other errors

#include <stdlib.h>
#include <string.h>
#include <wchar.h>

#define         MASKBITS                0x3F
#define         MASKBYTE                0x80
#define         MASK2BYTES              0xC0
#define         MASK3BYTES              0xE0
#define         MASK4BYTES              0xF0
#define         MASK5BYTES              0xF8
#define         MASK6BYTES              0xFC

char* UTF8Encode2BytesUnicode(unsigned short* input)
{
   int size = 0,
       cindex = 0;
   while (input[size] != 0)
     size++;
   // Reserve enough space: at most 3 bytes per character plus the terminator
   char* result = (char*) malloc(size * 3 + 1);
   if (result == NULL)
      return NULL;
   for (int i=0; i<size; i++)
   {
      // 0xxxxxxx
      if(input[i] < 0x80)
      {
         result[cindex++] = ((char) input[i]);
      }
      // 110xxxxx 10xxxxxx
      else if(input[i] < 0x800)
      {
         result[cindex++] = ((char)(MASK2BYTES | input[i] >> 6));
         result[cindex++] = ((char)(MASKBYTE | (input[i] & MASKBITS)));
      }
      // 1110xxxx 10xxxxxx 10xxxxxx
      else
      {
         result[cindex++] = ((char)(MASK3BYTES | input[i] >> 12));
         result[cindex++] = ((char)(MASKBYTE | (input[i] >> 6 & MASKBITS)));
         result[cindex++] = ((char)(MASKBYTE | (input[i] & MASKBITS)));
      }
   }
   result[cindex] = '\0';
   return result;
}

wchar_t* UTF8Decode2BytesUnicode(char* input)
{
  size_t size = strlen(input);
  // size + 1 leaves room for the terminator; the caller frees the result
  wchar_t* result = (wchar_t*) malloc((size + 1) * sizeof(wchar_t));
  size_t rindex = 0,
         windex = 0;
  if (result == NULL)
    return NULL;
  while (rindex < size)
  {
      // Work on unsigned bytes so that values >= 0x80 are not negative
      const unsigned char* in = (const unsigned char*) input + rindex;
      wchar_t ch;

      // 1110xxxx 10xxxxxx 10xxxxxx
      if((in[0] & MASK3BYTES) == MASK3BYTES && rindex + 2 < size)
      {
         ch = ((in[0] & 0x0F) << 12) | (
               (in[1] & MASKBITS) << 6)
              | (in[2] & MASKBITS);
         rindex += 3;
      }
      // 110xxxxx 10xxxxxx
      else if((in[0] & MASK2BYTES) == MASK2BYTES && rindex + 1 < size)
      {
         ch = ((in[0] & 0x1F) << 6) | (in[1] & MASKBITS);
         rindex += 2;
      }
      // 0xxxxxxx, or a byte that cannot be decoded here: copy it through
      else
      {
         ch = in[0];
         rindex += 1;
      }

      result[windex++] = ch;
   }
   result[windex] = L'\0';
   return result;
}

// Still to be written, see the list above
char* UTF8Encode4BytesUnicode(unsigned int* input);

char* getUnicodeToUTF8(wchar_t* myString) {
  int size = sizeof(wchar_t);
  if (size == 1)
    return (char*) myString;
  else if (size == 2)
    return UTF8Encode2BytesUnicode((unsigned short*) myString);
  else
    return UTF8Encode4BytesUnicode((unsigned int*) myString);
}
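The template leaves the 4-byte case open. As a hedged sketch of how a single code point maps to one to four UTF-8 bytes, something along these lines could serve as a building block. `utf8_encode_codepoint` is a hypothetical name and is not part of the linked article; it rejects surrogates and values above U+10FFFF.

```c
#include <stddef.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
   Writes up to 4 bytes into out and returns how many were written,
   or 0 if cp is a surrogate (U+D800..U+DFFF) or above U+10FFFF. */
size_t utf8_encode_codepoint(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, U+20AC (the euro sign) comes out as the three bytes E2 82 AC, and U+1F600 as the four bytes F0 9F 98 80.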

Answer 5

Answer by elcuco

I know I am bad... but you don't even take the BOM into consideration! Most examples here will fail.

EDIT:

Byte Order Marks are a few bytes at the beginning of the file, which can be used to identify the encoding of the file. Some editors add them, and many times they just break things in fabulous ways (I remember fighting a PHP headers problem for several minutes because of this issue).
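To make the identification step concrete, here is a small sketch that classifies the first bytes of a file. The enum and the function name `detect_bom` are assumptions made for this example. Only the UTF-8 and UTF-16 marks are handled; UTF-32 is deliberately ignored, so a UTF-32 little-endian file would be reported as UTF-16 little-endian.

```c
#include <stddef.h>

enum bom_kind { BOM_NONE, BOM_UTF8, BOM_UTF16_LE, BOM_UTF16_BE };

/* Hypothetical helper: inspect the first n bytes of a file and report
   which byte order mark, if any, they start with. */
enum bom_kind detect_bom(const unsigned char *buf, size_t n)
{
    if (n >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return BOM_UTF8;        /* EF BB BF */
    if (n >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;    /* FF FE */
    if (n >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;    /* FE FF */
    return BOM_NONE;
}
```

A caller would typically read the first three bytes in binary mode, pass them to `detect_bom`, and then either continue after the mark or seek back to the start of the file when the result is `BOM_NONE`.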

Some RTFM:
http://en.wikipedia.org/wiki/Byte_order_mark
http://blogs.msdn.com/oldnewthing/archive/2004/03/24/95235.aspx
What is XML BOM and how do I detect it?

Answer 6

Answer by pm100

Just to settle the BOM argument, here is a file from Notepad:

 [paul@paul-es5 tests]$ od -t x1 /mnt/hgfs/cdrive/test.txt
 0000000 ef bb bf 61 0d 0a 62 0d 0a 63
 0000012

with a BOM at the start.

Personally I don't think there should be a BOM (since it's a byte format), but that's not the point.
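Putting the pieces of this thread together, a hedged sketch of the comparison loop for the bytes in the dump above (a BOM followed by CRLF-terminated lines) might look like this. `find_line` is a made-up name, and the function assumes every line fits into one `BUFSIZ` buffer.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return the 1-based number of the first line that
   equals target (raw UTF-8 bytes, no decoding), or 0 if none matches.
   A UTF-8 BOM on the first line and any trailing CR/LF are ignored. */
long find_line(FILE *fp, const char *target)
{
    char buf[BUFSIZ];
    long n = 0;

    while (fgets(buf, sizeof buf, fp)) {
        char *line = buf;
        size_t len;

        ++n;
        /* Notepad puts EF BB BF in front of the first line. */
        if (n == 1 && strncmp(line, "\xEF\xBB\xBF", 3) == 0)
            line += 3;

        /* Drop the "\n" or "\r\n" that fgets leaves in the buffer. */
        len = strlen(line);
        while (len > 0 && (line[len - 1] == '\n' || line[len - 1] == '\r'))
            line[--len] = '\0';

        if (strcmp(line, target) == 0)
            return n;
    }
    return 0;
}
```

With the file from the dump opened in binary mode, `find_line(fp, "b")` would return 2. Opening in binary mode ("rb") keeps the behaviour the same on Windows and elsewhere, because the CR bytes are then stripped by the loop and not by the C runtime.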
