How to read a Unicode (UTF-8) / binary file line by line in C
Source: http://stackoverflow.com/questions/2113270/
Note: this page is adapted from a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors (not me): StackOverFlow
Asked by Freeseif
Hi programmers,
I want to read, line by line, a Unicode (UTF-8) text file created by Notepad. I don't want to display the Unicode strings on the screen; I just want to read and compare them.
This code reads an ANSI file line by line and compares the strings.
What I want:
Read test_ansi.txt line by line
if the line = "b" print "YES!"
else print "NO!"
read_ansi_line_by_line.c
#include <stdio.h>
#include <string.h> /* strcmp */
int main()
{
    char *inname = "test_ansi.txt";
    FILE *infile;
    char line_buffer[BUFSIZ]; /* BUFSIZ is defined if you include stdio.h */
    int line_number; /* int rather than char, so the count can exceed 127 */
    infile = fopen(inname, "r");
    if (!infile) {
        printf("\nfile '%s' not found\n", inname);
        return 0;
    }
    printf("\n%s\n\n", inname);
    line_number = 0;
    while (fgets(line_buffer, sizeof(line_buffer), infile)) {
        ++line_number;
        /* note that the newline is in the buffer */
        if (strcmp("b\n", line_buffer) == 0 ){
            printf("%d: YES!\n", line_number);
        }else{
            printf("%d: NO!\n", line_number);
        }
    }
    printf("\n\nTotal: %d\n", line_number);
    fclose(infile);
    return 0;
}
test_ansi.txt
a
b
c
Compiling
gcc -o read_ansi_line_by_line read_ansi_line_by_line.c
Output
test_ansi.txt
1: NO!
2: YES!
3: NO!
Total: 3
Now I need to read a Unicode (UTF-8) file created by Notepad. After more than 6 months I haven't found any good code or library in C that can read a file encoded in UTF-8. I don't know exactly why, but I think standard C doesn't support Unicode!
Reading a Unicode binary file is OK, but the problem is that the binary file must already have been created in binary mode. That means if we want to read a Unicode (UTF-8) file created by Notepad, we need to translate it from a UTF-8 file to a binary file!
This code writes a Unicode string to a binary file. Note that the C file is encoded in UTF-8 and compiled by GCC.
What I want:
Write the Unicode char "?" to test_bin.dat
create_bin.c
#define UNICODE
#ifdef UNICODE
#define _UNICODE
#else
#define _MBCS
#endif
#include <stdio.h>
#include <wchar.h>
int main()
{
    /*Data to be stored in file*/
    wchar_t line_buffer[BUFSIZ]=L"?";
    /*Opening file for writing in binary mode*/
    FILE *infile=fopen("test_bin.dat","wb");
    if (!infile) {
        printf("\ncannot create test_bin.dat\n");
        return 1;
    }
    /*Writing data to file: the first 13 bytes of the buffer*/
    fwrite(line_buffer, 1, 13, infile);
    /*Closing File*/
    fclose(infile);
    return 0;
}
Compiling
gcc -o create_bin create_bin.c
Output
create test_bin.dat
Now I want to read the binary file line by line and compare!
What I want:
Read test_bin.dat line by line; if the line = "?" print "YES!", else print "NO!"
read_bin_line_by_line.c
#define UNICODE
#ifdef UNICODE
#define _UNICODE
#else
#define _MBCS
#endif
#include <stdio.h>
#include <wchar.h>
int main()
{
    wchar_t *inname = L"test_bin.dat";
    FILE *infile;
    wchar_t line_buffer[BUFSIZ]; /* BUFSIZ is defined if you include stdio.h */
    infile = _wfopen(inname,L"rb"); /* _wfopen is Microsoft-specific */
    if (!infile) {
        /* %ls is the standard conversion for a wide string */
        wprintf(L"\nfile '%ls' not found\n", inname);
        return 0;
    }
    wprintf(L"\n%ls\n\n", inname);
    /*Reading data from file into temporary buffer*/
    while (fread(line_buffer,1,13,infile)) {
        /* the buffer now holds the 13 raw bytes written by create_bin.c */
        if ( wcscmp ( L"?" , line_buffer ) == 0 ){
            wprintf(L"YES!\n");
        }else{
            wprintf(L"NO!\n");
        }
    }
    /*Closing File*/
    fclose(infile);
    return 0;
}
Output
test_bin.dat
YES!
THE PROBLEM
This method is VERY LONG and NOT POWERFUL (I'm a beginner in software engineering).
Does anyone know how to read a Unicode file? (I know it's not easy!) Does anyone know how to convert a Unicode file to a binary file? (a simple method) Does anyone know how to read a Unicode file in binary mode? (I'm not sure)
Thank you.
Accepted answer by Freeseif
I found a solution to my problem, and I would like to share it with anyone interested in reading a UTF-8 file in C99.
#include <stdio.h>
#include <string.h>
void ReadUTF8(FILE* fp)
{
    unsigned char iobuf[255] = {0};
    while( fgets((char*)iobuf, sizeof(iobuf), fp) )
    {
        size_t len = strlen((char *)iobuf);
        if(len > 1 && iobuf[len-1] == '\n')
            iobuf[len-1] = 0;
        len = strlen((char *)iobuf);
        printf("(%zu) \"%s\" ", len, iobuf); /* %zu is the C99 conversion for size_t */
        /* a line holding only '\n' is not stripped above, so this prints "Yes" for it */
        if( iobuf[0] == '\n' )
            printf("Yes\n");
        else
            printf("No\n");
    }
}
void ReadUTF16BE(FILE* fp)
{
    (void)fp; /* not implemented */
}
void ReadUTF16LE(FILE* fp)
{
    (void)fp; /* not implemented */
}
int main()
{
    FILE* fp = fopen("test_utf8.txt", "r");
    if( fp != NULL)
    {
        // see http://en.wikipedia.org/wiki/Byte-order_mark for an explanation of the BOM
        // encoding
        unsigned char b[3] = {0};
        fread(b,1,2, fp);
        if( b[0] == 0xEF && b[1] == 0xBB)
        {
            fread(b,1,1,fp); // 0xBF
            ReadUTF8(fp);
        }
        else if( b[0] == 0xFE && b[1] == 0xFF)
        {
            ReadUTF16BE(fp);
        }
        else if( b[0] == 0xFF && b[1] == 0xFE)
        {
            // the UTF-16 little-endian BOM is FF FE
            ReadUTF16LE(fp);
        }
        else
        {
            // we don't know what kind of file it is, so assume it's standard
            // ASCII with no BOM encoding
            rewind(fp);
            ReadUTF8(fp);
        }
        fclose(fp); // only close a file that was actually opened
    }
    return 0;
}
Answer by robinr
A nice property of UTF-8 is that you do not need to decode it in order to compare it. The order returned from strcmp will be the same whether you decode it first or not. So just read it as raw bytes and run strcmp.
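To illustrate that property, here is a minimal sketch that compares two UTF-8 strings by decoded code point, so the result can be checked against a plain strcmp on the raw bytes. The helper names `decode_one`, `compare_codepoints` and `sign` are hypothetical (they are not from the thread), and the decoder assumes well-formed UTF-8 and does no validation.

```c
#include <stdint.h>
#include <string.h>

/* Decode one code point from well-formed UTF-8 at *s and advance *s.
   Assumption: the input is valid UTF-8; no error checking is done. */
static uint32_t decode_one(const unsigned char **s)
{
    const unsigned char *p = *s;
    uint32_t cp;
    int extra;

    if (p[0] < 0x80)      { cp = p[0];        extra = 0; }  /* 0xxxxxxx */
    else if (p[0] < 0xE0) { cp = p[0] & 0x1F; extra = 1; }  /* 110xxxxx */
    else if (p[0] < 0xF0) { cp = p[0] & 0x0F; extra = 2; }  /* 1110xxxx */
    else                  { cp = p[0] & 0x07; extra = 3; }  /* 11110xxx */

    /* Each continuation byte (10xxxxxx) contributes 6 more bits */
    for (int i = 1; i <= extra; i++)
        cp = (cp << 6) | (p[i] & 0x3F);

    *s = p + 1 + extra;
    return cp;
}

/* Compare two UTF-8 strings by code point; returns -1, 0 or 1. */
static int compare_codepoints(const char *a, const char *b)
{
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;

    while (*pa && *pb) {
        uint32_t ca = decode_one(&pa);
        uint32_t cb = decode_one(&pb);
        if (ca != cb)
            return ca < cb ? -1 : 1;
    }
    /* One string is a prefix of the other (or both ended) */
    return (*pa != 0) - (*pb != 0);
}

/* Reduce a strcmp-style result to -1, 0 or 1 so two results can be compared. */
static int sign(int x)
{
    return (x > 0) - (x < 0);
}
```

For example, `sign(strcmp("z", "\xC3\xA9"))` and `compare_codepoints("z", "\xC3\xA9")` agree: the bytes C3 A9 encode U+00E9, which sorts after U+007A in both views. This works because strcmp compares bytes as unsigned char.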
Answer by Hans Passant
fgets() can decode UTF-8 encoded files if you use Visual Studio 2005 and up. Change your code like this:
infile = fopen(inname, "r, ccs=UTF-8");
Answer by Thorsten S.
This article presents an encoding and decoding routine and explains how Unicode is encoded:
http://www.codeguru.com/cpp/misc/misc/multi-lingualsupport/article.php/c10451/
It can easily be adapted to C. Simply encode your ANSI string or decode the UTF-8 string and do a byte comparison.
EDIT: After the OP said that it is too hard to rewrite the function from C++, here is a template:
What is needed:
+ Free the allocated memory (or wait till the process ends, or ignore it)
+ Add the 4-byte functions
+ Tell me that short and int are not guaranteed to be 2 and 4 bytes long (I know, but C is really stupid!) and finally
+ Find some other errors
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#define MASKBITS 0x3F
#define MASKBYTE 0x80
#define MASK2BYTES 0xC0
#define MASK3BYTES 0xE0
#define MASK4BYTES 0xF0
#define MASK5BYTES 0xF8
#define MASK6BYTES 0xFC
// Still to be added (see the list above): the encoder for 4-byte input
char* UTF8Encode4BytesUnicode(unsigned int* input);
char* UTF8Encode2BytesUnicode(unsigned short* input)
{
    int size = 0,
        cindex = 0;
    while (input[size] != 0)
        size++;
    // Reserve enough space: at most 3 bytes per 16-bit unit, plus the terminator
    char* result = (char*) malloc(size * 3 + 1);
    if (result == NULL)
        return NULL;
    for (int i=0; i<size; i++)
    {
        // 0xxxxxxx
        if(input[i] < 0x80)
        {
            result[cindex++] = ((char) input[i]);
        }
        // 110xxxxx 10xxxxxx
        else if(input[i] < 0x800)
        {
            result[cindex++] = ((char)(MASK2BYTES | input[i] >> 6));
            result[cindex++] = ((char)(MASKBYTE | input[i] & MASKBITS));
        }
        // 1110xxxx 10xxxxxx 10xxxxxx (every remaining 16-bit value)
        else
        {
            result[cindex++] = ((char)(MASK3BYTES | input[i] >> 12));
            result[cindex++] = ((char)(MASKBYTE | input[i] >> 6 & MASKBITS));
            result[cindex++] = ((char)(MASKBYTE | input[i] & MASKBITS));
        }
    }
    result[cindex] = '\0';
    return result;
}
wchar_t* UTF8Decode2BytesUnicode(char* input)
{
    int size = strlen(input);
    // At most one wchar_t per input byte, plus the terminator
    wchar_t* result = (wchar_t*) malloc((size + 1) * sizeof(wchar_t));
    if (result == NULL)
        return NULL;
    int rindex = 0,
        windex = 0;
    while (rindex < size)
    {
        // Use unsigned bytes so that values >= 0x80 are not sign-extended
        unsigned char* in = (unsigned char*) input + rindex;
        wchar_t ch;
        // 1110xxxx 10xxxxxx 10xxxxxx
        if((in[0] & MASK4BYTES) == MASK3BYTES && rindex + 2 < size)
        {
            ch = ((in[0] & 0x0F) << 12) | (
                  (in[1] & MASKBITS) << 6)
                 | (in[2] & MASKBITS);
            rindex += 3;
        }
        // 110xxxxx 10xxxxxx
        else if((in[0] & MASK3BYTES) == MASK2BYTES && rindex + 1 < size)
        {
            ch = ((in[0] & 0x1F) << 6) | (in[1] & MASKBITS);
            rindex += 2;
        }
        // 0xxxxxxx (any other byte is copied as-is so the loop always advances)
        else
        {
            ch = in[0];
            rindex += 1;
        }
        result[windex++] = ch;
    }
    result[windex] = 0;
    return result;
}
char* getUnicodeToUTF8(wchar_t* myString) {
    int size = sizeof(wchar_t);
    if (size == 1)
        return (char*) myString;
    else if (size == 2)
        return UTF8Encode2BytesUnicode((unsigned short*) myString);
    else
        return UTF8Encode4BytesUnicode((unsigned int*) myString);
}
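The template leaves the 4-byte case open. As a rough sketch of what that step might look like, the function below encodes a single code point (up to U+10FFFF) into one to four UTF-8 bytes. The name `utf8_encode_codepoint` and its signature are hypothetical, not part of the answer; a 4-byte string encoder could call it once per input element.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 into out, which must have room
   for at least 4 bytes. Returns the number of bytes written, or 0 if cp
   cannot be encoded (a surrogate, or a value above U+10FFFF). */
static size_t utf8_encode_codepoint(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not valid code points */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Writing into a caller-supplied buffer avoids the allocation question raised in the list above; the caller decides whether to append to a growing string or a fixed array.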
Answer by elcuco
I know I am bad... but you don't even take the BOM into consideration! Most examples here will fail.
EDIT:
Byte order marks are a few bytes at the beginning of the file that can be used to identify the file's encoding. Some editors add them, and many times they just break things in fabulous ways (I remember fighting a PHP headers problem for several minutes because of this issue).
Some RTFM:
http://en.wikipedia.org/wiki/Byte_order_mark
http://blogs.msdn.com/oldnewthing/archive/2004/03/24/95235.aspx
What is XML BOM and how do I detect it?
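As a rough illustration of identifying an encoding from its byte order mark, the sketch below inspects the first bytes of a buffer. The enum `bom_kind` and the function `detect_bom` are hypothetical names introduced here. Note the assumption baked into the ordering: FF FE 00 00 is treated as UTF-32 LE, although the same bytes could also be a UTF-16 LE file that starts with a NUL character.

```c
#include <stddef.h>

/* Hypothetical result type for this sketch */
enum bom_kind {
    BOM_NONE,
    BOM_UTF8,
    BOM_UTF16_LE,
    BOM_UTF16_BE,
    BOM_UTF32_LE,
    BOM_UTF32_BE
};

/* Inspect the first n bytes of a file. Stores the BOM length in *bom_len
   (0 if there is none) and returns the encoding the BOM indicates. */
static enum bom_kind detect_bom(const unsigned char *buf, size_t n, size_t *bom_len)
{
    *bom_len = 0;

    if (n >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        *bom_len = 3;
        return BOM_UTF8;
    }
    /* The UTF-32 LE BOM begins with the UTF-16 LE BOM, so test it first */
    if (n >= 4 && buf[0] == 0xFF && buf[1] == 0xFE && buf[2] == 0x00 && buf[3] == 0x00) {
        *bom_len = 4;
        return BOM_UTF32_LE;
    }
    if (n >= 4 && buf[0] == 0x00 && buf[1] == 0x00 && buf[2] == 0xFE && buf[3] == 0xFF) {
        *bom_len = 4;
        return BOM_UTF32_BE;
    }
    if (n >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        *bom_len = 2;
        return BOM_UTF16_LE;
    }
    if (n >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        *bom_len = 2;
        return BOM_UTF16_BE;
    }
    return BOM_NONE;
}
```

A caller would typically read the first four bytes of the file, call `detect_bom`, and then seek to offset `*bom_len` before reading the actual text.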
Answer by pm100
Just to settle the BOM argument, here is a file from Notepad:
[paul@paul-es5 tests]$ od -t x1 /mnt/hgfs/cdrive/test.txt
0000000 ef bb bf 61 0d 0a 62 0d 0a 63
0000012
with a BOM at the start.
Personally I don't think there should be a BOM (since it's a byte format), but that's not the point.
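The dump shows the two things a line reader has to cope with in a Notepad UTF-8 file: the three BOM bytes EF BB BF and CR LF line endings. Below is a minimal sketch of a reader that handles both and compares each line as raw bytes. The function name `count_matching_lines` is hypothetical, and the sketch assumes every line fits in a BUFSIZ-sized buffer (longer lines would be split).

```c
#include <stdio.h>
#include <string.h>

/* Read a UTF-8 text file line by line and count the lines that are
   byte-for-byte equal to `wanted` (itself a UTF-8 string).
   Returns the count, or -1 if the file cannot be opened. */
static int count_matching_lines(const char *path, const char *wanted)
{
    FILE *fp = fopen(path, "rb");   /* binary mode: no newline translation */
    char line[BUFSIZ];
    int first = 1;
    int matches = 0;

    if (!fp)
        return -1;

    while (fgets(line, sizeof line, fp)) {
        char *text = line;

        /* Skip the UTF-8 BOM (EF BB BF) if it starts the first line */
        if (first && strncmp(line, "\xEF\xBB\xBF", 3) == 0)
            text += 3;
        first = 0;

        /* Cut the line at the first CR or LF, which drops "\n" or "\r\n" */
        text[strcspn(text, "\r\n")] = '\0';

        if (strcmp(text, wanted) == 0)
            matches++;
    }

    fclose(fp);
    return matches;
}
```

For a file containing exactly the bytes in the dump, `count_matching_lines(path, "b")` returns 1. To look for a non-ASCII string, pass its UTF-8 bytes, for example by saving the C source file as UTF-8 or by writing the bytes with hex escapes.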

