How do I open a file on Linux using a wchar_t* filename that contains non-ASCII characters?

Question

Asked by Cauly

Environment: Gcc/G++ Linux

I have a file with a non-ASCII name in the file system and I want to open it.

I have the filename as a wchar_t*, but I don't know how to open the file with it. (My trusted fopen only accepts a char* filename.)

Please help. Thanks a lot.

Answer 1

Accepted answer by R.. GitHub STOP HELPING ICE

There are two possible answers:

If you want to make sure all Unicode filenames are representable, you can hard-code the assumption that the filesystem uses UTF-8 filenames. This is the "modern" Linux desktop-app approach. Just convert your strings from wchar_t (UTF-32) to UTF-8 with library functions (iconv would work well) or your own implementation (but look up the specs so you don't get it horribly wrong like Shelwien did), then use fopen.
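
A minimal sketch of the iconv route, under a few assumptions: the function names wide_to_utf8 and fopen_wide_utf8 are made up for illustration, and the encoding name "WCHAR_T" is accepted by glibc's iconv but is not guaranteed on every platform.

```c
#include <iconv.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert a wide string to a newly allocated UTF-8
   string. Returns NULL on failure; the caller frees the result. */
char *wide_to_utf8(const wchar_t *w)
{
    /* Assumption: "WCHAR_T" names the platform's wchar_t encoding (true for glibc). */
    iconv_t cd = iconv_open("UTF-8", "WCHAR_T");
    if (cd == (iconv_t)-1)
        return NULL;

    size_t in_left = wcslen(w) * sizeof(wchar_t);
    size_t out_size = wcslen(w) * 4 + 1;   /* UTF-8 needs at most 4 bytes per code point */
    char *out = malloc(out_size);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *in_ptr = (char *)w;              /* iconv takes a non-const char ** */
    char *out_ptr = out;
    size_t out_left = out_size - 1;
    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1) {
        free(out);
        return NULL;
    }
    *out_ptr = '\0';
    return out;
}

/* Hypothetical helper: open a file named by a wide string,
   assuming the filesystem stores names as UTF-8. */
FILE *fopen_wide_utf8(const wchar_t *name, const char *mode)
{
    char *utf8 = wide_to_utf8(name);
    if (utf8 == NULL)
        return NULL;
    FILE *f = fopen(utf8, mode);
    free(utf8);
    return f;
}
```

Because the target encoding is named explicitly, this conversion does not depend on the process locale.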

If you want to do things the more standards-oriented way, you should use wcsrtombs to convert the wchar_t string to a multibyte char string in the locale's encoding (which hopefully is UTF-8 anyway on any modern system) and use fopen. Note that this requires that you previously set the locale with setlocale(LC_CTYPE, "") or setlocale(LC_ALL, "").
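
The locale-based route could look roughly like the sketch below. The names wide_to_multibyte and fopen_wide_locale are hypothetical, and the code assumes the program has already called setlocale(LC_CTYPE, "") or setlocale(LC_ALL, "") at startup.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert a wide string to a newly allocated multibyte
   string in the current locale's encoding. Returns NULL if a character
   cannot be represented in that encoding or if allocation fails. */
char *wide_to_multibyte(const wchar_t *w)
{
    mbstate_t state;
    const wchar_t *src = w;

    /* With a NULL destination, wcsrtombs only measures the required length. */
    memset(&state, 0, sizeof state);
    size_t len = wcsrtombs(NULL, &src, 0, &state);
    if (len == (size_t)-1)
        return NULL;

    char *buf = malloc(len + 1);
    if (buf == NULL)
        return NULL;

    src = w;
    memset(&state, 0, sizeof state);
    if (wcsrtombs(buf, &src, len + 1, &state) == (size_t)-1) {
        free(buf);
        return NULL;
    }
    return buf;
}

/* Hypothetical helper: open a file named by a wide string. */
FILE *fopen_wide_locale(const wchar_t *name, const char *mode)
{
    char *mb = wide_to_multibyte(name);
    if (mb == NULL)
        return NULL;
    FILE *f = fopen(mb, mode);
    free(mb);
    return f;
}
```

If setlocale was never called, the program runs in the "C" locale, where non-ASCII characters generally cannot be converted and the helper returns NULL.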

And finally, not exactly an answer but a recommendation:

Storing filenames as wchar_t strings is probably a horrible mistake. You should instead store filenames as abstract byte strings, and only convert those to wchar_t just-in-time for displaying them in the user interface (if it's even necessary for that; many UI toolkits use plain byte strings themselves and do the interpretation as characters for you). This way you eliminate a lot of possible nasty corner cases, and you never encounter a situation where some files are inaccessible due to their names.
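
One way to follow this recommendation is sketched below: the filename stays a byte string everywhere, and a display-only wide copy is produced on demand. The function name filename_for_display and the choice of L'?' as a placeholder are illustrative assumptions.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: build a display-only wide copy of a filename that is
   stored as bytes. Bytes that are not valid in the locale's encoding become
   L'?', so the copy may be lossy; the original bytes remain the real name
   and are what gets passed to fopen.
   Returns the number of wide characters written, excluding the terminator. */
size_t filename_for_display(const char *name, wchar_t *out, size_t out_len)
{
    mbstate_t state;
    memset(&state, 0, sizeof state);
    const char *p = name;
    size_t remaining = strlen(name);
    size_t n = 0;

    while (remaining > 0 && n + 1 < out_len) {
        wchar_t wc;
        size_t used = mbrtowc(&wc, p, remaining, &state);
        if (used == (size_t)-1 || used == (size_t)-2) {
            /* Invalid or truncated sequence: emit a placeholder, skip one byte. */
            wc = L'?';
            used = 1;
            memset(&state, 0, sizeof state);
        } else if (used == 0) {
            break;  /* decoded a null character; should not happen within strlen bytes */
        }
        out[n++] = wc;
        p += used;
        remaining -= used;
    }
    if (out_len > 0)
        out[n] = L'\0';
    return n;
}
```

The point of the design is that a name the UI cannot render is still a name the program can open, because the byte string is never round-tripped through wchar_t.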

Answer 2

Answered by Peon the Great

Check out this document

http://www.firstobject.com/wchar_t-string-on-linux-osx-windows.htm

I think Linux follows the POSIX standard, which treats all file names as UTF-8.

Answer 3

Answered by metamatt

I take it it's the name of the file that contains non-ascii characters, not the file itself, when you say "non-ascii file in file system". It doesn't really matter what the file contains.

You can do this with normal fopen, but you'll have to match the encoding the filesystem uses.

It depends on what version of Linux and what filesystem you're using and how you've set it up, but likely, if you're lucky, the filesystem uses UTF-8. So take your wchar_t string (which is probably a UTF-16 encoded string?), convert it to a char string encoded in UTF-8, and pass that to fopen.
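
Whether wchar_t is UTF-16 or UTF-32 depends on the platform. On Linux with glibc it is 32 bits wide and holds UTF-32 code points, while 16-bit wchar_t is the Windows convention. A small check along these lines can confirm what a given build uses; the function names wchar_bits and wchar_is_unicode are illustrative.

```c
#include <limits.h>
#include <wchar.h>

/* Number of bits in one wchar_t on this platform. */
int wchar_bits(void)
{
    return (int)(sizeof(wchar_t) * CHAR_BIT);
}

/* Returns 1 if the implementation declares (via the standard macro
   __STDC_ISO_10646__) that wchar_t values are ISO 10646 (Unicode)
   code points, and 0 if it makes no such promise. */
int wchar_is_unicode(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}
```

A result of 32 bits together with the macro being defined means each wchar_t is a full code point, so no surrogate-pair handling is needed when converting to UTF-8.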

Answer 4

Answered by Shelwien

Convert wchar string to utf8 char string, then use fopen.

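#include <wchar.h>

typedef unsigned int   uint;
typedef unsigned char  byte;

// Function names are kept from the original post. On Linux wchar_t is 32 bits
// wide and holds one Unicode code point (UTF-32), so each element is read as
// a whole code point rather than as a 16-bit unit.

// Encode a null-terminated wide string as UTF-8.
// s needs room for 4 bytes per input character plus the terminator.
// Returns the number of bytes written, excluding the terminator.
int UTF16to8( const wchar_t* w, char* s ) {
  uint  c;
  byte* q = (byte*)s; byte* q0 = q;
  while( 1 ) {
    c = (uint)*w++;
    if( c==0 ) break;
    // surrogates and values beyond U+10FFFF are not valid code points
    if( c>0x10FFFF || (c>=0xD800 && c<=0xDFFF) ) c = 0xFFFD;
    if( c<0x080 ) *q++ = c; else
      if( c<0x800 ) *q++ = 0xC0+(c>>6), *q++ = 0x80+(c&63); else
        if( c<0x10000 ) *q++ = 0xE0+(c>>12), *q++ = 0x80+((c>>6)&63), *q++ = 0x80+(c&63); else
          *q++ = 0xF0+(c>>18), *q++ = 0x80+((c>>12)&63), *q++ = 0x80+((c>>6)&63), *q++ = 0x80+(c&63);
  }
  *q = 0;
  return q-q0;
}

// Decode a null-terminated UTF-8 string into wide characters.
// w needs room for one wchar_t per input byte plus the terminator.
// Malformed input is not fully validated (overlong and truncated
// sequences are not rejected); stray continuation bytes are skipped.
// Returns the number of wide characters written, excluding the terminator.
int UTF8to16( const char* s, wchar_t* w ) {
  uint  cache=0,wait=0,c;
  const byte* p = (const byte*)s;
  wchar_t* q = w; wchar_t* q0 = q;
  while(1) {
    c = *p++;
    if( c==0 ) break;
    if( c<0x80 ) cache=c,wait=0; else
      if( c>=0xF0 ) cache=c&7,wait=3; else      // lead byte of a 4-byte sequence
        if( c>=0xE0 ) cache=c&15,wait=2; else   // lead byte of a 3-byte sequence
          if( c>=0xC0 ) cache=c&31,wait=1; else // lead byte of a 2-byte sequence
            if( wait ) cache=(cache<<6)|(c&63),wait--; else
              continue;                         // stray continuation byte
    if( wait==0 ) *q++=cache;
  }
  *q = 0;
  return q-q0;
}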

Answer 5

Answered by DigitalRoss

Linux is not UTF-8, but it's your only choice for filenames anyway

(Files can have anything you want inside them.)

With respect to filenames, Linux does not really have a string encoding to worry about. Filenames are byte strings that need to be null-terminated.

This doesn't precisely mean that Linux is UTF-8, but it does mean that it's not compatible with wide characters as they could have a zero in a byte that's not the end byte.

But UTF-8 preserves the no-nulls-except-at-the-end model, so I have to believe that the practical approach is "convert to UTF-8" for filenames.
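
The embedded-zero problem is easy to see by looking at the raw bytes of a wide string. The sketch below is only an illustration; the function names zero_bytes_in_wide and visible_to_byte_api are made up.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Count the zero bytes in the raw memory of a wide string,
   not counting the terminating wide null. */
size_t zero_bytes_in_wide(const wchar_t *w)
{
    const unsigned char *bytes = (const unsigned char *)w;
    size_t total = wcslen(w) * sizeof(wchar_t);
    size_t zeros = 0;
    for (size_t i = 0; i < total; i++)
        if (bytes[i] == 0)
            zeros++;
    return zeros;
}

/* How many bytes of the raw wide buffer a byte-oriented API such as fopen
   would treat as the name before hitting the first zero byte. */
size_t visible_to_byte_api(const wchar_t *w)
{
    return strlen((const char *)w);
}
```

For an ASCII letter stored in a wchar_t, every byte except one is zero, so a byte-oriented call sees at most a single byte of the name. The UTF-8 encoding of the same text contains no zero bytes at all.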

The content of files is a matter for standards above the Linux kernel level, so here there isn't anything Linux-y that you can or want to do. The content of files will be solely the concern of the programs that read and write them. Linux just stores and returns the byte stream, and it can have all the embedded nuls you want.

Answer 6

Answered by Tanzer

// locals
string file_to_read;           // path of the file to read
wstring file;                  // receives the file contents, ascii or non-ascii
FILE *stream;
wchar_t buffer = 0;            // zero-initialised so unused high bytes stay 0

if( fopen_s( &stream, file_to_read.c_str(), "r+b" ) == 0 )   // in binary mode
  {
      // if ascii file second arg must be sizeof(char). if non ascii file sizeof( wchar_t)
      // loop on fread's return value instead of feof(), so the last
      // character is not appended twice
      while( fread( & buffer, sizeof( char ), 1, stream ) == 1 )
      {
        file.append(1, buffer);
      }
      fclose(stream);          // only close a stream that was actually opened
  }

// and the file is in wstring format. You can use it in any C++ wstring operation
// this code is fast enough i think, at least in my practice
// for windows because of fopen_s
