Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep it under the same CC BY-SA license and attribute it to the original authors (not me), citing the original address and author information. Original Stack Overflow question: http://stackoverflow.com/questions/166503/

Time: 2020-09-15 11:23:08  Source: igfitidea

UTF-8 in Windows

Tags: c, windows, winapi, unicode, utf-8

Asked by Michael Platings

How do I set the code page to UTF-8 in a C Windows program?

I have a third-party library that uses fopen to open files. I can use wcstombs to convert my Unicode filenames to the current code page; however, if a filename contains a character outside that code page, the conversion breaks.
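
The failure described here can be detected before the filename ever reaches the library. The following is a minimal sketch, not code from the original post: FitsCurrentCodePage is a hypothetical helper name, and the result depends on whichever locale the program has selected with setlocale. It relies on wcstombs reporting (size_t)-1 when a wide character has no representation in the current multibyte encoding.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical helper (not from the original post): returns true when every
// character of `name` can be represented in the current C locale's multibyte
// encoding, i.e. when wcstombs would succeed.
bool FitsCurrentCodePage(const wchar_t* name)
{
    // With a null destination, wcstombs stores nothing and only reports the
    // number of bytes needed, or (size_t)-1 if a character cannot be converted.
    return std::wcstombs(nullptr, name, 0) != static_cast<std::size_t>(-1);
}
```

A caller could use this check to decide whether the plain fopen path is safe or whether a workaround is needed for that particular file.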

Ideally I would just call _setmbcp(65001) to set the code page to UTF-8; however, the MSDN documentation for _setmbcp states that UTF-8 is not supported.

How can I get around this?

Accepted answer by efotinis

Unfortunately, there is no way to make Unicode the current codepage in Windows. The CP_UTF7 and CP_UTF8 constants are pseudo-codepages, used only in the MultiByteToWideChar and WideCharToMultiByte conversion functions, as Ben mentioned.

Your problem is similar to that of the fstream C++ classes. The fstream constructors accept only char* names, making it impossible to open a file with a true Unicode name. The only solution offered by VC was a hack: open the file separately and then set the handle to the stream object. I'm afraid this isn't an option for you, since the third-party library probably doesn't accept handles.

The only solution I can think of is to create a temporary file with a non-Unicode name, which is hard-linked to the original, and use that as a parameter.

Answer by Ben Straub

All Windows APIs think in UTF-16, so you're better off writing a wrapper around your library that converts at the boundaries.

Oddly enough, Windows thinks UTF-8 is a codepage for the purposes of conversion, so you use the same APIs as you would to convert between codepages:

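#include <string>
#include <windows.h>

std::wstring Utf8ToUtf16(const char* u8string)
{
    // First call: ask for the required length in wchar_t units,
    // including the terminating null (because the input length is -1).
    int wcharcount = MultiByteToWideChar(CP_UTF8, 0, u8string, -1, NULL, 0);
    if (wcharcount <= 0)
        return std::wstring();  // conversion failed
    std::wstring w(wcharcount, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, u8string, -1, &w[0], wcharcount);
    w.resize(wcharcount - 1);  // drop the terminating null written by the API
    return w;
}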

And something of similar form to convert back.
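
For completeness, the reverse direction might look like the sketch below. Utf16ToUtf8 is a hypothetical name chosen to mirror the function above, not something the answer supplies. It uses the documented WideCharToMultiByte API, whose last two arguments must be NULL when the code page is CP_UTF8. The whole function is guarded by _WIN32 because that API exists only on Windows.

```cpp
#include <string>
#ifdef _WIN32  // WideCharToMultiByte exists only on Windows
#include <windows.h>

// Hypothetical counterpart to Utf8ToUtf16: converts a null-terminated UTF-16
// string to UTF-8. Returns an empty string if the conversion fails.
std::string Utf16ToUtf8(const wchar_t* u16string)
{
    // First call: ask for the required size in bytes, including the
    // terminating null (because the input length is -1).
    int bytecount = WideCharToMultiByte(CP_UTF8, 0, u16string, -1, NULL, 0, NULL, NULL);
    if (bytecount <= 0)
        return std::string();
    std::string s(bytecount, '\0');
    WideCharToMultiByte(CP_UTF8, 0, u16string, -1, &s[0], bytecount, NULL, NULL);
    s.resize(bytecount - 1);  // drop the terminating null written by the API
    return s;
}
#endif
```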

Answer by Arthur2e5

2018 update: Windows 10 has made the "65001" code page less "pseudo" in two steps:

  1. conhost changes: Windows Subsystem for Linux uses code page 65001 for its consoles. It is also possible to run chcp 65001 in cmd.exe since WSL. (It has caused some pretty dumb Python bugs.)
  2. full-featured locale: Windows since build 17035 allows setting UTF-8 as the locale codepage. This is available from the April 2018 update.

Answer by R.. GitHub STOP HELPING ICE

Use cygwin (which provides a UTF-8 locale by default), or write your own libc hack for Windows that does the necessary UTF-8 to UTF-16 translations and wraps the nonstandard _wfopen etc. functions.
