C++: properly printing UTF-8 characters in the Windows console

Notice: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the original URL and authors, and attribute the content to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/10882277/

Date: 2020-08-27 14:34:53  Source: igfitidea

Properly print utf8 characters in windows console

Tags: c++, utf-8, console, mingw, windows-xp-sp3

Asked by rsk82

This is the way I try to do it:

#include <stdio.h>
#include <windows.h>
using namespace std;

int main() {
  SetConsoleOutputCP(CP_UTF8);
  // German characters won't appear
  char const* text = "aäbcdefghijklmnoöpqrsßtuüvwxyz";
  int len = MultiByteToWideChar(CP_UTF8, 0, text, -1, 0, 0);
  wchar_t *unicode_text = new wchar_t[len];
  MultiByteToWideChar(CP_UTF8, 0, text, -1, unicode_text, len);
  wprintf(L"%s", unicode_text);
}

The effect is that only US-ASCII characters are displayed. No errors are shown. The source file is encoded in UTF-8.

So, what am I doing wrong here?

To WouterH:

int main() {
  SetConsoleOutputCP(CP_UTF8);
  const wchar_t *unicode_text = L"aäbcdefghijklmnoöpqrsßtuüvwxyz";
  wprintf(L"%s", unicode_text);
}
  • This also doesn't work. The effect is just the same. My font is, of course, Lucida Console.

Third take:

#include <stdio.h>
#define _WIN32_WINNT 0x05010300
#include <windows.h>
#define _O_U16TEXT  0x20000
#include <fcntl.h>

using namespace std;

int main() {
    _setmode(_fileno(stdout), _O_U16TEXT);
    const wchar_t *u_text = L"aäbcdefghijklmnoöpqrsßtuüvwxyz";
    wprintf(L"%s", u_text);
}

OK, something begins to work, but the output is: ańbcdefghijklmno÷pqrs?tu?vwxyz.

Answered by bames53

By default, the wide print functions on Windows do not handle characters outside the ASCII range.

There are a few ways to get Unicode data to the Windows console.

  • Use the console API directly (WriteConsoleW). You'll have to ensure you're actually writing to a console and use other means when the output goes somewhere else.

  • Set the mode of the standard output file descriptor to one of the 'Unicode' modes, _O_U16TEXT or _O_U8TEXT. This causes the wide character output functions to correctly output Unicode data to the Windows console. If they're used on file descriptors that don't represent a console, then they cause the output stream of bytes to be UTF-16 and UTF-8 respectively. N.B. after setting these modes the non-wide character functions on the corresponding stream are unusable and result in a crash. You must use only the wide character functions.

  • UTF-8 text can be printed directly to the console by setting the console output codepage to CP_UTF8, if you use the right functions. Most of the higher level functions such as basic_ostream<char>::operator<<(char*) don't work this way, but you can either use lower level functions or implement your own ostream that works around the problem the standard functions have.
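
As a rough illustration of the first option, the sketch below checks whether stdout really is a console and only then goes through WriteConsoleW; otherwise it falls back to writing the raw UTF-8 bytes. The helper name write_utf8 is made up for this example and is not part of the answer, and the #ifdef is only there so the sketch also compiles on other platforms.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper (not from the answer): writes UTF-8 text to stdout.
// Returns true if the whole text was written.
bool write_utf8(const std::string& utf8) {
    if (utf8.empty()) return true;
#ifdef _WIN32
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode = 0;
    // GetConsoleMode succeeds only for a real console handle, so it doubles
    // as the "am I actually writing to a console?" check.
    if (out != INVALID_HANDLE_VALUE && GetConsoleMode(out, &mode)) {
        const int bytes = static_cast<int>(utf8.size());
        const int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), bytes,
                                             nullptr, 0);
        if (wlen <= 0) return false;
        std::wstring wide(static_cast<std::size_t>(wlen), L'\0');
        MultiByteToWideChar(CP_UTF8, 0, utf8.data(), bytes, &wide[0], wlen);
        // WriteConsoleW bypasses the stdio buffer, so flush pending output
        // first to keep the ordering intact.
        std::fflush(stdout);
        DWORD written = 0;
        return WriteConsoleW(out, wide.data(), static_cast<DWORD>(wlen),
                             &written, nullptr) != 0 &&
               written == static_cast<DWORD>(wlen);
    }
#endif
    // Redirected output (file or pipe) or a non-Windows system: raw bytes.
    return std::fwrite(utf8.data(), 1, utf8.size(), stdout) == utf8.size();
}
```

A call such as write_utf8("a\xC3\xA4\n") would then print "aä" on a console and emit the same three UTF-8 bytes plus a newline when the output is redirected.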

The problem with the third method is this:

putchar('\302'); putchar('\260'); // the two UTF-8 bytes of U+00B0 written separately: doesn't work with CP_UTF8

puts("\302\260"); // correctly writes UTF-8 data to Windows console with CP_UTF8

Unlike most operating systems, the console on Windows is not simply another file that accepts a stream of bytes. It's a special device created and owned by the program and accessed via its own unique WIN32 API. The issue is that when the console is written to, the API sees exactly the extent of the data passed in that use of its API, and the conversion from narrow characters to wide characters occurs without considering that the data may be incomplete. When a multibyte character is passed using more than one call to the console API, each separately passed piece is seen as an illegal encoding, and is treated as such.

It ought to be easy enough to work around this, but the CRT team at Microsoft views it as not their problem whereas whatever team works on the console probably doesn't care.

You might solve it by implementing your own streambuf subclass that handles the conversion to wchar_t correctly, i.e. one that accounts for the fact that the bytes of a multibyte character may arrive separately and that maintains conversion state between writes (e.g., with std::mbstate_t).
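
A minimal sketch of that idea follows. Note that it swaps in a simpler technique than the one the answer describes: instead of converting to wchar_t with std::mbstate_t, it holds back any incomplete trailing UTF-8 sequence and forwards only complete characters in a single fwrite. The names complete_prefix_length and Utf8ChunkBuf are invented for this example, and the sketch assumes that one fwrite followed by a flush reaches the console as one piece, which may depend on the CRT in use.

```cpp
#include <cstddef>
#include <cstdio>
#include <streambuf>
#include <string>

// Length of the longest prefix of buf that does not end in the middle of a
// UTF-8 sequence. Assumes the input is otherwise valid UTF-8.
std::size_t complete_prefix_length(const std::string& buf) {
    const std::size_t n = buf.size();
    std::size_t i = n;
    std::size_t back = 0;
    // Step back over at most three trailing continuation bytes (10xxxxxx).
    while (i > 0 && back < 3 &&
           (static_cast<unsigned char>(buf[i - 1]) & 0xC0) == 0x80) {
        --i;
        ++back;
    }
    if (i == 0) return n;  // no lead byte found: pass the data through
    const unsigned char lead = static_cast<unsigned char>(buf[i - 1]);
    std::size_t need;
    if (lead < 0x80) need = 1;
    else if ((lead & 0xE0) == 0xC0) need = 2;
    else if ((lead & 0xF0) == 0xE0) need = 3;
    else if ((lead & 0xF8) == 0xF0) need = 4;
    else return n;  // not a valid lead byte: do not hold anything back
    const std::size_t have = n - (i - 1);
    return have >= need ? n : i - 1;
}

// Hypothetical streambuf that forwards only complete UTF-8 sequences.
class Utf8ChunkBuf : public std::streambuf {
public:
    explicit Utf8ChunkBuf(std::FILE* out) : out_(out) {}

protected:
    int_type overflow(int_type ch) override {
        if (traits_type::eq_int_type(ch, traits_type::eof()))
            return traits_type::not_eof(ch);
        pending_.push_back(traits_type::to_char_type(ch));
        return flush_complete() ? ch : traits_type::eof();
    }
    std::streamsize xsputn(const char* s, std::streamsize n) override {
        pending_.append(s, static_cast<std::size_t>(n));
        return flush_complete() ? n : 0;
    }
    int sync() override {
        return (flush_complete() && std::fflush(out_) == 0) ? 0 : -1;
    }

private:
    // Writes every complete character collected so far in one call.
    bool flush_complete() {
        const std::size_t len = complete_prefix_length(pending_);
        if (len == 0) return true;
        const std::size_t written = std::fwrite(pending_.data(), 1, len, out_);
        pending_.erase(0, written);
        return written == len && std::fflush(out_) == 0;
    }
    std::FILE* out_;
    std::string pending_;  // bytes of a not yet complete sequence
};
```

To use it, one would construct Utf8ChunkBuf buf(stdout), wrap it in a std::ostream (from <ostream>), and write through that stream after setting the console code page to CP_UTF8.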

Answered by huysentruitw

Another trick, instead of SetConsoleOutputCP, would be using _setmode on stdout:

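// Includes needed for wprintf(), _fileno() and _setmode()
#include <stdio.h>
#include <io.h>
#include <fcntl.h>

int main() {
    // Switch stdout to UTF-16 mode; use only wide-character output afterwards
    _setmode(_fileno(stdout), _O_U16TEXT);
    const wchar_t * unicode_text = L"aäbcdefghijklmnoöpqrsßtuüvwxyz";
    wprintf(L"%s", unicode_text);
    return 0;
}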

Don't forget to remove the call to SetConsoleOutputCP(CP_UTF8);

Answered by vladasimovic

// Save as UTF-8 without signature (BOM)
#include <stdio.h>
#include <windows.h>
int main() {
  SetConsoleOutputCP(65001); // 65001 is CP_UTF8
  const char unicode_text[] = "aäbcdefghijklmnoöpqrsßtuüvwxyz";
  printf("%s\n", unicode_text);
}

Result:
aäbcdefghijklmnoöpqrsßtuüvwxyz

Answered by Jan Turoň

The console can be set to display UTF-8 characters: as @vladasimovic answers, SetConsoleOutputCP(CP_UTF8) can be used for that. Alternatively, you can prepare your console with the DOS command chcp 65001 or with the call system("chcp 65001 > nul") in the main program. Don't forget to save the source code in UTF-8 as well.

To check the UTF-8 support, run

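#include <stdio.h>
#include <windows.h>

// Callback invoked once per code page; cp is the code page number as text
BOOL CALLBACK showCPs(LPSTR cp) {
  puts(cp);
  return TRUE; // keep enumerating
}

int main() {
  // Use the ANSI variant explicitly so the callback receives char strings
  EnumSystemCodePagesA(showCPs, CP_SUPPORTED);
}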

65001 should appear in the list.

The Windows console uses OEM code pages by default, and most default raster fonts support only national characters. Windows XP and newer also support TrueType fonts, which should display the missing characters (@Devenec suggests Lucida Console in his answer).

Why printf fails

As @bames53 points out in his answer, the Windows console is not a stream device; you need to write all bytes of a multibyte character at once. Sometimes printf messes this up by putting the bytes into the output buffer one by one. Try using sprintf and then puts on the result, or call fflush only once the whole output has accumulated in the buffer.
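
The "format first, then write once" advice could look roughly like the sketch below. The helper name print_in_one_call is invented here, it uses vsnprintf and fwrite rather than the sprintf and puts pair mentioned above, and it assumes a C99-conforming vsnprintf that reports the required length when given a null buffer.

```cpp
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: formats like printf, but hands the finished text to
// stdout in a single fwrite call and then flushes.
// Returns the number of bytes written, or -1 on a formatting or write error.
int print_in_one_call(const char* format, ...) {
    va_list args;
    va_start(args, format);
    va_list copy;
    va_copy(copy, args);
    // First pass only measures how many bytes the formatted text needs.
    const int needed = std::vsnprintf(nullptr, 0, format, copy);
    va_end(copy);
    if (needed < 0) {
        va_end(args);
        return -1;
    }
    std::string text(static_cast<std::size_t>(needed) + 1, '\0');
    std::vsnprintf(&text[0], text.size(), format, args);
    va_end(args);
    text.resize(static_cast<std::size_t>(needed));  // drop the trailing NUL
    if (std::fwrite(text.data(), 1, text.size(), stdout) != text.size())
        return -1;
    std::fflush(stdout);
    return needed;
}
```

Because the whole string is assembled before anything is written, no multibyte character is split across separate writes by the formatting step itself.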

If everything fails

Note the UTF-8 format: one character is encoded as 1 to 4 bytes (older definitions of UTF-8 allowed up to 6). Use this function to shift to the next character in the string:

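// Advances str by len UTF-8 characters (stops at the terminating NUL)
const char* ucshift(const char* str, int len=1) {
  for(int i=0; i<len; ++i) {
    if(*str==0) return str;
    unsigned char c = *str;       // unsigned, so the test works even where char is unsigned
    if(c&128) {                   // lead byte of a multibyte sequence
      while((c<<=1)&128) ++str;   // skip one byte per additional leading 1 bit
    }
    ++str;
  }
  return str;
}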

...and this function to turn the bytes into a Unicode code point number:

int ucchar(const char* str) {
  if(!(*str&128)) return *str;
  unsigned char c = *str, bytes = 0;
  while((c<<=1)&128) ++bytes;
  int result = 0;
  for(int i=bytes; i>0; --i) result|= (*(str+i)&127)<<(6*(bytes-i));
  int mask = 1;
  for(int i=bytes; i<6; ++i) mask<<= 1, mask|= 1;
  result|= (*str&mask)<<(6*bytes);
  return result;
}

Then you can try to use some wild/ancient/non-standard WinAPI function like MultiByteToWideChar (don't forget to call setlocale() first!)

or you can use your own mapping from Unicode table to your active working codepage. Example:

int main() {
  system("chcp 65001 > nul");
  char str[] = "příšerně"; // file saved in UTF-8
  for(const char* p=str; *p!=0; p=ucshift(p)) {
    int c = ucchar(p);
    if(c<128) printf("%c\n",c);
    else printf("%d\n",c);
  }
}

This should print

p
345
237
353
e
r
n
283

If your codepage doesn't support those Czech diacritics, you could map 345=>r, 237=>i, 353=>s, 283=>e. There are at least 5(!) different charsets just for Czech. Displaying readable characters across different Windows locales is a nightmare.
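
Such a mapping can be sketched as a small lookup table. The function name ascii_fallback and the choice of '?' for unmapped code points are assumptions made for this example; only the four pairs come from the answer.

```cpp
// Hypothetical helper: returns an ASCII stand-in for a Unicode code point.
// ASCII passes through unchanged; unmapped code points become '?'.
char ascii_fallback(int code_point) {
    struct Entry {
        int code_point;
        char ascii;
    };
    static const Entry table[] = {
        {345, 'r'},  // U+0159, r with caron
        {237, 'i'},  // U+00ED, i with acute
        {353, 's'},  // U+0161, s with caron
        {283, 'e'},  // U+011B, e with caron
    };
    if (code_point >= 0 && code_point < 128)
        return static_cast<char>(code_point);
    for (const Entry& entry : table) {
        if (entry.code_point == code_point) return entry.ascii;
    }
    return '?';
}
```

In the loop above, printing ascii_fallback(c) for every decoded code point would give a plain ASCII approximation of the Czech word.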

Answered by Matthew

I had similar problems, but none of the existing answers worked for me. Something else I observed is that if I put UTF-8 characters in a plain string literal, they would print properly, but if I tried to use a UTF-8 literal (u8"text"), the characters got butchered by the compiler (proved by printing out their numeric values one byte at a time; the plain literal had the correct UTF-8 bytes, as verified on a Linux machine, but the UTF-8 literal was garbage).
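
The byte-by-byte check mentioned here can be reproduced with a small helper like the one below. The name hex_bytes is made up for this sketch; the answer does not show the code that was actually used.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical helper: renders each byte of a NUL-terminated string as two
// lowercase hex digits, separated by spaces.
std::string hex_bytes(const char* text) {
    std::string out;
    char digits[4];
    const std::size_t length = std::strlen(text);
    for (std::size_t i = 0; i < length; ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        const unsigned value = static_cast<unsigned char>(text[i]);
        std::snprintf(digits, sizeof digits, "%02x", value);
        if (!out.empty()) out += ' ';
        out += digits;
    }
    return out;
}
```

For a correctly encoded "ä" the result should be "c3 a4"; comparing the plain literal with the u8 literal this way shows which one the compiler re-encoded. From C++20 on, a u8 literal has type const char8_t[], so it would need a reinterpret_cast to const char* before being passed in.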

After some poking around, I found the solution: /utf-8, the Visual C++ compiler option that sets both the source and execution character sets to UTF-8. With that, everything Just Works; my sources are UTF-8, I can use explicit UTF-8 literals, and output works with no other changes needed.

Answered by Devenec

I solved the problem in the following way:

Lucida Console doesn't seem to support umlauts, so changing the console font to Consolas, for example, works.

#include <stdio.h>
#include <Windows.h>

int main()
{
    SetConsoleOutputCP(CP_UTF8);

    // I'm using Visual Studio, so encoding the source file in UTF-8 won't work
    const char* message = "a" "\xC3\xA4" "bcdefghijklmno" "\xC3\xB6" "pqrs" "\xC3\x9F" "tu" "\xC3\xBC" "vwxyz";

    // Note the capital S in the first argument, when used with wprintf it
    // specifies a single-byte or multi-byte character string (at least on
    // Visual C, not sure about the C library MinGW is using)
    wprintf(L"%S", message);
}

EDIT: fixed stupid typos and the decoding of the string literal, sorry about those.

Answered by Henrik Haftmann

UTF-8 doesn't work for the Windows console. Period. I have tried all combinations with no success. Problems arise because the ANSI and OEM code pages assign characters differently, so some answers say that there is no problem, but such answers may come from programmers who use plain 7-bit ASCII or whose ANSI and OEM code pages are identical (Chinese, Japanese).

Either you stick to UTF-16 and the wide-char functions (but you are still restricted to the 256 characters of your OEM code page, except for Chinese/Japanese), or you use strings encoded in the OEM code page in your source file.

Yes, it is a complete mess.

For multilingual programs I use string resources, and I wrote a LoadStringOem() function that auto-translates the UTF-16 resource to an OEM string using WideCharToMultiByte() without an intermediate buffer. As Windows auto-selects the right language out of the resource, it will hopefully load a string in a language that is convertible to the target OEM code page.

As a consequence, you should not use 8-bit typographic characters (such as the ellipsis … and curly quotes “”) in the English-US language resource, because English-US is what Windows selects when no language match has been detected (i.e. the fallback). For example, if you have resources in German, Czech, Russian, and English-US, and the user's language is Chinese, he or she will see English plus garbage instead of the nice typography you put into the text.

Now, on Windows 7 and 10, SetConsoleOutputCP(65001/*aka CP_UTF8*/) works as expected. You should keep your source file in UTF-8 without BOM; otherwise, your string literals will be recoded to ANSI by the compiler. Moreover, the console font must contain the desired characters, and must not be "Terminal". Unfortunately, there is no font covering both umlauts and Chinese characters, even when you install both language packs, so you cannot truly display all character shapes at once.
