Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must do so under the same CC BY-SA terms, link to the original address, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/3130979/
How to Output Unicode Strings on the Windows Console
Asked by Philipp
There are already a few questions relating to this problem. I think my question is a bit different because I don't have an actual problem; I'm only asking out of academic interest. I know that Windows's implementation of UTF-16 sometimes contradicts the Unicode standard (e.g. collation) or is closer to the old UCS-2 than to UTF-16, but I'll keep the “UTF-16” terminology here for simplicity.
Background: In Windows, everything is UTF-16. Regardless of whether you're dealing with the kernel, the graphics subsystem, the filesystem or whatever, you're passing UTF-16 strings. There are no locales or charsets in the Unix sense. For compatibility with medieval versions of Windows, there is a thing called “codepages” that is obsolete but nonetheless supported. AFAIK, there is only one correct and non-obsolete function to write strings to the console, namely WriteConsoleW, which takes a UTF-16 string. Also, a similar discussion applies to input streams, which I'll ignore here as well.
However, I think this represents a design flaw in the Windows API: there is a generic function that can be used to write to all stream objects (files, pipes, consoles…) called WriteFile, but this function is byte-oriented and doesn't accept UTF-16 strings. The documentation suggests using WriteConsoleW for console output, which is text-oriented, and WriteFile for everything else, which is byte-oriented. Since both console streams and file objects are represented by kernel object handles and console streams can be redirected, you have to call a function on every write to a standard output stream to check whether the handle represents a console stream or a file, which breaks polymorphism. OTOH, I do think that Windows's separation between text strings and raw bytes (which is mirrored in many other systems such as Java or Python) is conceptually superior to Unix's char* approach, which ignores encodings and doesn't distinguish between strings and byte arrays.
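To make the check described above concrete, here is a minimal sketch (an illustration, not part of the question): probe the standard output handle with GetConsoleMode, which only succeeds for real console handles, and dispatch to WriteConsoleW or WriteFile accordingly. The UTF-8 re-encoding in the redirected branch is an assumption chosen for the example; any byte encoding could be used there.

```cpp
#include <windows.h>
#include <string>

// Write a UTF-16 string to standard output, using the text-oriented console
// API when stdout is a real console and the byte-oriented WriteFile otherwise.
void WriteToStdout(const std::wstring& text) {
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode = 0;
    if (GetConsoleMode(out, &mode)) {
        // Real console: text-oriented API, takes UTF-16 directly.
        DWORD written = 0;
        WriteConsoleW(out, text.c_str(),
                      static_cast<DWORD>(text.size()), &written, nullptr);
    } else {
        // Redirected to a file or pipe: encode to bytes first (UTF-8 here).
        int bytes = WideCharToMultiByte(CP_UTF8, 0, text.c_str(),
                                        static_cast<int>(text.size()),
                                        nullptr, 0, nullptr, nullptr);
        std::string utf8(bytes, '\0');
        WideCharToMultiByte(CP_UTF8, 0, text.c_str(),
                            static_cast<int>(text.size()),
                            &utf8[0], bytes, nullptr, nullptr);
        DWORD written = 0;
        WriteFile(out, utf8.data(), static_cast<DWORD>(utf8.size()),
                  &written, nullptr);
    }
}
```

GetConsoleMode doubling as the "is this a console?" test is exactly the per-write branching the question complains about.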
So my questions are: What to do in this situation? And why isn't this problem solved even in Microsoft's own libraries? Both the .NET Framework and the C and C++ libraries seem to adhere to the obsolete codepage model. How would you design the Windows API or an application framework to circumvent this issue?
I think that the general problem (which is not easy to solve) is that all libraries assume that all streams are byte-oriented, and implement text-oriented streams on top of that. However, we see that Windows does have special text-oriented streams at the OS level, and the libraries are unable to deal with this. So in any case we must introduce significant changes to all standard libraries. A quick and dirty way would be to treat the console as a special byte-oriented stream that accepts only one encoding. This still requires circumventing the C and C++ standard libraries because they don't implement the WriteFile/WriteConsoleW switch. Is that correct?
Accepted answer by Albert
The general strategy I/we use in most (cross-platform) applications/projects is: we just use UTF-8 (I mean the real standard) everywhere. We use std::string as the container and we just interpret everything as UTF-8. We also handle all file IO this way, i.e. we expect UTF-8 and save UTF-8. When we get a string from somewhere and know that it is not UTF-8, we convert it to UTF-8.
The most common case where we stumble upon WinUTF16 is filenames. So whenever we handle filenames, we convert the UTF-8 string to WinUTF16, and the other way around when we search through a directory for files.
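A minimal sketch of that boundary conversion using the Win32 conversion functions (the helper names Utf8ToWide and WideToUtf8 are made up for this example):

```cpp
#include <windows.h>
#include <string>

// UTF-8 bytes -> UTF-16 for the wide Windows API.
std::wstring Utf8ToWide(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                  static_cast<int>(utf8.size()), nullptr, 0);
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                        static_cast<int>(utf8.size()), &wide[0], len);
    return wide;
}

// UTF-16 from the Windows API -> UTF-8 for internal use.
std::string WideToUtf8(const std::wstring& wide) {
    if (wide.empty()) return std::string();
    int len = WideCharToMultiByte(CP_UTF8, 0, wide.data(),
                                  static_cast<int>(wide.size()),
                                  nullptr, 0, nullptr, nullptr);
    std::string utf8(len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.data(),
                        static_cast<int>(wide.size()),
                        &utf8[0], len, nullptr, nullptr);
    return utf8;
}

// Example: open a UTF-8 path with the wide API.
// HANDLE h = CreateFileW(Utf8ToWide(path).c_str(), GENERIC_READ, 0,
//                        nullptr, OPEN_EXISTING, 0, nullptr);
```

Calling each conversion function twice, first with a null output buffer, is the usual way to size the destination string before doing the real conversion.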
The console isn't really used in our Windows build (in the Windows build, all console output is wrapped into a file). As we have UTF-8 everywhere, our console output is also UTF-8, which is fine for most modern systems. The Windows console log file therefore has its content in UTF-8 as well, and most text editors on Windows can read that without problems.
If we used the WinConsole more, and if we cared a lot that all special chars are displayed correctly, we would maybe write some automatic pipe handler which we install between fileno=0 and the real stdout and which uses WriteConsoleW as you have suggested (if there really is no easier way).
If you wonder how to realize such an automatic pipe handler: we have already implemented such a thing for all POSIX-like systems. The code probably doesn't work on Windows as it is, but I think it should be possible to port it. Our current pipe handler is similar to what tee does, i.e. if you do a cout << "Hello" << endl, it will be printed both on stdout and in some log file. Look at the code if you are interested in how this is done.
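A rough sketch of such a tee-like handler (not the answerer's actual code): a custom stream buffer that forwards every character both to the original stdout buffer and to a log file.

```cpp
#include <fstream>
#include <iostream>
#include <streambuf>

// A stream buffer that forwards every character to two underlying buffers,
// e.g. the original stdout buffer and a log file.
class TeeBuf : public std::streambuf {
public:
    TeeBuf(std::streambuf* a, std::streambuf* b) : a_(a), b_(b) {}
protected:
    int_type overflow(int_type ch) override {
        if (traits_type::eq_int_type(ch, traits_type::eof()))
            return traits_type::not_eof(ch);
        const char c = traits_type::to_char_type(ch);
        const int_type r1 = a_->sputc(c);
        const int_type r2 = b_->sputc(c);
        if (traits_type::eq_int_type(r1, traits_type::eof()) ||
            traits_type::eq_int_type(r2, traits_type::eof()))
            return traits_type::eof();
        return ch;
    }
    int sync() override {
        return (a_->pubsync() == 0 && b_->pubsync() == 0) ? 0 : -1;
    }
private:
    std::streambuf* a_;
    std::streambuf* b_;
};

int main() {
    std::ofstream log("run.log", std::ios::binary);
    TeeBuf tee(std::cout.rdbuf(), log.rdbuf());
    std::ostream both(&tee);
    both << "Hello" << std::endl;  // goes to stdout and to run.log
}
```

On Windows, the stdout side could in turn be replaced by a buffer that converts to UTF-16 and calls WriteConsoleW when the handle is a real console, as suggested in the question.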
Answered by Artyom
Several points:
- One important difference between Windows' "WriteConsoleW" and printf is that WriteConsoleW treats the console as a GUI rather than a text stream. For example, if you use it and redirect the output through a pipe, you will not capture the output.
- I would never say that code pages are obsolete. Maybe Windows developers would like them to be, but they never will be. The whole world, apart from the Windows API, uses byte-oriented streams to represent data: XML, HTML, HTTP, Unix, etc. all use encodings, and the most popular and most powerful one is UTF-8. So you may use wide strings internally, but in the external world you'll need something else.
- Even when you print wcout << L"Hello World" << endl, it is converted under the hood to a byte-oriented stream, on most systems (but not Windows) to UTF-8.
- My personal opinion: Microsoft made a mistake when it changed its API everywhere to wide instead of supporting UTF-8 everywhere. Of course you may argue about it. But in fact you have to separate text-oriented and byte-oriented streams and convert between them.
Answered by jveazey
To answer your first question, you can output Unicode strings to the Windows console using _setmode. Specific details regarding this can be found on Michael Kaplan's blog. By default, the console is not Unicode (UCS-2/UTF-16). It works in an Ansi (locale/code page) manner and must specifically be configured to use Unicode.
Also, you have to change the console font, as the default font only supports Ansi characters. There are some minor exceptions here, such as zero-extended ASCII characters, but printing actual Unicode characters requires the use of _setmode.
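As a minimal sketch of the _setmode approach mentioned above (mirroring the technique from Kaplan's articles, not an exact excerpt): switching stdout into UTF-16 text mode lets wide output reach the console as Unicode. Note that after this call only wide output functions such as wprintf or wcout may be used on that stream.

```cpp
#include <fcntl.h>
#include <io.h>
#include <stdio.h>

int main() {
    // Put stdout into UTF-16 text mode; wide output now reaches the console
    // as Unicode instead of being narrowed through the current code page.
    _setmode(_fileno(stdout), _O_U16TEXT);
    wprintf(L"Unicode test: \u00E9 \u0429 \u4E2D\u6587\n");
    return 0;
}
```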
> In Windows, everything is UTF-16. Regardless of whether you're dealing with the kernel, the graphics subsystem, the filesystem or whatever, you're passing UTF-16 strings. There are no locales or charsets in the Unix sense.
This is not completely true. While the underlying core of Windows does use Unicode, there is a huge amount of interoperability that comes into play that lets Windows interact with a wide variety of software.
Consider notepad (yes, notepad is far from a core component, but it gets my point across). Notepad has the ability to read files that contain Ansi (your current code page), Unicode or UTF-8. You might consider notepad to be a Unicode application, but that is not entirely accurate.
A better example is drivers. Drivers can be written in either Unicode or Ansi. It really depends on the nature of the interface. To further this point, Microsoft provides the StrSafe library, which was specifically written with kernel-mode drivers in mind, and it includes both Unicode and Ansi versions. While the drivers are either Ansi or Unicode, the Windows kernel must interact with them - correctly - regardless of whatever form they take.
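To make the "both versions" point concrete, here is a tiny sketch (not from the answer) of the parallel Ansi and Unicode entry points that strsafe.h exposes:

```cpp
#include <windows.h>
#include <strsafe.h>

void Demo() {
    char  ansiBuf[64];
    WCHAR wideBuf[64];
    // Same helper, byte-oriented ("A") and UTF-16 ("W") flavors.
    StringCchCopyA(ansiBuf, ARRAYSIZE(ansiBuf), "Ansi text");
    StringCchCopyW(wideBuf, ARRAYSIZE(wideBuf), L"Unicode text");
}
```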
The further away you get from the core of Windows, the more interoperability comes into play. This includes code pages and locales. You have to remember that not all software is written with Unicode in mind. Visual C++ 2010 still has the ability to build using Ansi, Multi-Byte or Unicode. This includes the use of code pages and locales, which are part of the C/C++ standard.
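A small sketch of that build-time switch as it typically appears in Visual C++ projects (an illustration, not something from the answer): with <tchar.h>, the same source compiles against the Ansi/MBCS or the Unicode character set depending on whether _UNICODE is defined.

```cpp
#include <windows.h>
#include <tchar.h>

int _tmain(int argc, _TCHAR* argv[]) {
    // _T("...") expands to "..." or L"..." and TCHAR to char or wchar_t,
    // so MessageBox resolves to MessageBoxA or MessageBoxW accordingly.
    const TCHAR* msg = _T("Hello from a character-set-neutral build");
    MessageBox(NULL, msg, _T("Demo"), MB_OK);
    return 0;
}
```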
> However, I think this represents a design flaw in the Windows API
The following two articles discuss this fairly well.
- Conventional wisdom is retarded, aka What the @#%&* is _O_U16TEXT?
- Header files are not retarded, aka What the @#%&* is _O_WTEXT?
> So my questions are: What to do in this situation? And why isn't this problem solved even in Microsoft's own libraries? Both the .NET Framework and the C and C++ libraries seem to adhere to the obsolete codepage model. How would you design the Windows API or an application framework to circumvent this issue?
On this point, I think you are looking at Windows in hindsight. Unicode did not come first, ASCII did. After ASCII came code pages. After code pages came DBCS. After DBCS came MBCS (and eventually UTF-8). After UTF-8 came Unicode (UTF-16/UCS-2).
Each of these technologies was incorporated into the Windows OS over the years, each building on the last, but without breaking the others. Software was written with each of these in mind. While it may not seem like it sometimes, Microsoft puts a huge amount of effort into not breaking software it didn't write. Even now, you can write new software that takes advantage of any of these technologies and it will work.
The real answer here is "compatibility". Microsoft still uses these technologies and so do many other companies. There are an untold number of programs, components and libraries which have not been updated (and never will be) to use Unicode. Even as newer technologies arise - like .NET - the older technologies must stick around. At the very least for interoperability.
For example, say you have a DLL that you need to interact with from .NET, but this DLL was written using Ansi (single-byte, code-page localized). To make matters worse, you don't have the source for the DLL. The only answer here is to use those obsolete features.
Answered by VITTUIX-MAN
The way I work with this correctly is as follows:
- Use UTF-16 and wchar_t internally; this works nicely with filenames and the Windows API in general.
- Set the codepage to 65001, which is UTF-8. This ensures that when you read plaintext files, Windows checks them for a UTF-16 BOM ("the Windows standard"), and if there is no BOM, the text will be treated as UTF-8 ("the world standard") and translated to UTF-16 for your use (a manual sketch of this check follows below).
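To make the second point concrete, here is a hand-rolled sketch of the described behavior (performing the check explicitly rather than relying on the code-page setting; the function name is made up): look for a UTF-16 LE BOM, and otherwise treat the bytes as UTF-8 and convert them to UTF-16 for internal use.

```cpp
#include <windows.h>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

std::wstring LoadTextFile(const wchar_t* path) {
    // The wide-path ifstream constructor is an MSVC extension; fine on Windows.
    std::ifstream in(path, std::ios::binary);
    std::vector<char> bytes((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());
    if (bytes.empty()) return std::wstring();

    if (bytes.size() >= 2 &&
        static_cast<unsigned char>(bytes[0]) == 0xFF &&
        static_cast<unsigned char>(bytes[1]) == 0xFE) {
        // UTF-16 LE BOM ("the Windows standard"): take the payload as wchar_t.
        return std::wstring(reinterpret_cast<const wchar_t*>(bytes.data() + 2),
                            (bytes.size() - 2) / sizeof(wchar_t));
    }

    // No BOM: assume UTF-8 ("the world standard") and convert to UTF-16.
    int len = MultiByteToWideChar(CP_UTF8, 0, bytes.data(),
                                  static_cast<int>(bytes.size()), nullptr, 0);
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, bytes.data(),
                        static_cast<int>(bytes.size()), &wide[0], len);
    return wide;
}
```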