Note: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it under the same license, but you must attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/11107608/

Time: 2020-08-27 14:51:51  Source: igfitidea

What's "wrong" with C++ wchar_t and wstrings? What are some alternatives to wide characters?

Tags: c++, winapi, unicode, internationalization, wstring

Asked by Ken Li

I have seen a lot of people in the C++ community (particularly ##c++ on freenode) resent the use of wstrings and wchar_t, and their use in the Windows API. What exactly is "wrong" with wchar_t and wstring, and if I want to support internationalization, what are some alternatives to wide characters?

Answered by bames53

What is wchar_t?

wchar_t is defined such that any locale's char encoding can be converted to a wchar_t representation where every wchar_t represents exactly one codepoint:

Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (22.3.1).

    -- C++ [basic.fundamental] 3.9.1/5

This does not require that wchar_t be large enough to represent any character from all locales simultaneously. That is, the encoding used for wchar_t may differ between locales, which means that you cannot necessarily convert a string to wchar_t using one locale and then convert back to char using another locale. [1]

Since using wchar_t as a common representation between all locales seems to be the primary use for wchar_t in practice, you might wonder what it's good for if not that.

The original intent and purpose of wchar_t was to make text processing simple by defining it such that it requires a one-to-one mapping from a string's code units to the text's characters, thus allowing the same simple algorithms that are used with ASCII strings to work with other languages.

Unfortunately, the wording of wchar_t's specification assumes a one-to-one mapping between characters and codepoints to achieve this. Unicode breaks that assumption [2], so you can't safely use wchar_t for simple text algorithms either.

This means that portable software cannot use wchar_t either as a common representation for text between locales, or to enable the use of simple text algorithms.
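
To make the broken assumption concrete, here is a minimal sketch (the function names are mine, not from any library) that stores the same user-perceived character "é" in two valid ways, using char32_t so that every element is a whole code point. A per-element algorithm still gets different answers for the two spellings.

```cpp
#include <cstddef>
#include <string>

// Precomposed form: the single code point U+00E9
// (LATIN SMALL LETTER E WITH ACUTE).
inline std::u32string precomposed_e_acute() { return U"\u00E9"; }

// Decomposed form: U+0065 ('e') followed by U+0301 (COMBINING ACUTE ACCENT).
// It renders as the same character as the precomposed form.
inline std::u32string decomposed_e_acute() { return U"e\u0301"; }

// A "simple" algorithm in the ASCII style: one element == one character.
inline std::size_t naive_character_count(const std::u32string& s) {
    return s.size();
}
```

The naive count is 1 for the first string and 2 for the second, and a plain == comparison treats them as different text, even with a fixed-width encoding.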

What use is wchar_t today?

Not much, for portable code anyway. If __STDC_ISO_10646__ is defined, then values of wchar_t directly represent Unicode codepoints with the same values in all locales. That makes it safe to do the inter-locale conversions mentioned earlier. However, you can't rely only on it to decide that you can use wchar_t this way because, while most Unix platforms define it, Windows does not, even though Windows uses the same wchar_t encoding in all locales.
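
A minimal sketch of testing for the macro at compile time; the function name is hypothetical, and as noted above a false result does not by itself mean wchar_t is locale-dependent on that platform.

```cpp
// Returns true when the implementation promises that every wchar_t value
// is an ISO/IEC 10646 (Unicode) code point, regardless of the current locale.
constexpr bool wchar_is_unicode_code_point() {
#ifdef __STDC_ISO_10646__
    return true;
#else
    return false;
#endif
}
```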

The reason Windows doesn't define __STDC_ISO_10646__ is that Windows uses UTF-16 as its wchar_t encoding, and UTF-16 uses surrogate pairs to represent codepoints greater than U+FFFF, so UTF-16 doesn't satisfy the requirements for __STDC_ISO_10646__.
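
For readers who haven't met surrogate pairs, the following sketch shows the arithmetic. The helper name is made up for illustration, and it assumes the input is a valid Unicode scalar value.

```cpp
#include <string>

// Encodes one Unicode code point as UTF-16 code units.
// Assumes cp is a valid scalar value (<= 0x10FFFF and not itself a surrogate).
inline std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one 16-bit unit holds the code point.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Above U+FFFF: split the remaining 20 bits across two units.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF))); // low surrogate
    }
    return out;
}
```

U+20AC fits in one unit, while U+1F600 becomes the two units 0xD83D 0xDE00, so a single 16-bit wchar_t cannot hold every code point.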

For platform-specific code wchar_t may be more useful. It's essentially required on Windows (e.g., some files simply cannot be opened without using wchar_t filenames), though Windows is the only platform where this is true as far as I know (so maybe we can think of wchar_t as 'Windows_char_t').

In hindsight, wchar_t is clearly not useful for simplifying text handling, or as storage for locale-independent text. Portable code should not attempt to use it for these purposes. Non-portable code may find it useful simply because some API requires it.

Alternatives

The alternative I like is to use UTF-8 encoded C strings, even on platforms not particularly friendly toward UTF-8.

This way one can write portable code using a common text representation across platforms, use standard datatypes for their intended purpose, get the language's support for those types (e.g. string literals, though some tricks are necessary to make it work for some compilers), some standard library support, debugger support (more tricks may be necessary), etc. With wide characters it's generally harder or impossible to get all of this, and you may get different pieces on different platforms.

One thing UTF-8 does not provide is the ability to use simple text algorithms such as are possible with ASCII. In this UTF-8 is no worse than any other Unicode encoding. In fact it may be considered to be better because multi-code unit representations in UTF-8 are more common and so bugs in code handling such variable width representations of characters are more likely to be noticed and fixed than if you try to stick to UTF-32 with NFC or NFKC.
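
As a small illustration of why byte-oriented algorithms stop working, this sketch counts code points in a UTF-8 string by skipping continuation bytes. The function name is hypothetical, and the code assumes the input is well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Counts code points by counting every byte that is NOT a continuation
// byte (continuation bytes have the bit pattern 10xxxxxx).
// Assumes well-formed UTF-8; it does no validation.
inline std::size_t utf8_code_point_count(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For "naïve" encoded as UTF-8 (where "ï" is the two bytes 0xC3 0xAF), size() reports 6 bytes while this function reports 5 code points. Note that this still counts code points, not user-perceived characters.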

Many platforms use UTF-8 as their native char encoding and many programs do not require any significant text processing, and so writing an internationalized program on those platforms is little different from writing code without considering internationalization. Writing more widely portable code, or writing on other platforms requires inserting conversions at the boundaries of APIs that use other encodings.
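
A rough sketch of the kind of boundary conversion meant here, written by hand so it has no platform dependencies. The name utf8_to_utf16 is my own, not a standard function, and the validation is deliberately partial: it rejects truncated sequences and bad continuation bytes but not overlong forms or encoded surrogates. A real program might prefer the platform's own conversion routines.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts UTF-8 text to UTF-16 just before it is
// handed to an API that expects UTF-16 code units.
inline std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        // The lead byte determines the sequence length and the first payload bits.
        if (b0 < 0x80)                { cp = b0;        len = 1; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        // Each continuation byte contributes six more bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above U+FFFF become a surrogate pair.
            const char32_t v = cp - 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
        }
        i += len;
    }
    return out;
}
```

The rest of the program keeps its strings as UTF-8 in std::string, and only the thin layer that talks to a UTF-16 API calls a function like this.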

Another alternative used by some software is to choose a cross-platform representation, such as unsigned short arrays holding UTF-16 data, and then to supply all the library support and simply live with the costs in language support, etc.

C++11 adds new kinds of wide characters as alternatives to wchar_t: char16_t and char32_t, with attendant language/library features. These aren't actually guaranteed to be UTF-16 and UTF-32, but I don't imagine any major implementation will use anything else. C++11 also improves UTF-8 support, for example with UTF-8 string literals, so it won't be necessary to trick VC++ into producing UTF-8 encoded strings (although I may continue to do so rather than use the u8 prefix).
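
A short sketch of those literal forms, using universal character names so the result doesn't depend on the source file's encoding. The function names are mine, and the sizes in the comments assume the implementation uses UTF-16 and UTF-32 for these types, which is the case on the mainstream compilers I know of.

```cpp
#include <string>

// u8 literal: "é" (U+00E9) takes two UTF-8 code units plus the terminator.
// sizeof is used so this holds whether the element type is char or char8_t.
static_assert(sizeof(u8"\u00E9") == 3, "2 UTF-8 code units + terminator");

// u literal: char16_t elements; U+1F600 needs a surrogate pair in UTF-16.
inline std::u16string emoji_utf16() { return u"\U0001F600"; }

// U literal: char32_t elements; one element per code point in UTF-32.
inline std::u32string emoji_utf32() { return U"\U0001F600"; }
```

Unlike wchar_t, the element widths here are the same on every platform, so emoji_utf16().size() is 2 and emoji_utf32().size() is 1 wherever the code is compiled.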

Alternatives to avoid

TCHAR: TCHAR is for migrating ancient Windows programs that assume legacy encodings from char to wchar_t, and is best forgotten unless your program was written in some previous millennium. It's not portable and is inherently unspecific about its encoding and even its data type, making it unusable with any non-TCHAR based API. Since its purpose is migration to wchar_t, which we've seen above isn't a good idea, there is no value whatsoever in using TCHAR.

1. Characters which are representable in wchar_t strings but which are not supported in any locale are not required to be represented with a single wchar_t value. This means that wchar_t could use a variable width encoding for certain characters, another clear violation of the intent of wchar_t. It's arguable, though, that a character being representable by wchar_t is enough to say that the locale 'supports' that character, in which case variable-width encodings aren't legal and Windows's use of UTF-16 is non-conformant.

2. Unicode allows many characters to be represented with multiple code points, which creates the same problems for simple text algorithms as variable width encodings. Even if one strictly maintains a composed normalization, some characters still require multiple code points. See: http://www.unicode.org/standard/where/

Answered by paulsm4

There's nothing "wrong" with wchar_t. The problem is that, back in the NT 3.x days, Microsoft decided that Unicode was Good (it is), and decided to implement Unicode as 16-bit wchar_t characters. So most Microsoft literature from the mid-90s pretty much equated Unicode == utf16 == wchar_t.

Which, sadly, is not at all the case. "Wide characters" are not necessarily 2 bytes on all platforms, under all circumstances.
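
A quick way to see this on your own machine is the sketch below. The function names are hypothetical, and the comments assume the usual pairing of 16-bit wchar_t with UTF-16 and wider wchar_t with UTF-32.

```cpp
#include <climits>
#include <cstddef>
#include <string>

// Width of wchar_t in bits on the current platform
// (commonly 16 on Windows and 32 on Linux and macOS).
inline std::size_t wchar_bits() { return sizeof(wchar_t) * CHAR_BIT; }

// Number of wchar_t elements the compiler emits for one code point outside
// the Basic Multilingual Plane (U+1F600): two with a 16-bit UTF-16 wchar_t,
// one with a 32-bit wchar_t.
inline std::size_t wide_units_for_emoji() {
    return std::wstring(L"\U0001F600").size();
}
```

The same source line therefore yields a std::wstring of different length depending on where it is compiled.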

This is one of the best primers on "Unicode" (independent of this question, independent of C++) I've ever seen. I highly recommend it:

And I honestly believe the best way to deal with "8-bit ASCII" vs "Win32 wide characters" vs "wchar_t-in-general" is simply to accept that "Windows is Different" ... and code accordingly.

IMHO...

PS:

I totally agree with jamesdlin above:

On Windows, you don't really have a choice. Its internal APIs were designed for UCS-2, which was reasonable at the time since it was before the variable-length UTF-8 and UTF-16 encodings were standardized. But now that they support UTF-16, they've ended up with the worst of both worlds.
