char vs wchar_t vs char16_t vs char32_t (c++11)

Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL and authors, and attribute the content to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/19068748/

Date: 2020-08-27 22:29:04  Source: igfitidea

Tags: c++, c++11

Asked by user904963

From what I understand, a char is safe to house ASCII characters, whereas char16_t and char32_t are safe to house characters from Unicode, one for the 16-bit variety and another for the 32-bit variety (should I have said "a" instead of "the"?). But I'm then left wondering what the purpose of wchar_t is. Should I ever use that type in new code, or is it simply there to support old code? What was the purpose of wchar_t in old code if, from what I understand, its size was never guaranteed to be bigger than a char? Clarification would be nice!

Answered by bames53

char is for 8-bit code units, char16_t is for 16-bit code units, and char32_t is for 32-bit code units. Any of these can be used for 'Unicode'; UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units, and UTF-32 uses 32-bit code units.
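To make the code-unit sizes concrete, here is a small sketch that stores the same two characters in each encoding and counts the code units. The variable and function names are mine, not from the answer, and the UTF-8 bytes are written as hex escapes so the strings stay plain char under every language standard.

```cpp
#include <cassert>
#include <string>

// U+20AC EURO SIGN, inside the Basic Multilingual Plane.
const std::string    euro_utf8  = "\xE2\x82\xAC";  // three 8-bit code units
const std::u16string euro_utf16 = u"\u20AC";       // one 16-bit code unit
const std::u32string euro_utf32 = U"\u20AC";       // one 32-bit code unit

// U+1F600 GRINNING FACE, outside the Basic Multilingual Plane.
const std::string    grin_utf8  = "\xF0\x9F\x98\x80";  // four 8-bit code units
const std::u16string grin_utf16 = u"\U0001F600";       // two 16-bit code units (a surrogate pair)
const std::u32string grin_utf32 = U"\U0001F600";       // one 32-bit code unit

// Illustrative check: size() counts code units, not characters.
void check_code_unit_counts() {
    assert(euro_utf8.size() == 3 && euro_utf16.size() == 1 && euro_utf32.size() == 1);
    assert(grin_utf8.size() == 4 && grin_utf16.size() == 2 && grin_utf32.size() == 1);
}
```

Only the char32_t strings have one element per code point; the other two are variable-length encodings.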

The guarantee made for wchar_t was that any character supported in a locale could be converted from char to wchar_t, and whatever representation was used for char, be it multiple bytes, shift codes, what have you, the wchar_t would be a single, distinct value. The purpose of this was that you could then manipulate wchar_t strings with the same simple algorithms used for ASCII.
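As a rough illustration of that char to wchar_t conversion, the sketch below wraps the standard std::mbstowcs call. The helper name widen is hypothetical. The conversion follows the current C locale, so anything beyond ASCII only decodes correctly after something like std::setlocale(LC_ALL, "") has selected a locale whose encoding matches the input; the usage shown here stays within ASCII and the default "C" locale.

```cpp
#include <cassert>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: convert a multibyte char string to wchar_t
// using the encoding of the current C locale.
std::wstring widen(const std::string& narrow) {
    // A multibyte string never decodes to more wide characters than it has bytes,
    // so size() + 1 elements (including the terminator) are always enough.
    std::vector<wchar_t> buf(narrow.size() + 1);
    const std::size_t n = std::mbstowcs(buf.data(), narrow.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("invalid multibyte sequence for the current locale");
    }
    return std::wstring(buf.data(), n);
}

// Illustrative usage in the default "C" locale.
void widen_example() {
    assert(widen("hello") == L"hello");
}
```

Note that std::mbstowcs stops at the first null byte, so this sketch is not suitable for strings with embedded nulls.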

For example, converting ASCII to uppercase goes like:

#include <locale>

auto loc = std::locale("");  // the user's preferred locale

char s[] = "hello";
for (char &c : s) {
  c = std::toupper(c, loc);  // locale-aware overload from <locale>
}

But this won't handle converting all characters in UTF-8 to uppercase, or all characters in some other encoding like Shift-JIS. People wanted to be able to internationalize this code like so:

#include <locale>

auto loc = std::locale("");  // the user's preferred locale

wchar_t s[] = L"hello";
for (wchar_t &c : s) {
  c = std::toupper(c, loc);  // one wchar_t in, one wchar_t out
}

So every wchar_t is a 'character', and if it has an uppercase version then it can be directly converted. Unfortunately this doesn't really work all the time; for example, there exist oddities in some languages, such as the German letter ß, where the uppercase version is actually the two characters SS instead of a single character.
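A minimal sketch of why the per-character loop cannot cover this case: std::toupper returns exactly one wchar_t, so producing SS needs a string-level function whose output can be longer than its input. The function name and the hard-coded special case below are mine, purely for illustration; real case mapping needs full Unicode tables, usually from a dedicated library.

```cpp
#include <cassert>
#include <locale>
#include <string>

// Hypothetical helper: uppercase a wide string, special-casing U+00DF (sharp s).
// Uses the classic "C" locale by default so the result does not depend on the environment.
std::wstring upper_with_sharp_s(const std::wstring& in,
                                const std::locale& loc = std::locale::classic()) {
    std::wstring out;
    for (wchar_t c : in) {
        if (c == L'\u00DF') {
            out += L"SS";               // one character in, two characters out
        } else {
            out += std::toupper(c, loc);  // ordinary one-to-one mapping
        }
    }
    return out;
}

// Illustrative usage: the output is one character longer than the input.
void sharp_s_example() {
    assert(upper_with_sharp_s(L"stra\u00DFe") == L"STRASSE");
}
```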

So internationalized text handling is intrinsically harder than ASCII and cannot really be simplified in the way the designers of wchar_t intended. As such, wchar_t and wide characters in general provide little value.

The only reason to use them is that they've been baked into some APIs and platforms. However, I prefer to stick to UTF-8 in my own code even when developing on such platforms, and to just convert at the API boundaries to whatever encoding is required.
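One way to picture "convert at the API boundaries" is a small helper that turns the UTF-8 you keep internally into UTF-16 just before calling a 16-bit API. The sketch below is hand-written and hypothetical, not something the answer prescribes. It rejects malformed lead and continuation bytes, truncated input, and surrogate code points, but it does not reject overlong encodings, so treat it as an illustration rather than a validated converter.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 and re-encode as UTF-16.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes that follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1) throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cont = static_cast<unsigned char>(in[i + k]);
            if ((cont & 0xC0) != 0x80) throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (cont & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw std::invalid_argument("not a Unicode scalar value");
        }
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // split the remaining 20 bits into a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Illustrative usage: U+20AC fits in one unit, U+1F600 needs two.
void utf8_to_utf16_example() {
    assert(utf8_to_utf16("\xE2\x82\xAC") == u"\u20AC");
    assert(utf8_to_utf16("\xF0\x9F\x98\x80").size() == 2);
}
```

On a platform whose APIs take 16-bit wide strings, the result would be copied or cast into whatever string type that API expects at the call site, and the rest of the program never sees it.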

Answered by Dietmar Kühl

The type wchar_t was put into the standard when Unicode promised to create a 16-bit representation. Most vendors chose to make wchar_t 32 bits, but one large vendor chose to make it 16 bits. Since Unicode needs more than 16 bits (code points go up to U+10FFFF, which takes 21 bits), it was felt that we should have better character types.
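The vendor split is easy to observe. The sketch below (function names are mine) reports the width of wchar_t and how many wchar_t code units the compiler emits for a character above U+FFFF. Typically this gives 32 bits and one unit on Linux and macOS, and 16 bits and two units on Windows, but that is an observation about common toolchains, not a guarantee from the standard.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <string>

// Width of wchar_t in bits on the current platform.
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// Number of wchar_t code units used for U+1F600, which does not fit in 16 bits.
std::size_t wide_units_for_u1f600() {
    const std::wstring s = L"\U0001F600";
    return s.size();
}

// Illustrative check: a wchar_t narrower than 21 bits cannot hold the character in one unit.
void wchar_width_example() {
    assert(wide_units_for_u1f600() == (wchar_bits() >= 21 ? 1u : 2u));
}
```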

The intent for char16_t is to represent UTF-16, and char32_t is meant to directly represent Unicode characters. However, on systems using wchar_t as part of their fundamental interface, you'll be stuck with wchar_t. If you are unconstrained, I would personally use char to represent Unicode using UTF-8. The problem with char16_t and char32_t is that they are not fully supported, not even in the standard C++ library: for example, there are no streams supporting these types directly, and it is more work than just instantiating the stream for these types.
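Because std::cout cannot take a char32_t string directly, one workaround is to encode the code points as UTF-8 yourself and write the resulting bytes. The sketch below is a hypothetical hand-written encoder, not a standard library facility, and it assumes the terminal or file receiving the output expects UTF-8.

```cpp
#include <cassert>
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical helper: encode UTF-32 code points as UTF-8 bytes.
std::string utf32_to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw std::invalid_argument("not a Unicode scalar value");
        }
        if (cp < 0x80) {                       // 1 byte
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Write a char32_t string to std::cout as UTF-8 bytes.
void print_utf32(const std::u32string& s) {
    std::cout << utf32_to_utf8(s) << '\n';
}

// Illustrative usage: U+20AC becomes the three bytes E2 82 AC.
void utf32_to_utf8_example() {
    assert(utf32_to_utf8(U"\u20AC") == "\xE2\x82\xAC");
}
```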
