char vs wchar_t vs char16_t vs char32_t (C++11)
Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL, and attribute it to the original authors on StackOverflow (not to this page).
Original: http://stackoverflow.com/questions/19068748/
Asked by user904963
From what I understand, a char is safe to house ASCII characters, whereas char16_t and char32_t are safe to house characters from Unicode, one for the 16-bit variety and another for the 32-bit variety (should I have said "a" instead of "the"?). But I'm then left wondering what the purpose of wchar_t is. Should I ever use that type in new code, or is it simply there to support old code? What was the purpose of wchar_t in old code if, from what I understand, its size had no guarantee to be bigger than a char? Clarification would be nice!
Answered by bames53
char is for 8-bit code units, char16_t is for 16-bit code units, and char32_t is for 32-bit code units. Any of these can be used for 'Unicode'; UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units, and UTF-32 uses 32-bit code units.
The guarantee made for wchar_t was that any character supported in a locale could be converted from char to wchar_t, and whatever representation was used for char, be it multiple bytes, shift codes, what have you, the wchar_t would be a single, distinct value. The purpose of this was that you could then manipulate wchar_t strings with the same simple algorithms used for ASCII.
For example, converting ASCII to uppercase looks like this:
// Requires #include <locale>
auto loc = std::locale("");   // the user's preferred locale
char s[] = "hello";
for (char &c : s) {
    c = std::toupper(c, loc);
}
But this won't handle converting all characters in UTF-8 to uppercase, or all characters in some other encoding like Shift-JIS. People wanted to be able to internationalize this code like so:
// Requires #include <locale>
auto loc = std::locale("");   // the user's preferred locale
wchar_t s[] = L"hello";
for (wchar_t &c : s) {
    c = std::toupper(c, loc);
}
So every wchar_t is a 'character', and if it has an uppercase version then it can be converted directly. Unfortunately this doesn't work all the time. For example, some languages have oddities such as the German letter ß, whose uppercase version is actually the two characters SS rather than a single character.
So internationalized text handling is intrinsically harder than ASCII and cannot really be simplified in the way the designers of wchar_t intended. As such, wchar_t and wide characters in general provide little value.
The only reason to use them is that they've been baked into some APIs and platforms. However, I prefer to stick to UTF-8 in my own code even when developing on such platforms, and to just convert at the API boundaries to whatever encoding is required.
Answered by Dietmar Kühl
The type wchar_t was put into the standard when Unicode promised to create a 16-bit representation. Most vendors chose to make wchar_t 32 bits, but one large vendor chose to make it 16 bits. Since Unicode needs more than 16 bits (code points run up to U+10FFFF, which takes 21 bits), it was felt that we should have better character types.
The intent is for char16_t to represent UTF-16, while char32_t is meant to directly represent Unicode characters. However, on systems using wchar_t as part of their fundamental interface, you'll be stuck with wchar_t. If you are unconstrained, I would personally use char to represent Unicode using UTF-8. The problem with char16_t and char32_t is that they are not fully supported, not even in the standard C++ library: for example, there are no streams supporting these types directly, and it takes more work than just instantiating the stream for these types.