C++ std::wstring VS std::string
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original address and author information, and attribute it to the original authors (not me): Stack Overflow
Original: http://stackoverflow.com/questions/402283/
std::wstring VS std::string
Asked by paercebal
I am not able to understand the differences between std::string and std::wstring. I know wstring supports wide characters such as Unicode characters. I have got the following questions:
- When should I use std::wstring over std::string?
- Can std::string hold the entire ASCII character set, including the special characters?
- Is std::wstring supported by all popular C++ compilers?
- What is exactly a "wide character"?
Answer by paercebal
string? wstring?
std::string is a basic_string templated on a char, and std::wstring on a wchar_t.
char vs. wchar_t
char is supposed to hold a character, usually an 8-bit character.
wchar_t is supposed to hold a wide character, and then, things get tricky:
On Linux, a wchar_t is 4 bytes, while on Windows, it's 2 bytes.
What about Unicode, then?
The problem is that neither char nor wchar_t is directly tied to Unicode.
On Linux?
Let's take a Linux OS: My Ubuntu system is already Unicode aware. When I work with a char string, it is natively encoded in UTF-8 (i.e. a Unicode string of chars). The following code:
#include <cstring>  // std::strlen
#include <cwchar>   // std::wcslen
#include <iostream>
int main(int argc, char* argv[])
{
    const char text[] = "olé" ;
    std::cout << "sizeof(char) : " << sizeof(char) << std::endl ;
    std::cout << "text : " << text << std::endl ;
    std::cout << "sizeof(text) : " << sizeof(text) << std::endl ;
    std::cout << "strlen(text) : " << std::strlen(text) << std::endl ;
    std::cout << "text(ordinals) :" ;
    for(size_t i = 0, iMax = std::strlen(text); i < iMax; ++i)
    {
        // go through unsigned char so bytes above 127 are not printed as negative values
        std::cout << " " << static_cast<unsigned int>(
            static_cast<unsigned char>(text[i])
        );
    }
    std::cout << std::endl << std::endl ;
    // - - -
    const wchar_t wtext[] = L"olé" ;
    std::cout << "sizeof(wchar_t) : " << sizeof(wchar_t) << std::endl ;
    //std::cout << "wtext : " << wtext << std::endl ; <- error
    std::cout << "wtext : UNABLE TO CONVERT NATIVELY." << std::endl ;
    std::wcout << L"wtext : " << wtext << std::endl;
    std::cout << "sizeof(wtext) : " << sizeof(wtext) << std::endl ;
    std::cout << "wcslen(wtext) : " << std::wcslen(wtext) << std::endl ;
    std::cout << "wtext(ordinals) :" ;
    for(size_t i = 0, iMax = std::wcslen(wtext); i < iMax; ++i)
    {
        // wchar_t is 4 bytes on Linux, so do not narrow it to unsigned short
        std::cout << " " << static_cast<unsigned int>(wtext[i]);
    }
    std::cout << std::endl << std::endl ;
    return 0;
}
outputs the following text:
sizeof(char) : 1
text : olé
sizeof(text) : 5
strlen(text) : 4
text(ordinals) : 111 108 195 169
sizeof(wchar_t) : 4
wtext : UNABLE TO CONVERT NATIVELY.
wtext : ol?
sizeof(wtext) : 16
wcslen(wtext) : 3
wtext(ordinals) : 111 108 233
You'll see the "olé" text in char is really constructed by four chars: 111, 108, 195 and 169 (not counting the trailing zero). (I'll let you study the wchar_t code as an exercise.)
So, when working with a char on Linux, you should usually end up using Unicode without even knowing it. And as std::string works with char, std::string is already Unicode-ready.
Note that std::string, like the C string API, will consider the "olé" string to have 4 characters, not three. So you should be cautious when truncating/playing with Unicode chars because some combinations of chars are forbidden in UTF-8.
On Windows?
On Windows, this is a bit different. Win32 had to support a lot of applications working with char and with the different charsets/codepages produced all over the world, before the advent of Unicode.
So their solution was an interesting one: If an application works with char, then the char strings are encoded/printed/shown on GUI labels using the local charset/codepage on the machine. For example, "olé" would be "olé" in a French-localized Windows, but would be something different on a Cyrillic-localized Windows ("olй" if you use Windows-1251). Thus, "historical apps" will usually still work the same old way.
For Unicode based applications, Windows uses wchar_t, which is 2 bytes wide, and is encoded in UTF-16, which is Unicode encoded in 2-byte units (or at the very least, in the mostly compatible UCS-2, which is almost the same thing IIRC).
Applications using char are said to be "multibyte" (because each glyph is composed of one or more chars), while applications using wchar_t are said to be "widechar" (because each glyph is composed of one or two wchar_t). See the MultiByteToWideChar and WideCharToMultiByte Win32 conversion APIs for more info.
Thus, if you work on Windows, you badly want to use wchar_t (unless you use a framework hiding that, like GTK+ or QT...). The fact is that behind the scenes, Windows works with wchar_t strings, so even historical applications will have their char strings converted to wchar_t when using an API like SetWindowText() (a low level API function to set the label on a Win32 GUI).
Memory issues?
UTF-32 is 4 bytes per character, so there is not much to add, except that a UTF-8 text and a UTF-16 text will always use less memory than, or the same amount as, a UTF-32 text (and usually less).
If there is a memory issue, then you should know that for most western languages, UTF-8 text will use less memory than the same UTF-16 one.
Still, for other languages (Chinese, Japanese, etc.), the memory used will be either the same, or slightly larger for UTF-8 than for UTF-16.
All in all, UTF-16 will mostly use 2 and occasionally 4 bytes per character (unless you're dealing with some kind of esoteric language glyphs (Klingon? Elvish?)), while UTF-8 will spend from 1 to 4 bytes.
See http://en.wikipedia.org/wiki/UTF-8#Compared_to_UTF-16 for more info.
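The byte counts above can be checked with a small sketch. The struct EncodedSizes and the function encoded_sizes are hypothetical names introduced here for illustration; the input is assumed to be a sequence of valid Unicode code points held in a std::u32string.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type: storage needed, in bytes, for each encoding.
struct EncodedSizes
{
    std::size_t utf8;
    std::size_t utf16;
    std::size_t utf32;
};

// Hypothetical helper: sums the per-code-point storage cost in UTF-8,
// UTF-16 and UTF-32. Assumes every element is a valid code point.
EncodedSizes encoded_sizes(const std::u32string& text)
{
    EncodedSizes sizes{0, 0, 0};
    for (char32_t cp : text)
    {
        // UTF-8: 1 byte up to U+007F, 2 up to U+07FF, 3 up to U+FFFF, else 4.
        sizes.utf8 += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        // UTF-16: one 2-byte unit inside the BMP, a surrogate pair outside.
        sizes.utf16 += cp < 0x10000 ? 2 : 4;
        // UTF-32: always 4 bytes.
        sizes.utf32 += 4;
    }
    return sizes;
}
```

For "olé" this gives 4, 6 and 12 bytes; for the two Chinese characters U+4E2D U+56FD it gives 6, 4 and 8, which matches the western versus CJK trade-off described above.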
Conclusion
When should I use std::wstring over std::string?
On Linux? Almost never (§).
On Windows? Almost always (§).
On cross-platform code? Depends on your toolkit...
(§): unless you use a toolkit/framework saying otherwise
Can std::string hold all the ASCII character set including special characters?
Notice: A std::string is suitable for holding a 'binary' buffer, where a std::wstring is not!
On Linux? Yes.
On Windows? Only special characters available for the current locale of the Windows user.
Edit (After a comment from Johann Gerell): a std::string will be enough to handle all char-based strings (each char being a number from 0 to 255). But:
- ASCII is supposed to go from 0 to 127. Higher chars are NOT ASCII.
- a char from 0 to 127 will be held correctly
- a char from 128 to 255 will have a meaning depending on your encoding (Unicode, non-Unicode, etc.), but it will be able to hold all Unicode glyphs as long as they are encoded in UTF-8.
Is std::wstring supported by almost all popular C++ compilers?
Mostly, with the exception of GCC based compilers that are ported to Windows.
It works on my g++ 4.3.2 (under Linux), and I used Unicode API on Win32 since Visual C++ 6.
What is exactly a wide character?
In C/C++, it's a character type written wchar_t which is larger than the simple char character type. It is supposed to be used to hold characters whose indices (like Unicode glyphs) are larger than 255 (or 127, depending...).
Answer by Pavel Radzivilovsky
I recommend avoiding std::wstring on Windows or elsewhere, except when required by the interface, or anywhere near Windows API calls and respective encoding conversions as a syntactic sugar.
My view is summarized in http://utf8everywhere.org of which I am a co-author.
Unless your application is API-call-centric, e.g. mainly a UI application, the suggestion is to store Unicode strings in std::string, encoded in UTF-8, performing conversion near API calls. The benefits outlined in the article outweigh the apparent annoyance of conversion, especially in complex applications. This is doubly so for multi-platform and library development.
And now, answering your questions:
- A few weak reasons. It exists for historical reasons, where widechars were believed to be the proper way of supporting Unicode. It is now used to interface with APIs that prefer UTF-16 strings. I use them only in the direct vicinity of such API calls.
- This has nothing to do with std::string. It can hold whatever encoding you put in it. The only question is how you treat its content. My recommendation is UTF-8, so it will be able to hold all Unicode characters correctly. It's a common practice on Linux, but I think Windows programs should do it also.
- No.
- Wide character is a confusing name. In the early days of Unicode, there was a belief that a character can be encoded in two bytes, hence the name. Today, it stands for "any part of the character that is two bytes long". UTF-16 is seen as a sequence of such byte pairs (aka wide characters). A character in UTF-16 takes either one or two pairs.
Answer by Frunsi
So, every reader here now should have a clear understanding about the facts, the situation. If not, then you must read paercebal's outstandingly comprehensive answer [btw: thanks!].
My pragmatic conclusion is shockingly simple: all that C++ (and STL) "character encoding" stuff is substantially broken and useless. Blame it on Microsoft or not, that will not help anyway.
My solution, after in-depth investigation, much frustration and the consequential experiences, is the following:
- Accept that you have to be responsible on your own for the encoding and conversion stuff (and you will see that much of it is rather trivial).
- Use std::string for any UTF-8 encoded strings (just a typedef std::string UTF8String).
- Accept that such a UTF8String object is just a dumb but cheap container. Never access and/or manipulate characters in it directly (no search, replace, and so on). You could, but you really, really do not want to waste your time writing text manipulation algorithms for multi-byte strings! Even if other people already did such stupid things, don't do that! Let it be! (Well, there are scenarios where it makes sense... just use the ICU library for those.)
- Use std::wstring for UCS-2 encoded strings (typedef std::wstring UCS2String). This is a compromise, and a concession to the mess that the WIN32 API introduced. UCS-2 is sufficient for most of us (more on that later...).
- Use UCS2String instances whenever character-by-character access is required (read, manipulate, and so on). Any character-based processing should be done in a NON-multibyte representation. It is simple, fast, easy.
- Add two utility functions to convert back and forth between UTF-8 and UCS-2:
UCS2String ConvertToUCS2( const UTF8String &str );
UTF8String ConvertToUTF8( const UCS2String &str );
The conversions are straightforward, google should help here ...
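For readers who would rather not search, here is one possible sketch of the two utility functions, using the signatures and typedefs named above. The bodies are an assumption, not the author's code: anything outside the Basic Multilingual Plane and any malformed byte becomes U+FFFD (the hypothetical constant kReplacement), and overlong or surrogate encodings are not rejected. A production version would more likely lean on ICU or a platform API.

```cpp
#include <cstddef>
#include <string>

typedef std::string  UTF8String;
typedef std::wstring UCS2String;

// Assumption of this sketch: unrepresentable or malformed input becomes U+FFFD.
const wchar_t kReplacement = 0xFFFD;

UCS2String ConvertToUCS2(const UTF8String& str)
{
    UCS2String out;
    std::size_t i = 0;
    while (i < str.size())
    {
        const unsigned char lead = static_cast<unsigned char>(str[i]);
        unsigned long cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(kReplacement); ++i; continue; }  // stray byte

        bool ok = (i + extra < str.size());  // sequence must not be truncated
        for (std::size_t k = 1; ok && k <= extra; ++k)
        {
            const unsigned char c = static_cast<unsigned char>(str[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;   // not a continuation byte
            else cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok) { out.push_back(kReplacement); ++i; continue; }

        // UCS-2 only covers U+0000..U+FFFF.
        out.push_back(cp <= 0xFFFF ? static_cast<wchar_t>(cp) : kReplacement);
        i += extra + 1;
    }
    return out;
}

UTF8String ConvertToUTF8(const UCS2String& str)
{
    UTF8String out;
    for (wchar_t wc : str)
    {
        unsigned long cp = static_cast<unsigned long>(wc);
        if (cp > 0xFFFF) cp = 0xFFFD;  // outside UCS-2 (possible where wchar_t is 32 bits)
        if (cp < 0x80)
        {
            out.push_back(static_cast<char>(cp));
        }
        else if (cp < 0x800)
        {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
        else
        {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

Round-tripping "olé" through ConvertToUCS2 and back through ConvertToUTF8 returns the original four bytes, and the intermediate UCS2String has exactly three elements, one per character.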
That's it. Use UTF8String wherever memory is precious and for all UTF-8 I/O. Use UCS2String wherever the string must be parsed and/or manipulated. You can convert between those two representations any time.
Alternatives & Improvements
- Conversions from and to single-byte character encodings (e.g. ISO-8859-1) can be realized with the help of plain translation tables, e.g. const wchar_t tt_iso88951[256] = {0,1,2,...}; and appropriate code for conversion to and from UCS2.
- If UCS-2 is not sufficient, then switch to UCS-4 (typedef std::basic_string<uint32_t> UCS4String).
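The translation-table idea can be sketched as follows. The names TranslationTable, MakeIso88591Table and DecodeSingleByte are hypothetical; the only fact relied on is that ISO-8859-1 byte values coincide with the first 256 Unicode code points, so its table is the identity. Other single-byte encodings would need a table filled from their published mapping.

```cpp
#include <cstddef>
#include <string>

// Hypothetical table type: one wide character per possible byte value.
struct TranslationTable
{
    wchar_t to_wide[256];
};

// ISO-8859-1 byte values are identical to Unicode code points U+0000..U+00FF,
// so its translation table is simply 0, 1, 2, ..., 255.
TranslationTable MakeIso88591Table()
{
    TranslationTable table;
    for (int i = 0; i < 256; ++i)
    {
        table.to_wide[i] = static_cast<wchar_t>(i);
    }
    return table;
}

// Converts a single-byte encoded string to wide characters via table lookup.
std::wstring DecodeSingleByte(const std::string& bytes, const TranslationTable& table)
{
    std::wstring out;
    out.reserve(bytes.size());
    for (char c : bytes)
    {
        // Index through unsigned char so bytes above 127 are not negative.
        out.push_back(table.to_wide[static_cast<unsigned char>(c)]);
    }
    return out;
}
```

The reverse direction needs a decision about characters missing from the target encoding (for example replacing them with '?'), which is why it is left out of this sketch.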
ICU or other Unicode libraries?
Answer by Johannes Schaub - litb
- When you want to have wide characters stored in your string. wide depends on the implementation. Visual C++ defaults to 16 bit if I remember correctly, while GCC defaults depending on the target. It's 32 bits long here. Please note wchar_t (wide character type) has nothing to do with Unicode. It's merely guaranteed that it can store all the members of the largest character set that the implementation supports by its locales, and that it is at least as long as char. You can store Unicode strings fine in std::string using the utf-8 encoding too. But it won't understand the meaning of Unicode code points. So str.size() won't give you the number of logical characters in your string, but merely the number of char or wchar_t elements stored in that string/wstring. For that reason, the gtk/glib C++ wrapper folks have developed a Glib::ustring class that can handle utf-8.
If your wchar_t is 32 bits long, then you can use utf-32 as a Unicode encoding, and you can store and handle Unicode strings using a fixed (utf-32 is fixed length) encoding. This means your wstring's s.size() function will then return the right number of wchar_t elements and logical characters.
- Yes, char is always at least 8 bit long, which means it can store all ASCII values.
- Yes, all major compilers support it.
Answer by Johannes Schaub - litb
I frequently use std::string to hold utf-8 characters without any problems at all. I heartily recommend doing this when interfacing with APIs which use utf-8 as the native string type as well.
For example, I use utf-8 when interfacing my code with the Tcl interpreter.
The major caveat is that the length of the std::string is no longer the number of characters in the string.
Answer by ChrisW
- When you want to store 'wide' (Unicode) characters.
- Yes: 255 of them (excluding 0).
- Yes.
- Here's an introductory article: http://www.joelonsoftware.com/articles/Unicode.html
Answer by Greg Domjan
- When you want to use Unicode strings and not just ASCII; helpful for internationalisation.
- Yes, but it doesn't play well with 0.
- Not aware of any that don't.
- A wide character is the compiler specific way of handling the fixed length representation of a Unicode character. For MSVC it is a 2 byte character; for gcc I understand it is 4 bytes. And a +1 for http://www.joelonsoftware.com/articles/Unicode.html
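The remark about 0 can be illustrated with a short sketch. The three helper names below are hypothetical. The point is that std::string itself stores a zero byte without trouble, but anything that goes through a C-style char pointer stops at the first zero.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: builds the 3-byte buffer 'a', 0, 'b'. The explicit
// length is what lets the zero byte survive construction.
std::string make_buffer_with_zero()
{
    return std::string("a\0b", 3);
}

// Length as std::string tracks it: all stored bytes are counted.
std::size_t string_length(const std::string& s)
{
    return s.size();
}

// Length as the C string API sees it: counting stops at the first zero byte.
std::size_t c_length(const std::string& s)
{
    return std::strlen(s.c_str());
}
```

Here string_length reports 3 while c_length reports 1. Constructing from the literal without a length, as in std::string("a\0b"), already truncates to one char for the same reason.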
Answer by Leiyi.China
A good question! I think DATA ENCODING (sometimes a CHARSET is also involved) is a MEMORY EXPRESSION MECHANISM used to save data to a file or transfer data via a network, so I answer this question as:
1. When should I use std::wstring over std::string?
If the programming platform or API function is a single-byte one, and we want to process or parse some Unicode data, e.g. read from a Windows .REG file or a network 2-byte stream, we should declare a std::wstring variable to process it easily. E.g.: wstring ws=L"中国a" (6 octets of memory: 0x4E2D 0x56FD 0x0061); we can use ws[0] to get the character '中', ws[1] to get the character '国', ws[2] to get the character 'a', etc.
2. Can std::string hold the entire ASCII character set, including the special characters?
Yes. But notice: in American ASCII, each 0x00~0xFF octet stands for one character, including printable text such as "123abc&*_&" and the special ones you mentioned, which are mostly printed as a '.' to avoid confusing editors or terminals. Some other countries extend their own "ASCII" charset, e.g. Chinese, and use 2 octets to stand for one character.
3. Is std::wstring supported by all popular C++ compilers?
Maybe, or mostly. I have used VC++6 and GCC 3.3: YES.
4. What is exactly a "wide character"?
A wide character mostly indicates using 2 octets or 4 octets to hold all countries' characters. 2-octet UCS2 is a representative sample; for example, the English 'a' takes 2 octets of memory, 0x0061 (vs. 1 octet, 0x61, in ASCII).
Answer by Seppo Enarvi
Applications that are not satisfied with only 256 different characters have the options of either using wide characters (more than 8 bits) or a variable-length encoding (a multibyte encoding in C++ terminology) such as UTF-8. Wide characters generally require more space than a variable-length encoding, but are faster to process. Multi-language applications that process large amounts of text usually use wide characters when processing the text, but convert it to UTF-8 when storing it to disk.
The only difference between a string and a wstring is the data type of the characters they store. A string stores chars whose size is guaranteed to be at least 8 bits, so you can use strings for processing e.g. ASCII, ISO-8859-15, or UTF-8 text. The standard says nothing about the character set or encoding.
Practically every compiler uses a character set whose first 128 characters correspond with ASCII. This is also the case with compilers that use UTF-8 encoding. The important thing to be aware of when using strings in UTF-8 or some other variable-length encoding is that the indices and lengths are measured in bytes, not characters.
The data type of a wstring is wchar_t, whose size is not defined in the standard, except that it has to be at least as large as a char, usually 16 bits or 32 bits. wstring can be used for processing text in the implementation defined wide-character encoding. Because the encoding is not defined in the standard, it is not straightforward to convert between strings and wstrings. One cannot assume wstrings to have a fixed-length encoding either.
If you don't need multi-language support, you might be fine with using only regular strings. On the other hand, if you're writing a graphical application, it is often the case that the API supports only wide characters. Then you probably want to use the same wide characters when processing the text. Keep in mind that UTF-16 is a variable-length encoding, meaning that you cannot assume length() to return the number of characters. If the API uses a fixed-length encoding, such as UCS-2, processing becomes easy. Converting between wide characters and UTF-8 is difficult to do in a portable way, but then again, your user interface API probably supports the conversion.
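A minimal sketch of why length() is not a character count in UTF-16: code points above U+FFFF occupy two 16-bit units (a surrogate pair). The helper name utf16_code_points is hypothetical, std::u16string is used here instead of std::wstring so the example behaves the same on every platform, and the input is assumed to be well-formed UTF-16.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in well-formed UTF-16 by skipping
// low (trailing) surrogates, which only appear as the second half of a pair.
std::size_t utf16_code_points(const std::u16string& s)
{
    std::size_t count = 0;
    for (char16_t unit : s)
    {
        if (unit < 0xDC00 || unit > 0xDFFF)
        {
            ++count;
        }
    }
    return count;
}
```

For a string holding 'a' followed by U+1F600, size() reports 3 units while utf16_code_points reports 2 code points.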
Answer by Phil Rosenberg
There are some very good answers here, but I think there are a couple of things I can add regarding Windows/Visual Studio. This is based on my experience with VS2015. On Linux, basically the answer is to use UTF-8 encoded std::string everywhere. On Windows/VS it gets more complex. Here is why. Windows expects strings stored using chars to be encoded using the locale codepage. This is almost always the ASCII character set followed by 128 other special characters depending on your location. Let me just state that this is not just when using the Windows API; there are three other major places where these strings interact with standard C++. These are string literals, output to std::cout using <<, and passing a filename to std::fstream.
I will be up front here that I am a programmer, not a language specialist. I appreciate that UCS2 and UTF-16 are not the same, but for my purposes they are close enough to be interchangeable and I use them as such here. I'm not actually sure which Windows uses, but I generally don't need to know either. I've stated UCS2 in this answer, so sorry in advance if I upset anyone with my ignorance of this matter, and I'm happy to change it if I have things wrong.
String literals
If you enter string literals that contain only characters that can be represented by your codepage then VS stores them in your file with 1 byte per character encoding based on your codepage. Note that if you change your codepage or give your source to another developer using a different code page then I think (but haven't tested) that the character will end up different. If you run your code on a computer using a different code page then I'm not sure if the character will change too.
If you enter any string literals that cannot be represented by your codepage then VS will ask you to save the file as Unicode. The file will then be encoded as UTF-8. This means that all non-ASCII characters (including those which are on your codepage) will be represented by 2 or more bytes. This means if you give your source to someone else the source will look the same. However, before passing the source to the compiler, VS converts the UTF-8 encoded text to code page encoded text and any characters missing from the code page are replaced with ?.
The only way to guarantee correctly representing a Unicode string literal in VS is to precede the string literal with an L, making it a wide string literal. In this case VS will convert the UTF-8 encoded text from the file into UCS2. You then need to pass this string literal into a std::wstring constructor or you need to convert it to utf-8 and put it in a std::string. Or if you want you can use the Windows API functions to encode it using your code page to put it in a std::string, but then you may as well not have used a wide string literal.
std::cout
When outputting to the console using << you can only use std::string, not std::wstring, and the text must be encoded using your locale codepage. If you have a std::wstring then you must convert it using one of the Windows API functions, and any characters not on your codepage get replaced by ? (maybe you can change the character, I can't remember).
std::fstream filenames
Windows OS uses UCS2/UTF-16 for its filenames so whatever your codepage, you can have files with any Unicode character. But this means that to access or create files with characters not on your codepage you must use std::wstring. There is no other way. This is a Microsoft specific extension to std::fstream so it probably won't compile on other systems. If you use std::string then you can only utilise filenames that only include characters on your codepage.
Your options
If you are just working on Linux then you probably didn't get this far. Just use UTF-8 std::string everywhere.
If you are just working on Windows just use UCS2 std::wstring everywhere. Some purists may say use UTF8 then convert when needed, but why bother with the hassle.
If you are cross platform then it's a mess to be frank. If you try to use UTF-8 everywhere on Windows then you need to be really careful with your string literals and output to the console. You can easily corrupt your strings there. If you use std::wstring everywhere on Linux then you may not have access to the wide version of std::fstream, so you have to do the conversion, but there is no risk of corruption. So personally I think this is a better option. Many would disagree, but I'm not alone - it's the path taken by wxWidgets for example.
Another option could be to typedef unicodestring as std::string on Linux and std::wstring on Windows, and have a macro called UNI() which prefixes L on Windows and nothing on Linux, then the code
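#include <fstream>
#include <iostream>
#include <string>
#ifdef _WIN32
#include <Windows.h> // only available on Windows, so keep it inside the #ifdef
typedef std::wstring unicodestring;
#define UNI(text) L ## text
// Convert wide text to the console code page so it can go through std::cout
std::string formatForConsole(const unicodestring &str)
{
    std::string result;
    if (str.empty())
        return result;
    const UINT codePage = GetConsoleOutputCP();
    const int length = static_cast<int>(str.size());
    // First call: ask WideCharToMultiByte how many bytes the result needs
    const int size = WideCharToMultiByte(codePage, 0, str.data(), length,
                                         NULL, 0, NULL, NULL);
    if (size <= 0)
        return result;
    result.resize(size);
    // Second call: do the conversion into the buffer
    WideCharToMultiByte(codePage, 0, str.data(), length,
                        &result[0], size, NULL, NULL);
    return result;
}
#else
typedef std::string unicodestring;
#define UNI(text) text
// Assumed to be UTF-8 already, which a Linux console prints as is
std::string formatForConsole(const unicodestring &str)
{
    return str;
}
#endif
int main()
{
    unicodestring fileName(UNI("fileName"));
    std::ofstream fout;
    // On Windows this uses the Microsoft specific std::wstring overload of open()
    fout.open(fileName);
    std::cout << formatForConsole(fileName) << std::endl;
    return 0;
}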
would be fine on either platform I think.
Answers
So to answer your questions:
1) If you are programming for Windows, then all the time. If cross platform, then maybe all the time, unless you want to deal with possible corruption issues on Windows or write some code with platform specific #ifdefs to work around the differences. If just using Linux, then never.
2) Yes. In addition, on Linux you can use it for all Unicode too. On Windows you can only use it for all Unicode if you choose to manually encode using UTF-8. But the Windows API and standard C++ classes will expect the std::string to be encoded using the locale codepage. This includes all ASCII plus another 128 characters which change depending on the codepage your computer is set up to use.
3) I believe so, but if not then it is just a simple typedef of a 'std::basic_string' using wchar_t instead of char.
4) A wide character is a character type which is bigger than the 1 byte standard char type. On Windows it is 2 bytes, on Linux it is 4 bytes.