How do I get STL std::string to work with Unicode on Windows?
Original question: http://stackoverflow.com/questions/3257263/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by NSA
At my company we have a cross-platform (Linux and Windows) library that contains our own extension of the STL std::string. This class provides all sorts of functionality on top of the string: split, format, to/from base64, etc. Recently we were given the requirement of making this string Unicode "friendly"; basically, it needs to support characters from Chinese, Japanese, Arabic, etc. After initial research this seems fine on the Linux side, since everything is inherently UTF-8. However, I am having trouble with the Windows side: is there a trick to getting the STL std::string to work as UTF-8 on Windows? Is it even possible? Is there a better way? Ideally we would stay based on std::string, since that is what the string class is based on in Linux.
Thank you,
Answered by Thomas
There are several misconceptions in your question.
- Neither C++ nor the STL deals with encodings. std::string is essentially a string of bytes, not characters. So you should have no problem stuffing UTF-8 encoded Unicode into it. However, keep in mind that all string functions also work on bytes, so myString.length() will give you the number of bytes, not the number of characters.
- Linux is not inherently UTF-8. Most distributions nowadays default to UTF-8, but it should not be relied upon.
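To make the bytes-versus-characters point concrete, here is a minimal sketch that counts code points in a std::string. The helper name count_code_points is hypothetical, and the sketch assumes the string already holds valid UTF-8; it does no validation.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a string assumed to contain valid UTF-8.
// Continuation bytes have the bit pattern 10xxxxxx; every other byte
// starts a new code point, so we count only the non-continuation bytes.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For a string holding two three-byte characters, length() reports 6 while this helper reports 2. Note that a code point is still not necessarily what a user perceives as one character (combining marks, for example), so this is only a first approximation.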
Answered by Thanatos
Yes - by being more aware of locales and encodings.
Windows has two function calls for everything that requires text: a FoobarA() and a FoobarW(). The *W() functions take UTF-16 encoded strings; the *A() functions take strings in the current code page. However, Windows doesn't support a UTF-8 code page, so you can't directly use it in that sense with the *A() functions, nor would you want to depend on that being set by users. If you want "Unicode" in Windows, use the Unicode-capable (*W) functions. There are tutorials out there; Googling "Unicode Windows tutorial" should get you some.
If you are storing UTF-8 data in a std::string, then before you pass it off to Windows, convert it to UTF-16 (Windows provides functions for doing such), and then pass it to Windows.
Many of these problems arise from C/C++ being generally encoding-agnostic. char isn't really a character, it's just an integral type. Even using char arrays to store UTF-8 data can get you into trouble if you need to access individual code units, as char's signedness is implementation-defined. A statement like str[x] < 0x80 to check for multiple-byte characters can quickly introduce a bug. (That statement is always true if char is signed.) A UTF-8 code unit is an unsigned integral type with a range of 0-255. That maps to the C type uint8_t exactly, although unsigned char works as well. Ideally, then, I'd make a UTF-8 string an array of uint8_ts, but due to old APIs, this is rarely done.
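One way to sidestep the signedness trap described above is to convert each code unit to unsigned char before comparing. This is only a sketch; the helper name is_ascii_unit is made up for illustration.

```cpp
#include <cstddef>
#include <string>

// Returns true if the code unit at index i is a single-byte (ASCII) character.
// The cast matters: where char is signed, bytes >= 0x80 are negative, so a
// plain `utf8[i] < 0x80` would be true for every byte.
bool is_ascii_unit(const std::string& utf8, std::size_t i) {
    return static_cast<unsigned char>(utf8[i]) < 0x80;
}
```

With the cast in place, the result is the same whether the compiler treats char as signed or unsigned.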
Some people have recommended wchar_t, claiming it to be "A Unicode character type" or something like that. Again, here the standard is just as agnostic as before, as C is meant to work anywhere, and anywhere might not be using Unicode. Thus, wchar_t is no more Unicode than char. The standard states:
which is an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales
In Linux, a wchar_t represents a UTF-32 code unit / code point. It is thus 4 bytes. However, in Windows, it's a UTF-16 code unit, and is only 2 bytes. (Which, I would have said, does not conform to the above, since 2 bytes cannot represent all of Unicode, but that's the way it works.) This size difference, and the difference in data encoding, clearly puts a strain on portability. The Unicode standard itself recommends against wchar_t if you need portability. (§5.2)
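To illustrate why a single 16-bit unit cannot cover all of Unicode, here is a small sketch of UTF-16 encoding for one code point. The function name encode_utf16 is hypothetical, and the input is assumed to be a valid code point (at most 0x10FFFF and not itself a surrogate).

```cpp
#include <string>

// Encodes one Unicode code point as UTF-16.
// Assumes cp is valid: <= 0x10FFFF and outside the surrogate range.
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one 16-bit code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Beyond the BMP: split the 20 remaining bits into a surrogate pair.
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF))); // low surrogate
    }
    return out;
}
```

A code point such as U+4E2D fits in one unit, while U+1F600 needs two, which is why code that treats one 16-bit wchar_t as one character breaks outside the BMP.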
The end lesson: I find it easiest to store all my data in some well-declared format. (Typically UTF-8, usually in std::strings, but I'd really like something better.) The important thing here is not the UTF-8 part, but rather that I know that my strings are UTF-8. If I'm passing them to some other API, I must also know that that API expects UTF-8 strings. If it doesn't, then I must convert them. (Thus, if I speak to the Windows API, I must convert strings to UTF-16 first.) A UTF-8 text string is an "orange", and a "latin1" text string is an "apple". A char array that doesn't know what encoding it is in is a recipe for disaster.
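As a small illustration of "convert when the encodings differ", the sketch below turns an "apple" into an "orange": it re-encodes a latin1 string as UTF-8. The name latin1_to_utf8 is hypothetical, and "latin1" is assumed to mean ISO-8859-1, where each byte value equals its code point. Windows-1252 differs in the 0x80-0x9F range and would need a lookup table instead.

```cpp
#include <string>

// Re-encodes ISO-8859-1 (latin1) bytes as UTF-8.
// Every latin1 byte is the code point U+0000..U+00FF with the same value,
// so bytes below 0x80 are copied and the rest become two-byte sequences.
std::string latin1_to_utf8(const std::string& latin1) {
    std::string utf8;
    utf8.reserve(latin1.size());
    for (unsigned char byte : latin1) {
        if (byte < 0x80) {
            utf8.push_back(static_cast<char>(byte));
        } else {
            utf8.push_back(static_cast<char>(0xC0 | (byte >> 6)));   // lead byte 110xxxxx
            utf8.push_back(static_cast<char>(0x80 | (byte & 0x3F))); // continuation 10xxxxxx
        }
    }
    return utf8;
}
```

The function only makes sense because the caller knows the input encoding; applied to a string that is already UTF-8, it would silently produce garbage, which is exactly the disaster the answer warns about.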
Answered by Jerry Coffin
Putting UTF-8 code units into an std::string should be fine regardless of platform. The problem on Windows is that almost nothing else expects or works with UTF-8; it expects and works with UTF-16 instead. You can switch to an std::wstring, which will store UTF-16 (at least on most Windows compilers), or you can write other routines that will accept UTF-8 (probably by converting to UTF-16, and then passing through to the OS).
Answered by Mark B
Have you looked at std::wstring? It's a version of std::basic_string for wchar_t rather than the char that std::string uses.
Answered by atzz
No, there is no way to make Windows treat "narrow" strings as UTF-8.
Here is what works best for me in this situation (cross-platform application that has Windows and Linux builds).
- Use std::string in the cross-platform portion of the code. Assume that it always contains UTF-8 strings.
- In the Windows portion of the code, use the "wide" versions of the Windows API explicitly, e.g. write CreateFileW instead of CreateFile. This avoids a dependency on the build system configuration.
- In the platform abstraction layer, convert between UTF-8 and UTF-16 where needed (MultiByteToWideChar/WideCharToMultiByte).
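The three steps above could be sketched roughly as follows. The names native_string, to_native and from_native are hypothetical and not from the answer. On Windows the sketch calls MultiByteToWideChar and WideCharToMultiByte with CP_UTF8; on other platforms it assumes the native APIs already accept UTF-8 and passes the string through unchanged.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

#ifdef _WIN32
#include <windows.h>

// On Windows the native string type is UTF-16 in a std::wstring.
using native_string = std::wstring;

// UTF-8 -> UTF-16, for handing strings to *W() functions.
native_string to_native(const std::string& utf8) {
    if (utf8.empty()) return native_string();
    const int in_len = static_cast<int>(utf8.size());
    // First call: ask for the required number of UTF-16 code units.
    const int out_len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                            utf8.data(), in_len, nullptr, 0);
    if (out_len <= 0) throw std::runtime_error("to_native: invalid UTF-8");
    native_string out(static_cast<std::size_t>(out_len), L'\0');
    // Second call: perform the conversion into the sized buffer.
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), in_len, &out[0], out_len);
    return out;
}

// UTF-16 -> UTF-8, for strings coming back from *W() functions.
std::string from_native(const native_string& wide) {
    if (wide.empty()) return std::string();
    const int in_len = static_cast<int>(wide.size());
    const int out_len = WideCharToMultiByte(CP_UTF8, 0, wide.data(), in_len,
                                            nullptr, 0, nullptr, nullptr);
    if (out_len <= 0) throw std::runtime_error("from_native: conversion failed");
    std::string out(static_cast<std::size_t>(out_len), '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.data(), in_len,
                        &out[0], out_len, nullptr, nullptr);
    return out;
}
#else
// Assumption: elsewhere the OS APIs take UTF-8 bytes directly.
using native_string = std::string;
native_string to_native(const std::string& utf8) { return utf8; }
std::string from_native(const native_string& native) { return native; }
#endif
```

Windows-specific code would then pass to_native(path).c_str() to a wide function such as CreateFileW, while the cross-platform code keeps working with plain UTF-8 std::string.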
Other approaches that I tried but don't like much:
- typedef std::basic_string<TCHAR> tstring; then use tstring in the business code. Wrappers/overloads can be made to streamline conversion between std::string and tstring, but it still adds a lot of pain.
- Use std::wstring everywhere. This does not help much since wchar_t is 16 bits on Windows, so you either have to restrict yourself to the BMP or go through a lot of complications to make the Unicode-handling code cross-platform. In the latter case, all benefits over UTF-8 evaporate.
- Use ATL/WTL/MFC CString in the platform-specific portion; use std::string in the cross-platform portion. This is actually a variant of what I recommend above. CString is in many aspects superior to std::string (in my opinion). But it introduces an additional dependency and is thus not always acceptable or convenient.
Answered by Philipp
If you want to avoid headaches, don't use the STL string types at all. C++ knows nothing about Unicode or encodings, so to be portable, it's better to use a library that is tailored for Unicode support, e.g. the ICU library. ICU uses UTF-16 strings by default, so no conversion is required, and supports conversions to many other important encodings like UTF-8. Also try to use cross-platform libraries like Boost.Filesystem for things like path manipulations (boost::wpath). Avoid std::string and std::fstream.
Answered by dan04
In the Windows API and C runtime library, char* parameters are interpreted as being encoded in the "ANSI" code page. The problem is that UTF-8 isn't supported as an ANSI code page, which I find incredibly annoying.
I'm in a similar situation, being in the middle of porting software from Windows to Linux while also making it Unicode-aware. The approach we've taken for this is:
- Use UTF-8 as the default encoding for strings.
- In Windows-specific code, always call the "W" version of functions, converting string arguments between UTF-8 and UTF-16 as necessary.
This is also the approach Poco has taken.
Answered by Swift - Friday Pie
It's really platform-dependent; Unicode is a headache. It depends on which compiler you use. For older ones from MS (VS2010 or older), you would need to use the API described in MSDN.
For VS2015:
// Path backslashes must be escaped; the s suffix requires `using namespace std::string_literals;`.
std::string _old = u8"D:\\Folder\\This \xe2\x80\x93 by ABC.txt"s;
according to their docs. I can't check that one.
For mingw, gcc, etc.:
// Pre-C++20 only: since C++20, u8 literals have type const char8_t[] and cannot initialize a std::string.
std::string _old = u8"D:\\Folder\\This \xe2\x80\x93 by ABC.txt";
std::cout << _old.data();
The output contains the proper file name...