C++: Convert between string, u16string and u32string
Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep it under the same license and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/7232710/
Convert between string, u16string & u32string
Asked by DrYap
I've been looking for a way to convert between the Unicode string types and came across this method. Not only do I not completely understand the method (there are no comments) but also the article implies that in future there will be better methods.
If this is the best method, could you please point out what makes it work, and if not I would like to hear suggestions for better methods.
Answered by bames53
mbstowcs() and wcstombs() don't necessarily convert to UTF-16 or UTF-32; they convert to wchar_t and whatever the locale's wchar_t encoding is. All Windows locales use a two-byte wchar_t with UTF-16 as the encoding, but the other major platforms use a 4-byte wchar_t with UTF-32 (or even a non-Unicode encoding for some locales). A platform that only supports single-byte encodings could even have a one-byte wchar_t and have the encoding differ by locale. So wchar_t seems to me to be a bad choice for portability and Unicode. *
Some better options have been introduced in C++11: new specializations of std::codecvt, new codecvt classes, and a new template that makes using them for conversions very convenient.
First the new template class for using codecvt is std::wstring_convert. Once you've created an instance of a std::wstring_convert class you can easily convert between strings:
std::wstring_convert<...> convert; // ... filled in with a codecvt to do UTF-8 <-> UTF-16
std::string utf8_string = u8"This string has UTF-8 content";
std::u16string utf16_string = convert.from_bytes(utf8_string);
std::string another_utf8_string = convert.to_bytes(utf16_string);
In order to do different conversion you just need different template parameters, one of which is a codecvt facet. Here are some new facets that are easy to use with wstring_convert:
std::codecvt_utf8_utf16<char16_t> // converts between UTF-8 <-> UTF-16
std::codecvt_utf8<char32_t> // converts between UTF-8 <-> UTF-32
std::codecvt_utf8<char16_t> // converts between UTF-8 <-> UCS-2 (warning, not UTF-16! Don't bother using this one)
Examples of using these:
std::wstring_convert<std::codecvt_utf8_utf16<char16_t>,char16_t> convert;
std::string a = convert.to_bytes(u"This string has UTF-16 content");
std::u16string b = convert.from_bytes(u8"blah blah blah");
The new std::codecvt specializations are a bit harder to use because they have a protected destructor. To get around that you can define a subclass that has a destructor, or you can use the std::use_facet template function to get an existing codecvt instance. Also, an issue with these specializations is you can't use them in Visual Studio 2010 because template specialization doesn't work with typedef'd types and that compiler defines char16_t and char32_t as typedefs. Here's an example of defining your own subclass of codecvt:
template <class internT, class externT, class stateT>
struct codecvt : std::codecvt<internT,externT,stateT>
{ ~codecvt(){} };
std::wstring_convert<codecvt<char16_t,char,std::mbstate_t>,char16_t> convert16;
std::wstring_convert<codecvt<char32_t,char,std::mbstate_t>,char32_t> convert32;
The char16_t specialization converts between UTF-16 and UTF-8. The char32_t specialization converts between UTF-32 and UTF-8.
Note that these new conversions provided by C++11 don't include any way to convert directly between UTF-32 and UTF-16. Instead you just have to combine two instances of std::wstring_convert.
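One way to read "combine two instances" is to use UTF-8 as the intermediate form: one converter turns UTF-16 into UTF-8 bytes and a second turns those bytes into UTF-32. The following is a minimal sketch under that assumption; the helper names utf16_to_utf32 and utf32_to_utf16 are made up for illustration, and note that std::wstring_convert and <codecvt> have been deprecated since C++17, so a compiler may warn about them.

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: UTF-16 -> UTF-8 -> UTF-32.
std::u32string utf16_to_utf32(const std::u16string& in)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv16;
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv32;
    // to_bytes produces UTF-8, from_bytes decodes it into UTF-32.
    return conv32.from_bytes(conv16.to_bytes(in));
}

// Hypothetical helper: UTF-32 -> UTF-8 -> UTF-16.
std::u16string utf32_to_utf16(const std::u32string& in)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv16;
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv32;
    return conv16.from_bytes(conv32.to_bytes(in));
}
```

Both converters throw std::range_error on malformed input unless they are constructed with an error string, so callers that handle untrusted text may want a try/catch around these calls.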
* I thought I'd add a note on wchar_t and its purpose, to emphasize why it should not generally be used for Unicode or portable internationalized code. The following is a short version of my answer: https://stackoverflow.com/a/11107667/365496
What is wchar_t?
wchar_t is defined such that any locale's char encoding can be converted to wchar_t where every wchar_t represents exactly one codepoint:
Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (22.3.1). -- [basic.fundamental] 3.9.1/5
This does not require that wchar_t be large enough to represent any character from all locales simultaneously. That is, the encoding used for wchar_t may differ between locales, which means that you cannot necessarily convert a string to wchar_t using one locale and then convert back to char using another locale.
Since that seems to be the primary use in practice for wchar_t you might wonder what it's good for if not that.
The original intent and purpose of wchar_t was to make text processing simple by defining it such that it requires a one-to-one mapping from a string's code units to the text's characters, thus allowing the same simple algorithms used with ASCII strings to work with other languages.
Unfortunately the requirements on wchar_t assume a one-to-one mapping between characters and codepoints to achieve this. Unicode breaks that assumption, so you can't safely use wchar_t for simple text algorithms either.
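A small illustration of how Unicode breaks the one-character-one-codepoint assumption: the letter "é" can be written either as the single precomposed codepoint U+00E9 or as "e" followed by the combining acute accent U+0301. The sketch below (function names are hypothetical) stores both spellings in UTF-32, where every element is exactly one codepoint, and they still differ in length and compare unequal.

```cpp
#include <cassert>
#include <string>

// "é" as one precomposed codepoint, U+00E9.
std::u32string precomposed_e_acute() { return U"\u00E9"; }

// The same user-perceived character as two codepoints:
// U+0065 (e) followed by U+0301 (combining acute accent).
std::u32string decomposed_e_acute() { return U"e\u0301"; }
```

So even with a fixed-width codepoint encoding, size() does not count characters and operator== does not compare them; that takes normalization or grapheme segmentation, which the standard library does not provide.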
This means that portable software cannot use wchar_t either as a common representation for text between locales, or to enable the use of simple text algorithms.
What use is wchar_t today?
Not much, for portable code anyway. If __STDC_ISO_10646__ is defined then values of wchar_t directly represent Unicode codepoints with the same values in all locales. That makes it safe to do the inter-locale conversions mentioned earlier. However, you can't rely only on it to decide that you can use wchar_t this way because, while most Unix platforms define it, Windows does not, even though Windows uses the same wchar_t encoding in all locales.
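As a rough sketch of how code might act on that macro: when __STDC_ISO_10646__ is defined, each wchar_t value is a Unicode codepoint, so widening a wstring to a u32string is an element-wise copy. The function name wide_to_utf32 and its bool-returning shape are assumptions made for this example, not an established API.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper. Returns true and fills `out` only when the
// implementation promises that wchar_t values are ISO 10646 codepoints.
bool wide_to_utf32(const std::wstring& in, std::u32string& out)
{
#ifdef __STDC_ISO_10646__
    out.assign(in.begin(), in.end()); // each wchar_t is already a codepoint
    return true;
#else
    (void)in;
    (void)out;
    return false; // no portable guarantee about what wchar_t holds
#endif
}
```

On a platform that does not define the macro, such as Windows, the function reports failure instead of guessing, and the caller has to fall back to a platform-specific conversion.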
I think the reason Windows doesn't define __STDC_ISO_10646__ is that Windows uses UTF-16 as its wchar_t encoding, and UTF-16 uses surrogate pairs to represent codepoints greater than U+FFFF, which means that UTF-16 doesn't satisfy the requirements for __STDC_ISO_10646__.
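To make the surrogate-pair point concrete, here is a minimal sketch of the standard UTF-16 encoding rule for a single codepoint. The function name encode_utf16 is made up for illustration, and the code assumes its argument is a valid Unicode scalar value (at most U+10FFFF and not itself a surrogate); it does no validation.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one codepoint as UTF-16 code units.
// Assumes cp is a valid Unicode scalar value.
std::u16string encode_utf16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp)); // fits in one code unit
    } else {
        cp -= 0x10000; // 20 bits remain, split into two 10-bit halves
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF))); // low surrogate
    }
    return out;
}
```

Any codepoint above U+FFFF therefore occupies two 16-bit elements, so a 16-bit wchar_t cannot hold every codepoint in a single value.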
For platform specific code wchar_t may be more useful. It's essentially required on Windows (e.g., some files simply cannot be opened without using wchar_t filenames), though Windows is the only platform where this is true as far as I know (so maybe we can think of wchar_t as 'Windows_char_t').
In hindsight wchar_t is clearly not useful for simplifying text handling, or as storage for locale independent text. Portable code should not attempt to use it for these purposes.
Answered by dimon4eg
I've written helper functions to convert to/from UTF8 strings (C++11):
#include <string>
#include <locale>
#include <codecvt>
using namespace std;
template <typename T>
string toUTF8(const basic_string<T, char_traits<T>, allocator<T>>& source)
{
string result;
wstring_convert<codecvt_utf8_utf16<T>, T> convertor;
result = convertor.to_bytes(source);
return result;
}
template <typename T>
void fromUTF8(const string& source, basic_string<T, char_traits<T>, allocator<T>>& result)
{
wstring_convert<codecvt_utf8_utf16<T>, T> convertor;
result = convertor.from_bytes(source);
}
Usage example:
// Unicode <-> UTF8
{
wstring uStr = L"Unicode string";
string str = toUTF8(uStr);
wstring after;
fromUTF8(str, after);
assert(uStr == after);
}
// UTF16 <-> UTF8
{
u16string uStr;
uStr.push_back('A');
string str = toUTF8(uStr);
u16string after;
fromUTF8(str, after);
assert(uStr == after);
}
Answered by Raphael R.
As far as I know, C++ provides no standard methods to convert from or to UTF-32. However, for UTF-16 there are the methods mbstowcs (multi-byte to wide character string) and its inverse, wcstombs.
If you need UTF-32 too, you need iconv, which is in POSIX 2001 but not in standard C, so on Windows you'll need a replacement like libiconv.
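For the iconv route, a conversion from UTF-8 to UTF-32 might look roughly like the sketch below. It is not a definitive implementation: the helper name utf8_to_utf32 is made up, and the code assumes that the iconv implementation recognizes the encoding names "UTF-8" and "UTF-32LE", that the host is little-endian (so the output bytes map directly onto char32_t), and that iconv takes char** for its input buffer, as glibc does.

```cpp
#include <cassert>
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical helper: UTF-8 -> UTF-32 via POSIX iconv.
// Assumes a little-endian host and an iconv that knows "UTF-32LE".
std::u32string utf8_to_utf32(const std::string& in)
{
    iconv_t cd = iconv_open("UTF-32LE", "UTF-8"); // arguments are (to, from)
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    // One UTF-8 byte yields at most one codepoint, so in.size() elements suffice.
    std::u32string out(in.size(), U'\0');
    std::string input = in; // iconv wants a non-const input pointer
    char* src = &input[0];
    size_t src_left = input.size();
    char* dst = reinterpret_cast<char*>(&out[0]);
    size_t dst_left = out.size() * sizeof(char32_t);

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        throw std::runtime_error("iconv failed: invalid or incomplete input");

    // Trim the elements that iconv did not write.
    out.resize(out.size() - dst_left / sizeof(char32_t));
    return out;
}
```

Using "UTF-32LE" rather than plain "UTF-32" avoids a byte-order mark at the start of the output on implementations that emit one.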
Here's an example on how to use mbstowcs:
#include <string>
#include <iostream>
#include <stdlib.h>
using namespace std;
wstring widestring(const string &text);
int main()
{
string text;
cout << "Enter something: ";
cin >> text;
wcout << L"You entered " << widestring(text) << ".\n";
return 0;
}
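wstring widestring(const string &text)
{
    // A wide string never needs more elements than the input has bytes.
    wstring result(text.length(), L'\0');
    size_t written = mbstowcs(&result[0], text.c_str(), text.length());
    if (written == static_cast<size_t>(-1))
        return wstring(); // invalid multibyte sequence for the current locale
    result.resize(written); // drop the unused tail
    return result;
}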
The reverse goes like this:
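string mbstring(const wstring &text)
{
    // Each wide character can expand to at most MB_CUR_MAX bytes.
    string result(text.length() * MB_CUR_MAX, '\0');
    size_t written = wcstombs(&result[0], text.c_str(), result.size());
    if (written == static_cast<size_t>(-1))
        return string(); // a character has no representation in the current locale
    result.resize(written); // drop the unused tail
    return result;
}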
Nitpick: Yes, I know, the size of wchar_t is implementation-defined, so it could be 4 bytes (UTF-32). However, I don't know of a compiler which does that.