C++: converting between UTF-8 and wide characters in the STL
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/148403/
UTF8 to/from wide char conversion in STL
Asked by Vladimir Grigorov
Is it possible to convert a UTF-8 string in a std::string to a std::wstring and vice versa in a platform-independent manner? In a Windows application I would use MultiByteToWideChar and WideCharToMultiByte. However, the code is compiled for multiple OSes and I'm limited to the standard C++ library.
Answered by Vladimir Grigorov
I asked this question 5 years ago. This thread was very helpful for me back then; I came to a conclusion, then I moved on with my project. It is funny that I needed something similar recently, totally unrelated to that project from the past. As I was researching possible solutions, I stumbled upon my own question :)
The solution I chose now is based on C++11. The boost libraries that Constantin mentions in his answer are now part of the standard. If we replace std::wstring with the new string type std::u16string, then the conversions will look like this:
UTF-8 to UTF-16
std::string source;
...
std::wstring_convert<std::codecvt_utf8_utf16<char16_t>,char16_t> convert;
std::u16string dest = convert.from_bytes(source);
UTF-16 to UTF-8
std::u16string source;
...
std::wstring_convert<std::codecvt_utf8_utf16<char16_t>,char16_t> convert;
std::string dest = convert.to_bytes(source);
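The two snippets above can be combined into a small round-trip check. This is only a sketch: the names utf8_to_utf16, utf16_to_utf8 and round_trip_demo are made up for illustration and are not part of the answer. Note also that std::wstring_convert and std::codecvt_utf8_utf16 have been deprecated since C++17, so newer compilers may emit deprecation warnings for this code.

```cpp
#include <cassert>
#include <codecvt>   // std::codecvt_utf8_utf16 (deprecated since C++17)
#include <locale>    // std::wstring_convert (deprecated since C++17)
#include <string>

// Hypothetical helper: wraps the UTF-8 to UTF-16 conversion shown above.
std::u16string utf8_to_utf16(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    return convert.from_bytes(utf8);   // throws std::range_error on malformed input
}

// Hypothetical helper: wraps the UTF-16 to UTF-8 conversion shown above.
std::string utf16_to_utf8(const std::u16string& utf16)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    return convert.to_bytes(utf16);
}

// Round-trips a string holding a 1-byte, 2-byte, 3-byte and 4-byte sequence:
// 'a', U+00E9, U+20AC and U+1F600.
void round_trip_demo()
{
    const std::string utf8 = "a\xC3\xA9\xE2\x82\xAC\xF0\x9F\x98\x80";
    const std::u16string utf16 = utf8_to_utf16(utf8);
    assert(utf16.size() == 5);                          // 3 BMP units + 1 surrogate pair
    assert(utf16[3] == 0xD83D && utf16[4] == 0xDE00);   // surrogate pair for U+1F600
    assert(utf16_to_utf8(utf16) == utf8);
}
```

Creating a fresh converter in each helper keeps the functions stateless, at the cost of constructing the facet on every call.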
As seen from the other answers, there are multiple approaches to the problem. That's why I refrain from picking an accepted answer.
Answered by Constantin
You can extract utf8_codecvt_facet from the Boost serialization library.
Their usage example:
typedef wchar_t ucs4_t;
std::locale old_locale;
std::locale utf8_locale(old_locale, new utf8_codecvt_facet<ucs4_t>);
// Set a New global locale
std::locale::global(utf8_locale);
// Send the UCS-4 data out, converting to UTF-8
{
    std::wofstream ofs("data.ucd");
    ofs.imbue(utf8_locale);
    std::copy(ucs4_data.begin(), ucs4_data.end(),
              std::ostream_iterator<ucs4_t, ucs4_t>(ofs));
}
// Read the UTF-8 data back in, converting to UCS-4 on the way in
std::vector<ucs4_t> from_file;
{
    std::wifstream ifs("data.ucd");
    ifs.imbue(utf8_locale);
    ucs4_t item = 0;
    while (ifs >> item) from_file.push_back(item);
}
Look for the utf8_codecvt_facet.hpp and utf8_codecvt_facet.cpp files in the Boost sources.
Answered by Mark Ransom
The problem definition explicitly states that the 8-bit character encoding is UTF-8. That makes this a trivial problem; all it requires is a little bit-twiddling to convert from one UTF spec to another.
Just look at the encodings on these Wikipedia pages for UTF-8, UTF-16, and UTF-32.
The principle is simple - go through the input and assemble a 32-bit Unicode code point according to one UTF spec, then emit the code point according to the other spec. The individual code points need no translation, as would be required with any other character encoding; that's what makes this a simple problem.
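The principle can be sketched for one direction, UTF-8 to UTF-16, by keeping the "assemble a code point" and "emit a code point" steps as separate functions. The function names decode_utf8, encode_utf16, transcode and transcode_demo are made up for this illustration, and, like the rest of this answer, the sketch assumes well-formed input and does no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Step 1: assemble one code point from the UTF-8 sequence starting at s[i],
// advancing i past it. Assumes well-formed, untruncated input.
char32_t decode_utf8(const std::string& s, std::size_t& i)
{
    const unsigned char lead = static_cast<unsigned char>(s[i++]);
    char32_t cp;
    int trailing;                       // number of continuation bytes to read
    if (lead < 0x80)      { cp = lead;        trailing = 0; }
    else if (lead < 0xE0) { cp = lead & 0x1F; trailing = 1; }
    else if (lead < 0xF0) { cp = lead & 0x0F; trailing = 2; }
    else                  { cp = lead & 0x07; trailing = 3; }
    while (trailing-- > 0)
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

// Step 2: emit the code point according to the other spec (UTF-16 here).
void encode_utf16(char32_t cp, std::u16string& out)
{
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;                  // 20 bits remain, split 10/10
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
}

std::u16string transcode(const std::string& utf8)
{
    std::u16string out;
    for (std::size_t i = 0; i < utf8.size(); )
        encode_utf16(decode_utf8(utf8, i), out);
    return out;
}

void transcode_demo()
{
    // U+20AC is E2 82 AC in UTF-8 and the single unit 20AC in UTF-16.
    assert(transcode("\xE2\x82\xAC") == std::u16string(1, char16_t(0x20AC)));
    // U+1F600 is F0 9F 98 80 in UTF-8 and the pair D83D DE00 in UTF-16.
    const std::u16string pair = transcode("\xF0\x9F\x98\x80");
    assert(pair.size() == 2 && pair[0] == 0xD83D && pair[1] == 0xDE00);
}
```

Swapping in a different encode step (for example one that emits UTF-32) changes the target encoding without touching the decoder, which is the point of going through the code point.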
Here's a quick implementation of wchar_t to UTF-8 conversion and vice versa. It assumes that the input is already properly encoded - the old saying "Garbage in, garbage out" applies here. I believe that verifying the encoding is best done as a separate step.
std::string wchar_to_UTF8(const wchar_t * in)
{
    std::string out;
    unsigned int codepoint = 0;
    for (; *in != 0; ++in)
    {
        if (*in >= 0xd800 && *in <= 0xdbff)
            // High surrogate: keep its 10 bits and wait for the low surrogate.
            codepoint = ((*in - 0xd800) << 10) + 0x10000;
        else
        {
            if (*in >= 0xdc00 && *in <= 0xdfff)
                // Low surrogate: merge in the remaining 10 bits.
                codepoint |= *in - 0xdc00;
            else
                codepoint = *in;
            // Emit the completed code point as 1 to 4 UTF-8 bytes.
            if (codepoint <= 0x7f)
                out.append(1, static_cast<char>(codepoint));
            else if (codepoint <= 0x7ff)
            {
                out.append(1, static_cast<char>(0xc0 | ((codepoint >> 6) & 0x1f)));
                out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
            }
            else if (codepoint <= 0xffff)
            {
                out.append(1, static_cast<char>(0xe0 | ((codepoint >> 12) & 0x0f)));
                out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
                out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
            }
            else
            {
                out.append(1, static_cast<char>(0xf0 | ((codepoint >> 18) & 0x07)));
                out.append(1, static_cast<char>(0x80 | ((codepoint >> 12) & 0x3f)));
                out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
                out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
            }
            codepoint = 0;
        }
    }
    return out;
}
The above code works for both UTF-16 and UTF-32 input, simply because the values d800 through dfff are invalid code points; they indicate that you're decoding UTF-16. If you know that wchar_t is 32 bits then you could remove some code to optimize the function.
std::wstring UTF8_to_wchar(const char * in)
{
    std::wstring out;
    unsigned int codepoint = 0;
    while (*in != 0)
    {
        unsigned char ch = static_cast<unsigned char>(*in);
        if (ch <= 0x7f)
            codepoint = ch;
        else if (ch <= 0xbf)
            // Continuation byte: shift in 6 more bits.
            codepoint = (codepoint << 6) | (ch & 0x3f);
        else if (ch <= 0xdf)
            codepoint = ch & 0x1f;
        else if (ch <= 0xef)
            codepoint = ch & 0x0f;
        else
            codepoint = ch & 0x07;
        ++in;
        // The code point is complete once the next byte is not a continuation byte.
        if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff))
        {
            if (sizeof(wchar_t) > 2)
                out.append(1, static_cast<wchar_t>(codepoint));
            else if (codepoint > 0xffff)
            {
                // Surrogate pairs encode the offset from 0x10000, not the raw code point.
                codepoint -= 0x10000;
                out.append(1, static_cast<wchar_t>(0xd800 + (codepoint >> 10)));
                out.append(1, static_cast<wchar_t>(0xdc00 + (codepoint & 0x03ff)));
            }
            else if (codepoint < 0xd800 || codepoint >= 0xe000)
                out.append(1, static_cast<wchar_t>(codepoint));
        }
    }
    return out;
}
Again, if you know that wchar_t is 32 bits you could remove some code from this function, but in this case it shouldn't make any difference. The expression sizeof(wchar_t) > 2 is known at compile time, so any decent compiler will recognize dead code and remove it.
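With C++17 the same compile-time decision can be spelled out with if constexpr instead of relying on the optimizer. The sketch below is not from the answer; append_code_point and append_code_point_demo are hypothetical names, and the function assumes it is handed a valid code point.

```cpp
#include <cassert>
#include <string>

// Appends one code point to a std::wstring. The branch is selected at compile
// time from the width of wchar_t: one unit when wchar_t can hold any code
// point, a UTF-16 surrogate pair otherwise.
void append_code_point(std::wstring& out, char32_t cp)
{
    if constexpr (sizeof(wchar_t) > 2) {
        out.push_back(static_cast<wchar_t>(cp));
    } else {
        if (cp > 0xFFFF) {
            cp -= 0x10000;
            out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<wchar_t>(cp));
        }
    }
}

void append_code_point_demo()
{
    std::wstring w;
    append_code_point(w, 0x1F600);      // outside the Basic Multilingual Plane
    // One unit with a 32-bit wchar_t, two units with a 16-bit wchar_t.
    assert(w.size() == (sizeof(wchar_t) > 2 ? 1u : 2u));
}
```

The generated code should be the same as with a plain if; the benefit is only that the intent is visible in the source.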
Answered by Ben Straub
There are several ways to do this, but the results depend on what the character encodings are in the string and wstring variables.
If you know the string is ASCII, you can simply use wstring's iterator constructor:
string s = "This is surely ASCII.";
wstring w(s.begin(), s.end());
If your string has some other encoding, however, you'll get very bad results. If the encoding is Unicode, you could take a look at the ICU project, which provides a cross-platform set of libraries that convert to and from all sorts of Unicode encodings.
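One way to avoid those bad results is to check that the input really is ASCII before widening it. The sketch below is an illustration only; is_ascii, widen_ascii and widen_ascii_demo are hypothetical names, and throwing on non-ASCII input is just one possible policy.

```cpp
#include <algorithm>
#include <cassert>
#include <stdexcept>
#include <string>

// True if every byte is in the 7-bit ASCII range.
bool is_ascii(const std::string& s)
{
    return std::all_of(s.begin(), s.end(),
                       [](char c) { return static_cast<unsigned char>(c) < 0x80; });
}

// Widens with the iterator constructor, but only after the ASCII check.
std::wstring widen_ascii(const std::string& s)
{
    if (!is_ascii(s))
        throw std::invalid_argument("widen_ascii: input is not pure ASCII");
    return std::wstring(s.begin(), s.end());
}

void widen_ascii_demo()
{
    assert(widen_ascii("This is surely ASCII.") == L"This is surely ASCII.");
    assert(!is_ascii("\xC3\xA9"));      // UTF-8 for U+00E9 is rejected
}
```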
If your string contains characters in a code page, then may $DEITY have mercy on your soul.
Answered by Chris Jester-Young
You can use the codecvt locale facet. There's a specific specialisation defined, codecvt<wchar_t, char, mbstate_t>, that may be of use to you, although its behaviour is system-specific and does not guarantee conversion to UTF-8 in any way.
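A minimal sketch of driving that facet by hand is shown below. The helper names widen_with_locale and widen_with_locale_demo are made up, and the demo only uses the classic "C" locale with ASCII input, because what any other locale does with non-ASCII bytes is system-specific, as noted above.

```cpp
#include <cassert>
#include <cwchar>      // std::mbstate_t
#include <locale>
#include <stdexcept>
#include <string>

// Converts narrow bytes to wide characters using the codecvt facet of loc.
// The narrow encoding is whatever that locale defines; it is not necessarily UTF-8.
std::wstring widen_with_locale(const std::string& in, const std::locale& loc)
{
    using facet_t = std::codecvt<wchar_t, char, std::mbstate_t>;
    const facet_t& cvt = std::use_facet<facet_t>(loc);

    if (in.empty())
        return std::wstring();

    // Assumption: the conversion yields at most one wide char per input byte.
    std::wstring out(in.size(), L'\0');
    std::mbstate_t state = std::mbstate_t();
    const char* from_next = nullptr;
    wchar_t* to_next = nullptr;
    const std::codecvt_base::result result =
        cvt.in(state,
               in.data(), in.data() + in.size(), from_next,
               &out[0], &out[0] + out.size(), to_next);
    if (result != std::codecvt_base::ok)
        throw std::runtime_error("widen_with_locale: conversion failed or was incomplete");

    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}

void widen_with_locale_demo()
{
    assert(widen_with_locale("hello", std::locale::classic()) == L"hello");
}
```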
Answered by Trisch
UTFConverter - check out this library. It does such a conversion, but you also need the ConvertUTF class - I've found it here
Answered by TarmoPikaro
I created my own library for UTF-8 to UTF-16/UTF-32 conversion, but decided to fork an existing project for that purpose.
https://github.com/tapika/cutf
(Originated from https://github.com/noct/cutf)
The API works with plain C as well as with C++.
The function prototypes look like this (for the full list see https://github.com/tapika/cutf/blob/master/cutf.h):
//
// Converts utf-8 string to wide version.
//
// returns target string length.
//
size_t utf8towchar(const char* s, size_t inSize, wchar_t* out, size_t bufSize);
//
// Converts wide string to utf-8 string.
//
// returns filled buffer length (not string length)
//
size_t wchartoutf8(const wchar_t* s, size_t inSize, char* out, size_t outsize);
#ifdef __cplusplus
std::wstring utf8towide(const char* s);
std::wstring utf8towide(const std::string& s);
std::string widetoutf8(const wchar_t* ws);
std::string widetoutf8(const std::wstring& ws);
#endif
Sample usage / simple test application for utf conversion testing:
#include "cutf.h"
#include <stdint.h>   // uint8_t
#include <stdio.h>    // printf
#define ok(statement) \
    if( !(statement) ) \
    { \
        printf("Failed statement: %s\n", #statement); \
        r = 1; \
    }
int simpleStringTest()
{
    const wchar_t* chineseText = L"主体";
    auto s = widetoutf8(chineseText);
    size_t r = 0;
    printf("simple string test: ");
    ok( s.length() == 6 );
    uint8_t utf8_array[] = { 0xE4, 0xB8, 0xBB, 0xE4, 0xBD, 0x93 };
    for(int i = 0; i < 6; i++)
        ok(((uint8_t)s[i]) == utf8_array[i]);
    auto ws = utf8towide(s);
    ok(ws.length() == 2);
    ok(ws == chineseText);
    if( r == 0 )
        printf("ok.\n");
    return (int)r;
}
And if this library does not satisfy your needs, feel free to open the following link:
and scroll down to the end of the page and pick up any heavier library you like.
Answered by Martin Cote
I don't think there's a portable way of doing this. C++ doesn't know the encoding of its multibyte characters.
As Chris suggested, your best bet is to play with codecvt.