C++ and Boost: Encoding/Decoding UTF-8

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/6140223/

Date: 2020-08-28 19:33:04  Source: igfitidea

C++ & Boost: encode/decode UTF-8

c++ boost unicode utf-8

Asked by sebulba

I'm trying to do a very simple task: take a unicode-aware wstring and convert it to a string, encoded as UTF-8 bytes, and then the other way around: take a string containing UTF-8 bytes and convert it to a unicode-aware wstring.

The problem is, I need it to be cross-platform and I need it to work with Boost... and I just can't seem to figure out a way to make it work. I've been toying with

Trying to convert the code to use stringstream/wstringstream instead of files or whatever, but nothing seems to work.

For instance, in Python it would look like so:

>>> u"שלום"
u'\u05e9\u05dc\u05d5\u05dd'
>>> u"שלום".encode("utf8")
'\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'
>>> '\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'.decode("utf8")
u'\u05e9\u05dc\u05d5\u05dd'

What I'm ultimately after is this:

wchar_t uchars[] = {0x5e9, 0x5dc, 0x5d5, 0x5dd, 0};
wstring ws(uchars);
string s = encode_utf8(ws); 
// s now holds "\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d"
wstring ws2 = decode_utf8(s);
// ws2 now holds {0x5e9, 0x5dc, 0x5d5, 0x5dd}
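
For reference, the byte sequence above follows directly from the UTF-8 encoding rules. Below is a minimal hand-written sketch of the encoding direction using only the standard library. The names append_utf8, encode_code_points and self_check are hypothetical and do not come from Boost or any library mentioned in this thread, and the sketch takes a std::u32string rather than a wstring so that it does not depend on the platform's wchar_t width.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: append the UTF-8 encoding of one code point to `out`.
inline void append_utf8(std::uint32_t cp, std::string& out) {
    if (cp >= 0xD800 && cp <= 0xDFFF)
        throw std::invalid_argument("surrogate code point");
    if (cp < 0x80) {                      // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {              // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {          // 4 bytes: 11110xxx then three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        throw std::invalid_argument("code point out of range");
    }
}

// Hypothetical helper: encode a whole sequence of code points as UTF-8.
inline std::string encode_code_points(const std::u32string& cps) {
    std::string out;
    for (char32_t cp : cps) append_utf8(cp, out);
    return out;
}

// Usage: the four Hebrew code points from the question.
inline void self_check() {
    assert(encode_code_points(U"\u05e9\u05dc\u05d5\u05dd") ==
           "\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d");
}
```

Each of the four code points lies between 0x80 and 0x7FF, so each takes the two-byte branch, which is why the result is eight bytes long.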

I really don't want to add another dependency on ICU or something in that spirit... but to my understanding, it should be possible with Boost.

Some sample code would be greatly appreciated! Thanks

Answered by sebulba

Thanks everyone, but ultimately I resorted to http://utfcpp.sourceforge.net/ -- it's a header-only library that's very lightweight and easy to use. I'm sharing some demo code here, should anyone find it useful:

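#include <iterator>  // std::back_inserter
#include <string>
#include "utf8.h"    // UTF8-CPP header

// Note: utf8to32/utf32to8 assume a 32-bit wchar_t (e.g. Linux). Where wchar_t
// is 16 bits (e.g. Windows), utf8::utf8to16/utf8::utf16to8 are the matching calls.
inline void decode_utf8(const std::string& bytes, std::wstring& wstr)
{
    utf8::utf8to32(bytes.begin(), bytes.end(), std::back_inserter(wstr));
}
inline void encode_utf8(const std::wstring& wstr, std::string& bytes)
{
    utf8::utf32to8(wstr.begin(), wstr.end(), std::back_inserter(bytes));
}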

Usage:

wstring ws(L"\u05e9\u05dc\u05d5\u05dd");
string s;
encode_utf8(ws, s);

Answered by Cubbi

There's already a Boost link in the comments, but in the almost-standard C++0x, there is wstring_convert, which does this:

#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
int main()
{
    wchar_t uchars[] = {0x5e9, 0x5dc, 0x5d5, 0x5dd, 0};
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    std::string s = conv.to_bytes(uchars);
    std::wstring ws2 = conv.from_bytes(s);
    std::cout << std::boolalpha
              << (s == "\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d" ) << '\n'
              << (ws2 == uchars ) << '\n';
}

Output when compiled with MS Visual Studio 2010 EE SP1 or with Clang++ 2.9:

true 
true
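
The same converter can be wrapped into the two functions the question asks for. This is only a sketch: encode_utf8 and decode_utf8 are the question's hypothetical names, self_check is an illustrative helper, and std::codecvt_utf8<wchar_t> converts between UTF-8 and UCS-2 or UCS-4 depending on the platform's wchar_t width, so code points outside the Basic Multilingual Plane may not survive where wchar_t is 16 bits. Note also that std::wstring_convert and <codecvt> were deprecated in C++17, so newer compilers may warn.

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical wrapper: wide string -> UTF-8 bytes.
inline std::string encode_utf8(const std::wstring& ws) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.to_bytes(ws);
}

// Hypothetical wrapper: UTF-8 bytes -> wide string.
// from_bytes throws std::range_error if the input is not valid UTF-8.
inline std::wstring decode_utf8(const std::string& s) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.from_bytes(s);
}

// Usage: round-trip the four Hebrew code points from the question.
inline void self_check() {
    const wchar_t uchars[] = {0x5e9, 0x5dc, 0x5d5, 0x5dd, 0};
    const std::wstring ws(uchars);
    const std::string s = encode_utf8(ws);
    assert(s == "\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d");
    assert(decode_utf8(s) == ws);
}
```

A fresh converter is created per call because std::wstring_convert keeps conversion state between calls; constructing it locally keeps the helpers stateless at the cost of a small setup overhead.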

Answered by Diaa Sami

Boost.Locale was released in Boost 1.48 (November 15th, 2011), making it easier to convert to and from UTF-8/UTF-16.

Here are some convenient examples from the docs:

string utf8_string = to_utf<char>(latin1_string,"Latin1");
wstring wide_string = to_utf<wchar_t>(latin1_string,"Latin1");
string latin1_string = from_utf(wide_string,"Latin1");
string utf8_string2 = utf_to_utf<char>(wide_string);

Almost as easy as Python encoding/decoding :)

Note that Boost.Locale is not a header-only library.

Answered by Jakob Riedle

For a drop-in replacement for std::string/std::wstring that handles UTF-8, see TINYUTF8.

In combination with <codecvt>, you can convert between UTF-8 and pretty much every other encoding, and then handle the UTF-8 data through the above library.