UTF8 to/from wide char conversion in STL

Notice: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/148403/

Time: 2020-08-27 13:14:27  Source: igfitidea

Tags: c++, unicode, stl, utf-8, character-encoding

Asked by Vladimir Grigorov

Is it possible to convert UTF8 string in a std::string to std::wstring and vice versa in a platform independent manner? In a Windows application I would use MultiByteToWideChar and WideCharToMultiByte. However, the code is compiled for multiple OSes and I'm limited to standard C++ library.

Answered by Vladimir Grigorov

I asked this question 5 years ago. This thread was very helpful to me back then; I came to a conclusion and moved on with my project. Funnily enough, I recently needed something similar, totally unrelated to that old project. While researching possible solutions, I stumbled upon my own question :)

The solution I chose now is based on C++11. The Boost libraries that Constantin mentions in his answer are now part of the standard. If we replace std::wstring with the new string type std::u16string, then the conversions will look like this:

UTF-8 to UTF-16

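#include <codecvt>  // std::codecvt_utf8_utf16 (deprecated since C++17)
#include <locale>   // std::wstring_convert
#include <string>

std::string source;
...
std::wstring_convert<std::codecvt_utf8_utf16<char16_t>,char16_t> convert;
std::u16string dest = convert.from_bytes(source);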

UTF-16 to UTF-8

std::u16string source;
...
std::wstring_convert<std::codecvt_utf8_utf16<char16_t>,char16_t> convert;
std::string dest = convert.to_bytes(source);    
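
For reference, the two conversions can be wrapped into a pair of small helper functions. This is only a minimal sketch: the names utf8_to_utf16 and utf16_to_utf8 are made up for illustration, and since std::wstring_convert and <codecvt> were deprecated in C++17, newer compilers may emit warnings for it.

```cpp
#include <codecvt>   // std::codecvt_utf8_utf16 (deprecated since C++17)
#include <locale>    // std::wstring_convert
#include <string>

// Hypothetical helper names, not part of the standard library.
// With a default-constructed converter, both calls throw std::range_error
// when the input is not validly encoded.
std::u16string utf8_to_utf16(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    return convert.from_bytes(utf8);
}

std::string utf16_to_utf8(const std::u16string& utf16)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    return convert.to_bytes(utf16);
}
```

A code point outside the Basic Multilingual Plane, such as U+1F600, takes four bytes in UTF-8 and two char16_t units in UTF-16, so the two strings generally differ in length.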

As seen from the other answers, there are multiple approaches to the problem. That's why I refrain from picking an accepted answer.

Answered by Constantin

You can extract utf8_codecvt_facet from the Boost serialization library.

Their usage example:

  typedef wchar_t ucs4_t;

  std::locale old_locale;
  std::locale utf8_locale(old_locale,new utf8_codecvt_facet<ucs4_t>);

  // Set a New global locale
  std::locale::global(utf8_locale);

  // Send the UCS-4 data out, converting to UTF-8
  {
    std::wofstream ofs("data.ucd");
    ofs.imbue(utf8_locale);
    std::copy(ucs4_data.begin(),ucs4_data.end(),
          std::ostream_iterator<ucs4_t,ucs4_t>(ofs));
  }

  // Read the UTF-8 data back in, converting to UCS-4 on the way in
  std::vector<ucs4_t> from_file;
  {
    std::wifstream ifs("data.ucd");
    ifs.imbue(utf8_locale);
    ucs4_t item = 0;
    while (ifs >> item) from_file.push_back(item);
  }

Look for the utf8_codecvt_facet.hpp and utf8_codecvt_facet.cpp files in the Boost sources.

Answered by Mark Ransom

The problem definition explicitly states that the 8-bit character encoding is UTF-8. That makes this a trivial problem; all it requires is a little bit-twiddling to convert from one UTF spec to another.

Just look at the encodings on these Wikipedia pages for UTF-8, UTF-16, and UTF-32.

The principle is simple - go through the input and assemble a 32-bit Unicode code point according to one UTF spec, then emit the code point according to the other spec. The individual code points need no translation, as would be required with any other character encoding; that's what makes this a simple problem.

Here's a quick implementation of wchar_t to UTF-8 conversion and vice versa. It assumes that the input is already properly encoded - the old saying "Garbage in, garbage out" applies here. I believe that verifying the encoding is best done as a separate step.

std::string wchar_to_UTF8(const wchar_t * in)
{
    std::string out;
    unsigned int codepoint = 0;
    for (;  *in != 0;  ++in)
    {
        if (*in >= 0xd800 && *in <= 0xdbff)
            codepoint = ((*in - 0xd800) << 10) + 0x10000;
        else
        {
            if (*in >= 0xdc00 && *in <= 0xdfff)
                codepoint |= *in - 0xdc00;
            else
                codepoint = *in;

            if (codepoint <= 0x7f)
                out.append(1, static_cast<char>(codepoint));
            else if (codepoint <= 0x7ff)
            {
                out.append(1, static_cast<char>(0xc0 | ((codepoint >> 6) & 0x1f)));
                out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
            }
            else if (codepoint <= 0xffff)
            {
                out.append(1, static_cast<char>(0xe0 | ((codepoint >> 12) & 0x0f)));
                out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
                out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
            }
            else
            {
                out.append(1, static_cast<char>(0xf0 | ((codepoint >> 18) & 0x07)));
                out.append(1, static_cast<char>(0x80 | ((codepoint >> 12) & 0x3f)));
                out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
                out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
            }
            codepoint = 0;
        }
    }
    return out;
}

The above code works for both UTF-16 and UTF-32 input, simply because values in the range d800 through dfff are invalid code points; they indicate that you're decoding UTF-16. If you know that wchar_t is 32 bits then you could remove some code to optimize the function.
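
The surrogate arithmetic that makes this work can be checked in isolation. The sketch below is not taken from the answer's code: the helper names are illustrative, and both functions assume their arguments are already in the documented ranges.

```cpp
#include <cstdint>
#include <utility>

// Illustrative helper names; they do not come from any library.

// Combine a UTF-16 surrogate pair into one code point.
// Assumes 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF.
std::uint32_t combine_surrogates(std::uint16_t high, std::uint16_t low)
{
    return ((static_cast<std::uint32_t>(high) - 0xD800u) << 10)
         + (static_cast<std::uint32_t>(low) - 0xDC00u)
         + 0x10000u;
}

// Split a code point above 0xFFFF into a (high, low) surrogate pair.
// Assumes 0x10000 <= cp <= 0x10FFFF.
std::pair<std::uint16_t, std::uint16_t> split_into_surrogates(std::uint32_t cp)
{
    cp -= 0x10000u;  // the 0x10000 offset is removed before splitting
    return { static_cast<std::uint16_t>(0xD800u + (cp >> 10)),
             static_cast<std::uint16_t>(0xDC00u + (cp & 0x3FFu)) };
}
```

For example, U+1F600 corresponds to the surrogate pair 0xD83D 0xDE00, and the pair 0xD800 0xDC00 is the first code point above the 16-bit range, U+10000.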

std::wstring UTF8_to_wchar(const char * in)
{
    std::wstring out;
    unsigned int codepoint = 0;  // initialized so a stray continuation byte cannot read garbage
    while (*in != 0)
    {
        unsigned char ch = static_cast<unsigned char>(*in);
        if (ch <= 0x7f)
            codepoint = ch;
        else if (ch <= 0xbf)
            codepoint = (codepoint << 6) | (ch & 0x3f);
        else if (ch <= 0xdf)
            codepoint = ch & 0x1f;
        else if (ch <= 0xef)
            codepoint = ch & 0x0f;
        else
            codepoint = ch & 0x07;
        ++in;
        if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff))
        {
            if (sizeof(wchar_t) > 2)
                out.append(1, static_cast<wchar_t>(codepoint));
            else if (codepoint > 0xffff)
            {
                codepoint -= 0x10000;  // surrogates encode the offset from U+10000
                out.append(1, static_cast<wchar_t>(0xd800 + (codepoint >> 10)));
                out.append(1, static_cast<wchar_t>(0xdc00 + (codepoint & 0x03ff)));
            }
            else if (codepoint < 0xd800 || codepoint >= 0xe000)
                out.append(1, static_cast<wchar_t>(codepoint));
        }
    }
    return out;
}

Again, if you know that wchar_t is 32 bits you could remove some code from this function, but in this case it shouldn't make any difference. The expression sizeof(wchar_t) > 2 is known at compile time, so any decent compiler will recognize dead code and remove it.

Answered by Ben Straub

There are several ways to do this, but the results depend on what the character encodings are in the string and wstring variables.

If you know the string is ASCII, you can simply use wstring's iterator constructor:

string s = "This is surely ASCII.";
wstring w(s.begin(), s.end());

If your string has some other encoding, however, you'll get very bad results. If the encoding is Unicode, you could take a look at the ICU project, which provides a cross-platform set of libraries that convert to and from all sorts of Unicode encodings.
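
The iterator constructor copies each char into its own wchar_t, so a multi-byte UTF-8 sequence ends up as several unrelated wide characters. One way to guard against that is to check for pure ASCII first. The following is a minimal sketch; widen_ascii is a hypothetical helper name, not a standard function.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: widen a string only if every byte is 7-bit ASCII.
std::wstring widen_ascii(const std::string& s)
{
    for (unsigned char ch : s)
        if (ch > 0x7f)
            throw std::invalid_argument("widen_ascii: input is not pure ASCII");

    // Safe now: every ASCII byte maps to the wide character of the same value.
    return std::wstring(s.begin(), s.end());
}
```

Throwing is just one possible policy here; returning an empty string or a status flag would work equally well, depending on the caller.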

If your string contains characters in a code page, then may $DEITY have mercy on your soul.

Answered by vharron

ConvertUTF.h ConvertUTF.c

Credit to bames53 for providing updated versions

Answered by Chris Jester-Young

You can use the codecvt locale facet. There's a specific specialisation defined, codecvt<wchar_t, char, mbstate_t>, that may be of use to you, although its behaviour is system-specific and does not guarantee conversion to UTF-8 in any way.
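
As a rough sketch of what using that facet looks like: the helper below (widen_with_locale is a made-up name) decodes a narrow string with whatever encoding the given locale uses. It is only a UTF-8 decoder if that locale happens to use UTF-8, which is exactly the system-specific part.

```cpp
#include <cstddef>
#include <cwchar>     // std::mbstate_t
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode `in` using the narrow encoding of `loc`.
std::wstring widen_with_locale(const std::string& in, const std::locale& loc)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    const facet_type& cvt = std::use_facet<facet_type>(loc);

    if (in.empty())
        return std::wstring();

    // Assumption: the encoding yields at most one wchar_t per input byte.
    std::wstring out(in.size(), L'\0');
    std::mbstate_t state = std::mbstate_t();
    const char* from_next = 0;
    wchar_t* to_next = 0;

    std::codecvt_base::result res = cvt.in(
        state,
        in.data(), in.data() + in.size(), from_next,
        &out[0], &out[0] + out.size(), to_next);

    if (res != std::codecvt_base::ok)
        throw std::runtime_error("widen_with_locale: conversion failed");

    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

With std::locale::classic() this handles plain ASCII. Getting UTF-8 decoding requires constructing a locale whose name is itself platform-dependent, so the portability problem moves to the locale name rather than going away.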

Answered by Trisch

UTFConverter - check out this library. It does such a conversion, but you also need the ConvertUTF class - I've found it here

Answered by TarmoPikaro

Created my own library for utf-8 to utf-16/utf-32 conversion - but decided to make a fork of an existing project for that purpose.

https://github.com/tapika/cutf

(Originated from https://github.com/noct/cutf)

API works with plain C as well as with C++.

Function prototypes look like this (for the full list see https://github.com/tapika/cutf/blob/master/cutf.h):

//
//  Converts utf-8 string to wide version.
//
//  returns target string length.
//
size_t utf8towchar(const char* s, size_t inSize, wchar_t* out, size_t bufSize);

//
//  Converts wide string to utf-8 string.
//
//  returns filled buffer length (not string length)
//
size_t wchartoutf8(const wchar_t* s, size_t inSize, char* out, size_t outsize);

#ifdef __cplusplus

std::wstring utf8towide(const char* s);
std::wstring utf8towide(const std::string& s);
std::string  widetoutf8(const wchar_t* ws);
std::string  widetoutf8(const std::wstring& ws);

#endif

Sample usage / simple test application for utf conversion testing:

#include <cstdint>  // uint8_t
#include <cstdio>   // printf
#include "cutf.h"

#define ok(statement)                                       \
    if( !(statement) )                                      \
    {                                                       \
        printf("Failed statement: %s\n", #statement);       \
        r = 1;                                              \
    }

int simpleStringTest()
{
    const wchar_t* chineseText = L"主体";
    auto s = widetoutf8(chineseText);
    size_t r = 0;

    printf("simple string test:  ");

    ok( s.length() == 6 );
    uint8_t utf8_array[] = { 0xE4, 0xB8, 0xBB, 0xE4, 0xBD, 0x93 };

    for(int i = 0; i < 6; i++)
        ok(((uint8_t)s[i]) == utf8_array[i]);

    auto ws = utf8towide(s);
    ok(ws.length() == 2);
    ok(ws == chineseText);

    if( r == 0 )
        printf("ok.\n");

    return (int)r;
}

And if this library does not satisfy your needs - feel free to open the following link:

http://utf8everywhere.org/

and scroll down to the end of the page and pick any heavier library you like.

Answered by Martin Cote

I don't think there's a portable way of doing this. C++ doesn't know the encoding of its multibyte characters.

As Chris suggested, your best bet is to play with codecvt.
