Note: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/23264818/

Time: 2020-08-28 00:23:13. Source: igfitidea.

Storing unicode UTF-8 string in std::string

c++ windows unicode utf-8 stdstring

Asked by Pritesh Acharya

In response to the discussions in:

Cross-platform strings (and Unicode) in C++

How to deal with Unicode strings in C/C++ in a cross-platform friendly way?

I'm trying to assign a UTF-8 string to a std::string variable in a Visual Studio 2010 environment:

std::string msg = "महसुस";

However, when I view the string in the debugger, I only see "?????". I have the file saved as Unicode (UTF-8 with signature), and I'm using the "Use Unicode Character Set" project setting.

"महसुस" is Nepali text; it contains 5 characters and occupies 15 bytes in UTF-8. But the Visual Studio debugger shows the size of msg as 5.

My question is:

How do I use std::string to just store the UTF-8 text without needing to manipulate it?

Answered by Remy Lebeau

If you were using C++11 then this would be easy:

std::string msg = u8"महसुस";

But since you are not, you can use escape sequences instead of relying on the source file's charset to manage the encoding for you. This also makes your code more portable (in case you accidentally save the file in a non-UTF-8 format):

std::string msg = "\xE0\xA4\xAE\xE0\xA4\xB9\xE0\xA4\xB8\xE0\xA5\x81\xE0\xA4\xB8"; // "महसुस"
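
As a rough check of the numbers in the question, the sketch below counts bytes and code points in that escaped literal. The helper names countCodePoints and sampleMessage are made up for illustration, and the counter assumes its input is already valid UTF-8 (it does no validation).

```cpp
#include <cstddef>
#include <string>

// Counts UTF-8 code points by skipping continuation bytes (bit pattern 10xxxxxx).
// Assumption: the input is already valid UTF-8; nothing is validated here.
std::size_t countCodePoints(const std::string &utf8)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i)
    {
        const unsigned char byte = static_cast<unsigned char>(utf8[i]);
        if ((byte & 0xC0) != 0x80) // lead byte or ASCII byte starts a new code point
            ++count;
    }
    return count;
}

// Hypothetical helper that returns the escaped literal from the answer.
std::string sampleMessage()
{
    return "\xE0\xA4\xAE\xE0\xA4\xB9\xE0\xA4\xB8\xE0\xA5\x81\xE0\xA4\xB8";
}
```

With these definitions, sampleMessage().size() is 15 and countCodePoints(sampleMessage()) is 5, so a correctly stored std::string reports the byte count from size(), not the character count.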

Otherwise, you might consider doing a conversion at runtime instead:

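#include <windows.h>
#include <string>

// Converts a UTF-16 std::wstring to a UTF-8 std::string (Windows only).
std::string toUtf8(const std::wstring &str)
{
    std::string ret;
    const int srcLen = static_cast<int>(str.length());
    // First call: ask how many bytes the UTF-8 output needs.
    int len = WideCharToMultiByte(CP_UTF8, 0, str.c_str(), srcLen, NULL, 0, NULL, NULL);
    if (len > 0)
    {
        ret.resize(len);
        // Second call: write the UTF-8 bytes into the string's buffer.
        WideCharToMultiByte(CP_UTF8, 0, str.c_str(), srcLen, &ret[0], len, NULL, NULL);
    }
    return ret;
}

std::string msg = toUtf8(L"महसुस");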
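
The function above depends on the Windows API. For comparison, here is a minimal portable sketch that builds the same bytes by hand from Unicode code points. It is a different technique from WideCharToMultiByte, not a drop-in replacement: the names appendUtf8 and encodeUtf8 are hypothetical, the input is an array of code points rather than a std::wstring, and surrogates or values above U+10FFFF are not rejected.

```cpp
#include <cstddef>
#include <string>

// Appends the UTF-8 encoding of one Unicode code point to `out`.
// Hypothetical helper: it does not reject surrogates or values above U+10FFFF.
void appendUtf8(std::string &out, unsigned long cp)
{
    if (cp < 0x80)
    {
        out += static_cast<char>(cp);                         // 1 byte: 0xxxxxxx
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));           // 2 bytes: 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        out += static_cast<char>(0xE0 | (cp >> 12));          // 3 bytes: 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char>(0xF0 | (cp >> 18));          // 4 bytes: 11110xxx
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Encodes an array of code points into one UTF-8 std::string.
std::string encodeUtf8(const unsigned long *codePoints, std::size_t count)
{
    std::string out;
    for (std::size_t i = 0; i < count; ++i)
        appendUtf8(out, codePoints[i]);
    return out;
}
```

Passing the five code points U+092E, U+0939, U+0938, U+0941, U+0938 to encodeUtf8 produces the same 15 bytes as the escaped literal shown earlier in this answer.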

Answered by Sergey K.

You can write msg.c_str(), s8 in the Watch window to see the UTF-8 string correctly.

Answered by James Kanze

If you have C++11, you can write u8"महसुस". Otherwise, you'll have to write the actual byte sequence, using \xxx for each byte in the UTF-8 sequence.

Typically, you're better off reading such text from a configuration file.
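
One way to follow that advice is to read the file's raw bytes straight into a std::string, so the compiler's source charset never gets involved. The sketch below is only an illustration: the function name readUtf8File is hypothetical, the file is assumed to already be UTF-8, and an unreadable file simply yields an empty string.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Reads a whole file into a std::string without any charset conversion.
// If the file starts with the UTF-8 signature (EF BB BF), the signature is dropped.
// Returns an empty string if the file cannot be opened.
std::string readUtf8File(const char *path)
{
    std::ifstream in(path, std::ios::in | std::ios::binary);
    if (!in)
        return std::string();

    std::ostringstream buffer;
    buffer << in.rdbuf();            // copy the raw bytes as-is
    std::string text = buffer.str();

    if (text.compare(0, 3, "\xEF\xBB\xBF") == 0)
        text.erase(0, 3);            // strip the optional BOM
    return text;
}
```

A call such as readUtf8File("messages.txt") (a made-up file name) would then return the text as UTF-8 bytes, ready to be stored or passed along unchanged.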

Answered by raymai97

If you set the system locale to English, and the file is in UTF-8 without BOM, VC will let you store the string as-is. I have written an article about this here.

Answered by DNamto

There is a way to display the right values, thanks to the 's8' format specifier. If we append ',s8' to a variable name, Visual Studio reparses the text as UTF-8 and renders it correctly.

If you are using Microsoft Visual Studio 2008 Service Pack 1, you need to apply this hotfix:

http://support.microsoft.com/kb/980263