Unicode and std::string in C++

Question

Asked by Oystein

If I write a random string to file in C++ consisting of some unicode characters, I am told by my text editor that I have not created a valid UTF-8 file.

// Code example
const std::string charset = "abcdefgàèíü?à";
file << random_string(charset); // using std::fstream

What can I do to solve this? Do I have to do lots of additional manual encoding? The way I understand it, std::string does not care about the encoding, only the bytes, so when I pass it a unicode string and write it to file, surely that file should contain the same bytes and be recognized as a UTF-8 encoded file?

Answer 1

Answered by Fred Foo

random_string is likely to be the culprit; I wonder how it's implemented. If your string is indeed UTF-8-encoded and random_string looks like

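#include <cstdlib>  // rand
#include <string>

std::string random_string(std::string const &charset)
{
    const int N = 10;
    std::string result(N, '\0');  // std::string has no size-only constructor
    for (int i = 0; i < N; i++)
        result[i] = charset[rand() % charset.size()];  // picks a random byte, not a character
    return result;
}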

then it will take random chars from charset, which in UTF-8 (as other posters have pointed out) are not Unicode code points but single bytes. If it picks a byte from the middle of a UTF-8 multibyte character and uses it as the first byte of the output (or places it after a 7-bit ASCII character), the output will not be valid UTF-8. See Wikipedia and RFC 3629.
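
To make the failure mode concrete, here is a minimal sketch of a structural UTF-8 check. The function name looks_like_utf8 is hypothetical, and the check is deliberately simplified: it only verifies that each lead byte is followed by the right number of continuation bytes, and does not reject overlong encodings or surrogates as a full RFC 3629 validator would.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if every lead byte in `s` is followed by
// the expected number of continuation bytes (10xxxxxx).
// Simplified on purpose: overlong forms and surrogates are not rejected.
bool looks_like_utf8(const std::string& s)
{
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra;
        if (c < 0x80)                extra = 0;  // 0xxxxxxx: ASCII
        else if ((c & 0xE0) == 0xC0) extra = 1;  // 110xxxxx: 2-byte sequence
        else if ((c & 0xF0) == 0xE0) extra = 2;  // 1110xxxx: 3-byte sequence
        else if ((c & 0xF8) == 0xF0) extra = 3;  // 11110xxx: 4-byte sequence
        else return false;  // stray continuation byte or invalid lead byte

        if (extra > s.size() - i - 1) return false;  // sequence is cut off
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;  // not a continuation byte
        }
        i += extra + 1;
    }
    return true;
}
```

With this check, "\xC3\xA0" (the UTF-8 bytes of "à") passes, while either byte on its own fails, which is the situation a byte-wise random pick can produce.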

The solution might be to transform to and from UTF-32 in random_string. I believe wchar_t and std::wstring use UTF-32 on Linux. UTF-16 would also be safe, as long as you stay within the Basic Multilingual Plane.
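
One way to sketch the UTF-32 round trip, assuming C++11 and a charset that is already valid UTF-8. The helpers decode_utf8 and append_utf8, and the extra parameters n and rng, are assumptions made for this illustration; they are not part of the original random_string, and no error handling for malformed input is attempted.

```cpp
#include <cstddef>
#include <random>
#include <string>

// Decode UTF-8 into code points. Assumes `in` is valid UTF-8.
std::u32string decode_utf8(const std::string& in)
{
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        const std::size_t len = c < 0x80 ? 1
                              : (c & 0xE0) == 0xC0 ? 2
                              : (c & 0xF0) == 0xE0 ? 3 : 4;
        // Keep only the payload bits of the lead byte.
        char32_t cp = len == 1 ? c : c & (0xFF >> (len + 1));
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Append one code point to `out` as UTF-8.
void append_utf8(std::string& out, char32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Pick n random whole characters (code points) instead of random bytes.
std::string random_string(const std::string& charset_utf8, std::size_t n,
                          std::mt19937& rng)
{
    const std::u32string alphabet = decode_utf8(charset_utf8);
    if (alphabet.empty()) return std::string();
    std::uniform_int_distribution<std::size_t> pick(0, alphabet.size() - 1);
    std::string result;
    for (std::size_t i = 0; i < n; ++i)
        append_utf8(result, alphabet[pick(rng)]);
    return result;
}
```

Because each pick is a complete code point that is re-encoded before it is appended, the result stays valid UTF-8 whenever the charset was. Note that this works per code point, so a character built from a base letter plus a combining mark would still be split.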

Answer 2

Answered by Charles Salvia

What can I do to solve this? Do I have to do lots of additional manual encoding? The way I understand it, std::string does not care about the encoding, only the bytes, so when I pass it a unicode string and write it to file, surely that file should contain the same bytes and be recognized as a UTF-8 encoded file?

You are correct that std::string is encoding-agnostic. It simply holds an array of char elements. How these char elements are interpreted as text depends on the environment. If your locale is not set to some form of Unicode (e.g. UTF-8 or UTF-16), then when you output a string it will not be displayed/interpreted as Unicode.

Are you sure your string literal "abcdefgàèíü?à" is actually Unicode and not, for example, Latin-1 (ISO-8859-1, or possibly Windows-1252)? You need to determine what locale your platform is currently configured to use.

-----------EDIT-----------

I think I know your problem: some of the Unicode characters in your charset string literal, like the accented character "à", are two-byte characters (assuming a UTF-8 encoding). When you index the character-set string with the [] operator in your random_string function, you are returning half of a Unicode character. Thus the random_string function creates an invalid character string.

For example, consider the following code:

std::string s = "à";
std::cout << s.length() << std::endl;

In an environment where the string literal is interpreted as UTF-8, this program will output 2. Therefore, the first char of the string (s[0]) is only half of a Unicode character and is not valid on its own. Since your random_string function addresses the string one byte at a time with the [] operator, you're creating invalid random strings.
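
The gap between bytes and characters can be measured directly. The following is a small sketch, assuming the input is valid UTF-8; count_code_points is a hypothetical name. It relies on the fact that every UTF-8 continuation byte has the bit pattern 10xxxxxx, so counting the bytes that do not match that pattern counts the code points.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of Unicode code points in a valid UTF-8 string.
// Continuation bytes (10xxxxxx) are skipped; every other byte starts a
// new code point.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (unsigned char c : utf8)
        if ((c & 0xC0) != 0x80)
            ++n;
    return n;
}
```

For the string holding the two UTF-8 bytes of "à", length() reports 2 while count_code_points reports 1.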

So yes, you need to use std::wstring, and create your charset string literal using the L prefix.
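
A minimal sketch of what that could look like. The function name random_wstring and the parameters n and rng are assumptions for this example. It also assumes every character in the charset fits in a single wchar_t, which holds for the accented Latin letters here but not for characters outside the Basic Multilingual Plane on platforms where wchar_t is 16 bits.

```cpp
#include <cstddef>
#include <random>
#include <string>

// Hypothetical wide-character variant: each element of `charset` is a whole
// character (under the assumption stated above), so indexing cannot split one.
std::wstring random_wstring(const std::wstring& charset, std::size_t n,
                            std::mt19937& rng)
{
    if (charset.empty()) return std::wstring();
    std::uniform_int_distribution<std::size_t> pick(0, charset.size() - 1);
    std::wstring result;
    result.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        result += charset[pick(rng)];
    return result;
}
```

A charset for it could be written as L"abcdefg\u00E0\u00E8\u00ED\u00FC", using universal character names so the literal does not depend on the source file's encoding. The wide string still has to be converted to UTF-8 when it is written to the file.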

Answer 3

Answered by Diego Sevilla

In your code sample, the std::string charset stores what you write. That is, if you used a UTF-8 text editor to write this, what you get in the output file is exactly that UTF-8 text.

UTF-8 is just an encoding scheme in which different characters take different numbers of bytes. If you use a UTF-8 editor, it will encode, say, '?' with two bytes, and when you write it to the file, the file will contain those same two bytes (again valid UTF-8).

The problem may be the editor you used to create the source C++ file. It may use latin1 or some other encoding.
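
One way to check which encoding ended up in the compiled literal is to dump its bytes. This is a small sketch; to_hex is a hypothetical helper. For "à", a UTF-8 source file yields the two bytes C3 A0, while a Latin-1 source file yields the single byte E0.

```cpp
#include <string>

// Hypothetical helper: render each byte of `s` as two uppercase hex digits,
// separated by spaces.
std::string to_hex(const std::string& s)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : s) {
        if (!out.empty()) out += ' ';
        out += digits[c >> 4];    // high nibble
        out += digits[c & 0x0F];  // low nibble
    }
    return out;
}
```

Printing to_hex(charset) for the literal from the question would show whether the accented letters were stored as two-byte UTF-8 sequences or as single bytes in some 8-bit encoding.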

Answer 4

Answered by Marcelo Cantos

To write UTF-8, you need to use a codecvt facet like this one. An example of how to use it can be seen here.
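
The facet and example linked in the original answer are not reproduced here. As a stand-in, the following sketch uses the standard library's std::codecvt_utf8 from <codecvt>, which is an assumption on my part rather than the facet the answer referred to. That header is deprecated since C++17, so compilers may warn about it. The function name write_utf8_file is hypothetical.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper: write a wide string to `path` encoded as UTF-8.
// The stream's locale is given a codecvt facet that converts wchar_t
// to UTF-8 bytes on output. Returns false if opening or writing failed.
bool write_utf8_file(const std::string& path, const std::wstring& text)
{
    std::wofstream out;
    // Imbue before opening so the conversion applies from the first byte.
    // The locale takes ownership of the facet allocated with new.
    out.imbue(std::locale(out.getloc(), new std::codecvt_utf8<wchar_t>));
    out.open(path.c_str(), std::ios::binary);
    if (!out) return false;
    out << text;
    out.close();
    return !out.fail();
}
```

Calling write_utf8_file("out.txt", L"\u00E0") should produce a file containing the two bytes C3 A0. The file name is only an example.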
