Unicode encoding of string literals in C++11

Question

Asked by Kerrek SB

Following a related question, I'd like to ask about the new character and string literal types in C++11. It seems that we now have four sorts of characters and five sorts of string literals. The character types:

char     a =  '\x30';         // character, no semantics
wchar_t  b = L'\xFFEF';       // wide character, no semantics
char16_t c = u'\u00F6';       // 16-bit, assumed UTF16?
char32_t d = U'\U0010FFFF';   // 32-bit, assumed UCS-4

And the string literals:

char     A[] =  "Hello\x0A";         // byte string, "narrow encoding"
wchar_t  B[] = L"Hell\xF6\x0A";      // wide string, impl-def'd encoding
char16_t C[] = u"Hell\u00F6";        // (1)
char32_t D[] = U"Hell\U000000F6\U0010FFFF"; // (2)
char     E[] = u8"\u00F6\U0010FFFF"; // (3), array of char in C++11 ("auto E[]" is not valid)

The question is this: Are the \x/\u/\U character references freely combinable with all string types? Are all the string types fixed-width, i.e. the arrays contain precisely as many elements as appear in the literal, or do \x/\u/\U references get expanded into a variable number of bytes? Do u"" and u8"" strings have encoding semantics, e.g. can I say char16_t x[] = u"\U0010FFFF", and the non-BMP codepoint gets encoded into a two-unit UTF16 sequence? And similarly for u8? In (1), can I write lone surrogates with \u? Finally, are any of the string functions encoding aware (i.e. they are character-aware and can detect invalid byte sequences)?

This is a bit of an open-ended question, but I'd like to get as complete a picture as possible of the new UTF-encoding and type facilities of the new C++11.

Answer 1

Accepted answer by Nicol Bolas

Are the \x/\u/\U character references freely combinable with all string types?

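Not quite. \x can be used in anything. \u and \U are accepted in every kind of string literal, but only the UTF-encoded ones (u8"", u"" and U"") guarantee how the code point is encoded; in a narrow or wide literal the result depends on the implementation's execution character set. Within any UTF-encoded string, \u and \U can be used as you see fit.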

Are all the string types fixed-width, i.e. the arrays contain precisely as many elements as appear in the literal, or do \x/\u/\U references get expanded into a variable number of bytes?

Not in the way you mean. \u and \U are converted based on the string encoding, while \x sets the value of a single code unit directly. The number of "code units" (a Unicode term; a char16_t is a UTF-16 code unit) produced for a \u or \U reference depends on the encoding of the containing string. The literal u8"\u1024" would create a string containing 3 chars plus a null terminator, because U+1024 lies above U+07FF and so needs three bytes in UTF-8. The literal u"\u1024" would create a string containing 1 char16_t plus a null terminator.

The number of code units used is based on the Unicode encoding.
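
The element counts can be checked at compile time. The following is a minimal sketch, not part of the original answer; array_len is a hypothetical helper introduced only for this illustration.

```cpp
#include <cstddef>

// Hypothetical helper: number of elements in a built-in array,
// including the null terminator of a string literal.
template <typename T, std::size_t N>
constexpr std::size_t array_len(const T (&)[N]) { return N; }

// U+1024 lies between U+0800 and U+FFFF, so UTF-8 needs 3 code units.
static_assert(array_len(u8"\u1024") == 4, "3 UTF-8 code units + terminator");

// U+1024 is in the BMP, so UTF-16 needs 1 code unit.
static_assert(array_len(u"\u1024") == 2, "1 UTF-16 code unit + terminator");

// UTF-32 always uses 1 code unit per code point.
static_assert(array_len(U"\u1024") == 2, "1 UTF-32 code unit + terminator");
```

If any of these counts were wrong, the translation unit would fail to compile, so the asserts double as documentation of the array sizes.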

Do u"" and u8"" strings have encoding semantics, e.g. can I say char16_t x[] = u"\U0010FFFF", and the non-BMP codepoint gets encoded into a two-unit UTF16 sequence?

u"" creates a UTF-16 encoded string. u8"" creates a UTF-8 encoded string. They will be encoded per the Unicode specification, so yes, a non-BMP codepoint in a u"" literal becomes a two-unit surrogate pair.
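
For the example from the question, the encoding can be inspected at compile time. This is a small sketch, assuming a conforming C++11 compiler; utf16_max is a name made up for the illustration.

```cpp
// U+10FFFF is outside the BMP, so UTF-16 encodes it as a surrogate pair.
constexpr char16_t utf16_max[] = u"\U0010FFFF";

static_assert(sizeof(utf16_max) / sizeof(char16_t) == 3,
              "2 code units + terminator");
static_assert(utf16_max[0] == 0xDBFF, "high (lead) surrogate");
static_assert(utf16_max[1] == 0xDFFF, "low (trail) surrogate");

// The same code point takes four UTF-8 code units.
static_assert(sizeof(u8"\U0010FFFF") == 5, "4 code units + terminator");
```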

In (1), can I write lone surrogates with \u?

Absolutely not. The specification expressly forbids using the UTF-16 surrogate code points (0xD800-0xDFFF) as codepoints for \u or \U.
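
The restriction applies to \u and \U only. As a hedged illustration that goes slightly beyond the answer: a hexadecimal escape sets a code unit directly, so it can still place an unpaired surrogate in a char16_t array, at the cost of the array no longer being well-formed UTF-16. The name lone is hypothetical.

```cpp
// char16_t bad[] = u"\uD800";  // ill-formed: \u may not name a surrogate

// \x writes the code unit value as given, so this compiles,
// but the contents are not valid UTF-16 text.
constexpr char16_t lone[] = u"\xD800";

static_assert(sizeof(lone) / sizeof(char16_t) == 2,
              "1 code unit + terminator");
static_assert(lone[0] == 0xD800, "unpaired high surrogate");
```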

Finally, are any of the string functions encoding aware (i.e. they are character-aware and can detect invalid byte sequences)?

Absolutely not. Well, allow me to rephrase that.

std::basic_string doesn't deal with Unicode encodings. It certainly can store UTF-encoded strings. But it can only think of them as sequences of char, char16_t, or char32_t; it can't think of them as a sequence of Unicode codepoints that are encoded with a particular mechanism. basic_string::length() will return the number of code units, not code points. And obviously, the C standard library string functions are totally useless for this.
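
To make the difference between code units and code points concrete, here is a minimal sketch. count_code_points is a hypothetical helper, not a standard library function, and it assumes the input is well-formed UTF-16.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in well-formed UTF-16.
// Every code unit except a low (trail) surrogate starts a new code point.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t unit : s) {
        if (unit < 0xDC00 || unit > 0xDFFF) {
            ++n;
        }
    }
    return n;
}
```

For std::u16string s = u"a\U0010FFFF", s.length() is 3 (one unit for 'a', two for the surrogate pair) while count_code_points(s) is 2. The helper does no validation, so a lone surrogate would still be counted or skipped by the same rule.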

It should be noted however that "length" for a Unicode string does not mean the number of codepoints. Some code points are combining "characters" (an unfortunate name), which combine with the previous codepoint. So multiple codepoints can map to a single visual character.
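
A short sketch of that point, using UTF-32 so that one array element is exactly one code point. The variable names are made up for the illustration.

```cpp
#include <string>

// "é" written two ways. Both normally render as one visual character.

// U+00E9 LATIN SMALL LETTER E WITH ACUTE: one code point.
const std::u32string precomposed = U"\u00E9";

// 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code points.
const std::u32string decomposed = U"e\u0301";
```

Here precomposed.length() is 1 and decomposed.length() is 2, and the two strings compare unequal with operator==, even though a reader would see the same character. Treating them as equal requires Unicode normalization, which the standard library does not provide.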

Iostreams can in fact read/write Unicode-encoded values. To do so, you will have to use a locale to specify the encoding and properly imbue it into the various places. This is easier said than done, and I don't have any code on me to show you how.
