Notice: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/13444930/

Time: 2020-08-27 17:19:46  Source: igfitidea

Is the u8 string literal necessary in C++11

Tags: c++, utf-8, c++11, literals, string-literals

Asked by Lukas Schmelzeisen

From Wikipedia:

For the purpose of enhancing support for Unicode in C++ compilers, the definition of the type char has been modified to be at least the size necessary to store an eight-bit coding of UTF-8.

I'm wondering what exactly this means for writing portable applications. Is there any difference between writing this

const char str[] = "Test String";

or this?

const char str[] = u8"Test String";

Is there any reason not to use the latter for every string literal in your code?

What happens when there are non-ASCII characters inside the test string?

Answered by Kerrek SB

The encoding of "Test String" is the implementation-defined system encoding (the narrow, possibly multibyte one).

The encoding of u8"Test String" is always UTF-8.

The examples aren't terribly telling. If you included some Unicode literals (such as \U0010FFFF) in the string, then with u8 you would always get those (encoded as UTF-8), but whether they could be expressed in the system-encoded string, and if so what their value would be, is implementation-defined.

If it helps, imagine you're authoring the source code on an EBCDIC machine. Then the literal "Test String" is always EBCDIC-encoded in the source file itself, but the u8-initialized array contains UTF-8 encoded values, whereas the first array contains EBCDIC-encoded values.
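
As a rough illustration of the point above, the following sketch dumps the bytes stored for a literal. The helper name hex_bytes is hypothetical (it is not part of the original answer), and it is templated on the character type only so that it also compiles under C++20, where u8 literals have element type char8_t.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (an assumption for illustration): renders the bytes of a
// string literal as space-separated uppercase hex, skipping the terminating null.
// CharT is a template parameter so that both char arrays (C++11 to C++17) and
// char8_t arrays (C++20 u8 literals) are accepted.
template <typename CharT, std::size_t N>
std::string hex_bytes(const CharT (&literal)[N]) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; i + 1 < N; ++i) {
        const unsigned char byte = static_cast<unsigned char>(literal[i]);
        if (!out.empty()) out += ' ';
        out += digits[byte >> 4];    // high nibble
        out += digits[byte & 0x0F];  // low nibble
    }
    return out;
}
```

With this helper, hex_bytes(u8"\U0010FFFF") should give "F4 8F BF BF" on any conforming implementation, whereas the result of hex_bytes("\U0010FFFF"), if it compiles at all, depends on the execution character set the compiler was configured with.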

Answered by Cheers and hth. - Alf

You quote Wikipedia:

For the purpose of enhancing support for Unicode in C++ compilers, the definition of the type char has been modified to be at least the size necessary to store an eight-bit coding of UTF-8.

Well, the “For the purpose of” is not true. char has always been guaranteed to be at least 8 bits, that is, CHAR_BIT has always been required to be ≥8, due to the range required for char in the C standard, which is (quote C++11 §17.5.1.5/1) “incorporated” into the C++ standard.
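
The guarantee can be spelled out as a compile-time check. This is only a minimal sketch; it should hold on any conforming implementation, whether or not C++11 Unicode support is involved.

```cpp
#include <climits>

// CHAR_BIT comes from the C header <limits.h>, exposed in C++ as <climits>.
// The C standard requires it to be at least 8, so this assertion never fires
// on a conforming implementation.
static_assert(CHAR_BIT >= 8, "char must have at least 8 bits");
```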

If I should guess about the purpose of that change of wording, it would be to just clarify things for those readers unaware of the dependency on the C standard.

Regarding the effect of the u8 literal prefix, it

  • affects the encoding of the string in the executable, but

  • unfortunately it does not affect the type.

Thus, in both cases, "tørrfisk" and u8"tørrfisk", you get a char const[n]. But in the former literal the encoding is whatever is selected for the compiler, e.g. with Latin 1 (or Windows ANSI Western) that would be 8 bytes for the characters plus a null byte, for array size 9. In the latter literal the encoding is guaranteed to be UTF-8, where the “ø” will be encoded with 2 or 3 bytes (I don’t recall exactly), for a slightly larger array size.
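
The array sizes can be checked at compile time. The sketch below uses escape sequences instead of a raw “ø” so that it does not depend on how the source file itself is saved; the helper array_size is a hypothetical name introduced here. In UTF-8, U+00F8 takes two bytes (C3 B8), so the u8 array has size 10. Note also that since C++20 the u8 prefix does change the type: the element type becomes char8_t rather than char, which is why the helper is templated on the character type.

```cpp
#include <cstddef>

// Hypothetical helper: number of elements in a string literal's array,
// including the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t array_size(const CharT (&)[N]) {
    return N;
}

// Latin-1 bytes written out by hand with a hex escape:
// 8 one-byte characters plus the null.
static_assert(array_size("t\xF8rrfisk") == 9,
              "Latin-1: one byte for the o-slash");

// UTF-8 guaranteed by the u8 prefix: U+00F8 is stored as C3 B8,
// so 9 bytes plus the null.
static_assert(array_size(u8"t\u00F8rrfisk") == 10,
              "UTF-8: two bytes for the o-slash");
```

The size of the plain literal "t\u00F8rrfisk" is deliberately not asserted, since it depends on the compiler's execution character set.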

Answered by Roi Danton

If the execution character set of the compiler is set to UTF-8, it makes no difference if u8 is used or not, since the compiler converts the characters to UTF-8 in both cases.

However, if the compiler's execution character set is the system's non-UTF-8 code page (the default for e.g. Visual C++), then non-ASCII characters might not be properly handled when u8 is omitted. For example, the conversion to wide strings fails with an unhandled exception, e.g. in VS15:

std::string narrowJapanese("スタークラフト");
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convertWindows;
std::wstring wide = convertWindows.from_bytes(narrowJapanese); // Unhandled C++ exception in xlocbuf.
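
To make the failure mode reproducible regardless of compiler settings, the sketch below writes the bytes out explicitly instead of relying on the literal's encoding. The helper converts_from_utf8 and the two constants are hypothetical names; the Shift-JIS bytes are an assumed example of what a non-UTF-8 code page (here code page 932) might store for the first two characters. Note that std::wstring_convert and <codecvt> are deprecated since C++17, so a compiler may emit warnings.

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: true if `bytes` is accepted as UTF-8 by the same
// converter used in the answer, false if the converter rejects it.
// from_bytes throws std::range_error when no fallback string was supplied.
bool converts_from_utf8(const std::string& bytes) {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> converter;
    try {
        converter.from_bytes(bytes);
        return true;
    } catch (const std::range_error&) {
        return false;
    }
}

// "スタークラフト" spelled out as UTF-8 bytes, i.e. what a u8 literal stores.
const std::string utf8_bytes =
    "\xE3\x82\xB9\xE3\x82\xBF\xE3\x83\xBC\xE3\x82\xAF"
    "\xE3\x83\xA9\xE3\x83\x95\xE3\x83\x88";

// "スタ" in Shift-JIS: an assumed stand-in for what the plain literal may
// contain when the execution character set is a Japanese code page.
const std::string shift_jis_bytes = "\x83\x58\x83\x5E";
```

Under these assumptions, converts_from_utf8(utf8_bytes) succeeds, while converts_from_utf8(shift_jis_bytes) fails because 0x83 is not a valid UTF-8 lead byte.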

Answered by Dietmar Kühl

The compiler chooses a native encoding natural to the platform. On typical POSIX systems it will probably choose ASCII, plus something that may depend on the environment's settings for character values outside the ASCII range. On mainframes it will probably choose EBCDIC. Comparing strings received, e.g., from files or the command line will probably work best with the native character set. When processing files explicitly encoded using UTF-8 you are, however, probably best off using u8"..." strings.

That said, with the recent changes relating to character encodings, a fundamental assumption of string processing in C and C++ got broken: each internal character object (char, wchar_t, etc.) used to represent one character. This is clearly not true anymore for a UTF-8 string, where each character object just represents a byte of some character. As a result, all the string manipulation, character classification, etc. functions won't necessarily work on these strings. We don't have any good library lined up to deal with such strings for inclusion into the standard.
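
The mismatch between char objects and characters can be seen with a small hand-rolled counter. This is only a sketch: count_code_points is a hypothetical helper, it assumes its input is already valid UTF-8, and it counts code points, which is still not the same thing as user-perceived characters.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a string assumed to hold valid
// UTF-8 by skipping continuation bytes (bit pattern 10xxxxxx).
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // lead byte or single-byte (ASCII) character
        }
    }
    return count;
}
```

For the UTF-8 bytes of "tørrfisk", written as "t\xC3\xB8rrfisk", std::string::size() reports 9 char objects while count_code_points reports 8 code points.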
