Is a wide character string literal starting with L like L"Hello World" guaranteed to be encoded in Unicode?

Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/1810343/

Time: 2020-08-27 21:13:13  Source: igfitidea

c++ unicode

Asked by Peter

I've recently tried to get the full picture of what it takes to create platform-independent C++ applications that support Unicode. One thing that confuses me is that most how-tos equate the character encoding (i.e. ANSI or Unicode) with the character type (char or wchar_t). As far as I've learned, these are different things: there may be a character sequence encoded in Unicode but stored in a std::string, as well as a character sequence encoded in ANSI but stored in a std::wstring, right?

So the question that comes to my mind is whether the C++ standard gives any guarantee about the encoding of string literals starting with L, or does it just say that the literal is of type wchar_t with an implementation-specific character encoding?

If there is no such guarantee, does that mean I need some sort of external resource system to provide non-ASCII string literals for my application in a platform-independent way? What is the preferred way to do this: a resource system, or proper encoding of source files plus proper compiler options?

Answered by Charles Salvia

The L prefix in front of a string literal simply means that each character in the string will be stored as a wchar_t. But this doesn't necessarily imply Unicode. For example, you could use a wide character string to encode GB 18030, a character set used in China which is similar to Unicode. The C++03 standard doesn't have anything to say about Unicode (C++11, however, defines Unicode character types and string literals), so it's up to you to properly represent Unicode strings in C++03.

Regarding string literals, Chapter 2 (Lexical Conventions) of the C++ standard mentions a "basic source character set", which is essentially a subset of ASCII. So this essentially guarantees that "abc" will be represented as a 3-byte string (not counting the null), and L"abc" will be represented as a 3 * sizeof(wchar_t)-byte string of wide characters.

The standard also mentions "universal-character-names", which allow you to refer to non-ASCII characters using the \uXXXX hexadecimal notation. These "universal-character-names" usually map directly to Unicode values, but the standard doesn't guarantee that they have to. However, by using universal-character-names you can at least guarantee that your string will be represented as a certain sequence of bytes. This will guarantee Unicode output provided the runtime environment supports Unicode, has the appropriate fonts installed, etc.

As for string literals in C++03 source files, again there is no guarantee. If you have a Unicode string literal in your code which contains characters outside of the ASCII range, it is up to your compiler to decide how to interpret these characters. If you want to explicitly guarantee that the compiler will "do the right thing", you'd need to use \uXXXX notation in your string literals.

Answered by eidolon

C++03 does not mention Unicode (the upcoming C++0x does). Currently you have to either use external libraries (ICU, UTF-CPP, etc.) or build your own solution using platform-specific code. As others have mentioned, the wchar_t encoding (or even its size) is not specified. Consequently, string literal encoding is implementation specific. However, you can give Unicode code points in string literals by using the \x, \u, and \U escapes.

Typically, Unicode apps on Windows use wchar_t (with UTF-16 encoding) as the internal character format, because it makes using Windows APIs easier, as Windows itself uses UTF-16. Unix/Linux Unicode apps in turn usually use char (with UTF-8 encoding) internally. If you want to exchange data between different platforms, UTF-8 is the usual choice for the data transfer encoding.

Answered by Martin York

The standard makes no mention of encoding formats for strings.

Take a look at ICU from IBM (it's free): http://site.icu-project.org/
