Using Unicode in C++ source code
Original question: http://stackoverflow.com/questions/331690/
License: this content comes from Stack Overflow and is provided under CC BY-SA 4.0. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Asked by Kresimir Cosic
What is the standard encoding of C++ source code? Does the C++ standard even say something about this? Can I write C++ source in Unicode?
For example, can I use non-ASCII characters such as Chinese characters in comments? If so, is full Unicode allowed or just a subset of Unicode? (e.g., that 16-bit first page or whatever it's called.)
Furthermore, can I use Unicode for strings? For example:
std::wstring str = L"Strange chars: a? ??? ě ";
Accepted answer by Johannes Schaub - litb
Encoding in C++ is quite complicated. Here is my understanding of it.
Every implementation has to support characters from the basic source character set. These include the common characters listed in §2.2/1 (§2.3/1 in C++11). These characters should all fit into one char. In addition, implementations have to support a way to name other characters, called universal-character-names. These look like \uffff or \Uffffffff and can be used to refer to Unicode characters. A subset of them is usable in identifiers (listed in Annex E).
This is all nice, but the mapping from characters in the file, to source characters (used at compile time) is implementation defined. This constitutes the encoding used. Here is what it says literally (C++98 version):
Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences (2.3) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set (2.2) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e. using the \uXXXX notation), are handled equivalently.)
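The equivalence rule in the quoted paragraph can be checked with a small sketch. It assumes the file is saved as UTF-8 and that the compiler also reads it as UTF-8 (the default for GCC); if the two disagree, the comparison may fail. The helper names `same_bytes` and `euro_spellings_match` are made up for this illustration.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper: true if two narrow strings hold the same bytes.
inline bool same_bytes(const char* a, const char* b) {
    return std::strcmp(a, b) == 0;
}

// Compares the euro sign typed directly in the source with the same
// character spelled as a universal-character-name.
// Assumption: the compiler decodes this file using its real encoding.
inline bool euro_spellings_match() {
    return same_bytes("€", "\u20AC");
}

// Small runtime check that can be called from any program.
inline void check_equivalence() {
    assert(euro_spellings_match());
}
```

If the compiler guesses the wrong input encoding, the literal typed directly is the one that changes, which is why the \u spelling is the more portable of the two.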
For gcc, you can change it using the option -finput-charset=charset. Additionally, you can change the execution character set used to represent values at runtime. The proper options for this are -fexec-charset=charset for char (it defaults to utf-8) and -fwide-exec-charset=charset for wchar_t (which defaults to either utf-16 or utf-32 depending on the size of wchar_t).
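To see what the execution character set does in practice, here is a minimal sketch that dumps the bytes a narrow literal was compiled to. The helper names `hex_bytes` and `euro_in_exec_charset` are invented for this example. With GCC's default settings the euro sign should come out as the three UTF-8 bytes E2 82 AC; building with something like -fexec-charset=ISO-8859-15 would be expected to yield a single byte instead, provided your GCC build supports that charset.

```cpp
#include <cassert>
#include <string>

// Returns the bytes of a narrow string as uppercase hex, separated by spaces.
inline std::string hex_bytes(const char* s) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (; *s != '\0'; ++s) {
        const unsigned char byte = static_cast<unsigned char>(*s);
        if (!out.empty()) out += ' ';
        out += digits[byte >> 4];
        out += digits[byte & 0x0F];
    }
    return out;
}

// The euro sign is named with a universal-character-name, so the result
// depends only on the execution charset, not on -finput-charset.
inline std::string euro_in_exec_charset() {
    return hex_bytes("\u20AC");
}

// Runtime check of the helper itself on plain ASCII input.
inline void check_hex_bytes() {
    assert(hex_bytes("AB") == "41 42");
}
```

Printing `euro_in_exec_charset()` from builds made with different -fexec-charset values is a quick way to confirm which encoding your string literals actually end up in.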
Answer by MSalters
In addition to litb's post, MSVC++ supports Unicode too. I understand it gets the Unicode encoding from the BOM. It definitely supports code like int (*?)(); or const std::set<int> ?; If you're really into code obfuscation:
typedef void ‼; // Also known as \u203C
class oo? {
    operator ‼() {}
};
Answer by Head Geek
The C++ standard doesn't say anything about source-code file encoding, so far as I know.
The usual encoding is (or used to be) 7-bit ASCII; some compilers (Borland's, for instance) would balk at characters that used the high bit. There's no technical reason that Unicode characters can't be used if your compiler and editor accept them. Most modern Linux-based tools, and many of the better Windows-based editors, handle UTF-8 encoding with no problem, though I'm not sure that Microsoft's compiler will.
EDIT: It looks like Microsoft's compilers will accept Unicode-encoded files, but will sometimes produce errors on 8-bit ASCII too:
warning C4819: The file contains a character that cannot be represented
in the current code page (932). Save the file in Unicode format to prevent
data loss.
Answer by Max Lybbert
There are two issues at play here. The first is what characters are allowed in C++ code (and comments), such as variable names. The second is what characters are allowed in strings and string literals.
As noted, C++ compilers must support a very restricted ASCII-based character set for the characters allowed in code and comments. In practice, this character set didn't work very well with some European character sets (and especially with some European keyboards that didn't have a few characters, like square brackets, available), so the concept of digraphs and trigraphs was introduced. Many compilers now accept more than this character set, but there isn't any guarantee.
As for strings and string literals, C++ has the concept of a wide character and wide character string. However, the encoding for that character set is undefined. In practice it's almost always Unicode, but I don't think there's any guarantee here. Wide character string literals look like L"string literal", and these can be assigned to std::wstring's.
C++11 added explicit support for Unicode strings and string literals, encoded as UTF-8 (u8"..."), UTF-16 (u"...") and UTF-32 (U"..."). The UTF-16 and UTF-32 literals are stored as char16_t and char32_t code units, so their byte order in memory is simply that of the platform.
Answer by Rob
For encoding in strings I think you are meant to use the \u notation, e.g.:
std::wstring str = L"\u20AC"; // Euro character
Answer by raidsan
In this context, if you get MSVC++ warning C4819, just change the source file encoding to "UTF-8 with BOM".
GCC 4.1 doesn't support a BOM in source files, but GCC 4.4 does, and the latest Qt version uses GCC 4.4, so use "UTF-8 with BOM" as the source file encoding.
Answer by coppro
It's also worth noting that wide characters in C++ aren't really Unicode strings as such. They are just strings of larger characters, usually 16 but sometimes 32 bits wide. This is implementation-defined, though; IIRC you can even have an 8-bit wchar_t. You have no real guarantee as to the encoding in them, so if you are trying to do something like text processing, you will probably want a typedef to the integer type most suitable for your Unicode entity.
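As a rough illustration of the typedef idea, the sketch below converts a wide string into a sequence of code points held in a fixed-width integer type. It rests on an assumption the standard does not guarantee: that wchar_t holds UTF-16 code units when it is 16 bits wide and UTF-32 code points otherwise, which matches common Windows and Linux toolchains. `CodePoint` and `to_code_points` are names invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical alias: one Unicode code point, independent of wchar_t's width.
using CodePoint = std::uint32_t;

// Converts a wide string to code points.
// Assumption: 16-bit wchar_t means UTF-16, anything wider means UTF-32.
inline std::vector<CodePoint> to_code_points(const std::wstring& text) {
    std::vector<CodePoint> out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        CodePoint unit = static_cast<CodePoint>(text[i]);
        const bool is_high = unit >= 0xD800 && unit <= 0xDBFF;
        if (sizeof(wchar_t) == 2 && is_high && i + 1 < text.size()) {
            const CodePoint low = static_cast<CodePoint>(text[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                // Combine a surrogate pair into a single code point.
                unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            }
        }
        out.push_back(unit);
    }
    return out;
}

// Runtime check: the euro sign is one code point on either wchar_t width.
inline void check_code_points() {
    assert(to_code_points(L"\u20AC").size() == 1);
}
```

Text processing code can then work on `std::vector<CodePoint>` and stay the same whether wchar_t is 16 or 32 bits wide on the target platform.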
C++1x has additional Unicode support in the form of UTF-8 encoded string literals (u8"text"), UTF-16 and UTF-32 data types (char16_t and char32_t, IIRC), and corresponding string constants (u"text" and U"text"). The encoding of characters specified without \uxxxx or \Uxxxxxxxx constants is still implementation-defined, though (and there is no encoding support for complex string types outside the literals).
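The three literal prefixes can be compared by counting code units at compile time. This is only a sketch: `code_units` is a helper invented here, and the characters are written as universal-character-names so the result does not depend on how the source file is encoded.

```cpp
#include <cassert>
#include <cstddef>

// Number of code units in a string literal, not counting the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// U+20AC (euro sign) lies in the Basic Multilingual Plane.
static_assert(code_units(u8"\u20AC") == 3, "three UTF-8 code units");
static_assert(code_units(u"\u20AC") == 1, "one UTF-16 code unit");
static_assert(code_units(U"\u20AC") == 1, "one UTF-32 code unit");

// U+1F600 lies outside the BMP, so UTF-16 needs a surrogate pair.
static_assert(code_units(u8"\U0001F600") == 4, "four UTF-8 code units");
static_assert(code_units(u"\U0001F600") == 2, "two UTF-16 code units");
static_assert(code_units(U"\U0001F600") == 1, "one UTF-32 code unit");

// The same helper also works at runtime.
inline void check_code_units() {
    assert(code_units(u8"abc") == 3);
}
```

Because the helper is a template over the character type, it keeps working in C++20, where u8 literals changed from char to char8_t.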
Answer by Klaim
AFAIK it's not standardized, as you can put any type of character in wide strings. You just have to check that your compiler is set to read Unicode source code to make it work right.