Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/17103925/

Time: 2020-08-27 20:56:05. Source: igfitidea

How well is Unicode supported in C++11?

Tags: c++, unicode, c++11

Asked by Ralph Tandetzky

I've read and heard that C++11 supports Unicode. A few questions on that:

  • How well does the C++ standard library support Unicode?
  • Does std::string do what it should?
  • How do I use it?
  • Where are potential problems?

Answer by R. Martinho Fernandes

How well does the C++ standard library support Unicode?

Terribly.

A quick scan through the library facilities that might provide Unicode support gives me this list:

  • Strings library
  • Localization library
  • Input/output library
  • Regular expressions library

I think all but the first one provide terrible support. I'll get back to it in more detail after a quick detour through your other questions.

Does std::string do what it should?

Yes. According to the C++ standard, this is what std::string and its siblings should do:

The class template basic_string describes objects that can store a sequence consisting of a varying number of arbitrary char-like objects with the first element of the sequence at position zero.

Well, std::string does that just fine. Does that provide any Unicode-specific functionality? No.

Should it? Probably not. std::string is fine as a sequence of char objects. That's useful; the only annoyance is that it is a very low-level view of text and standard C++ doesn't provide a higher-level one.

How do I use it?

Use it as a sequence of char objects; pretending it is something else is bound to end in pain.

Where are potential problems?

All over the place? Let's see...

Strings library

The strings library provides us with basic_string, which is merely a sequence of what the standard calls "char-like objects". I call them code units. If you want a high-level view of text, this is not what you are looking for. This is a view of text suitable for serialization/deserialization/storage.

It also provides some tools from the C library that can be used to bridge the gap between the narrow world and the Unicode world: c16rtomb/mbrtoc16 and c32rtomb/mbrtoc32.

Localization library

The localization library still believes that one of those "char-like objects" equals one "character". This is of course silly, and makes it impossible to get lots of things working properly beyond some small subset of Unicode like ASCII.

Consider, for example, what the standard calls "convenience interfaces" in the <locale> header:

template <class charT> bool isspace (charT c, const locale& loc);
template <class charT> bool isprint (charT c, const locale& loc);
template <class charT> bool iscntrl (charT c, const locale& loc);
// ...
template <class charT> charT toupper(charT c, const locale& loc);
template <class charT> charT tolower(charT c, const locale& loc);
// ...

How do you expect any of these functions to properly categorize, say, U+1F34C BANANA, as in u8"🍌" or u8"\U0001F34C"? There's no way it will ever work, because those functions take only one code unit as input.

This could work with an appropriate locale if you used char32_t only: U'\U0001F34C' is a single code unit in UTF-32.
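
To make the "one code unit" limitation concrete, here is a minimal sketch (not part of the original answer; the helper names are made up for illustration) that counts how many code units U+1F34C occupies in each encoding form. Only the UTF-32 form fits into the single charT argument that those convenience functions accept.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helpers: each returns the number of code units needed
// to store the single code point U+1F34C in one encoding form.

std::size_t utf8_code_units() {
    // sizeof counts the terminating null too, hence the "- 1".
    // Each UTF-8 code unit is one byte (char, or char8_t since C++20).
    return sizeof(u8"\U0001F34C") - 1;
}

std::size_t utf16_code_units() {
    return std::u16string(u"\U0001F34C").size();  // a surrogate pair
}

std::size_t utf32_code_units() {
    return std::u32string(U"\U0001F34C").size();  // a single code unit
}
```

With these, the UTF-8 form takes four code units, the UTF-16 form two, and the UTF-32 form one.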

However, that still means you only get the simple casing transformations with toupper and tolower, which, for example, are not good enough for some German locales: "ß" uppercases to "SS"☦ but toupper can only return one code unit.

Next up, wstring_convert/wbuffer_convert and the standard code conversion facets.

wstring_convert is used to convert strings in one given encoding into strings in another given encoding. There are two string types involved in this transformation, which the standard calls a byte string and a wide string. Since these terms are really misleading, I prefer to use "serialized" and "deserialized", respectively, instead†.

The source and target encodings are determined by a codecvt (a code conversion facet) passed as a template type argument to wstring_convert.
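
As a rough illustration of how the facet is passed, here is a minimal sketch (my own, not from the original answer) that converts between serialized UTF-8 and deserialized UTF-32 with codecvt_utf8<char32_t>. The function names are hypothetical. Note that <codecvt> and wstring_convert were deprecated in C++17, so a compiler may warn about this code, and some toolchains may need extra configuration to accept it.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: serialized UTF-8 bytes -> deserialized UTF-32 code units.
// from_bytes throws std::range_error if the conversion fails.
std::u32string utf8_to_utf32(const std::string& bytes) {
    // The facet (first template argument) selects the pair of encodings;
    // the second argument is the code unit type of the deserialized string.
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(bytes);
}

// Hypothetical helper: deserialized UTF-32 code units -> serialized UTF-8 bytes.
std::string utf32_to_utf8(const std::u32string& text) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(text);
}
```

For instance, utf32_to_utf8(U"\U0001F34C") should yield the four bytes F0 9F 8D 8C, and utf8_to_utf32 should turn them back into a single char32_t.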

wbuffer_convert performs a similar function but as a deserialized stream buffer that wraps a serialized stream buffer. Any I/O is performed through the underlying serialized stream buffer with conversions to and from the encodings given by the codecvt argument. Writing serializes into that buffer, and then writes from it, and reading reads into the buffer and then deserializes from it.

The standard provides some codecvt class templates for use with these facilities: codecvt_utf8, codecvt_utf16, codecvt_utf8_utf16, and some codecvt specializations. Together these standard facets provide all the following conversions. (Note: in the following list, the encoding on the left is always the serialized string/streambuf, and the encoding on the right is always the deserialized string/streambuf; the standard allows conversions in both directions).

  • UTF-8 ↔ UCS-2 with codecvt_utf8<char16_t>, and codecvt_utf8<wchar_t> where sizeof(wchar_t) == 2;
  • UTF-8 ↔ UTF-32 with codecvt_utf8<char32_t>, codecvt<char32_t, char, mbstate_t>, and codecvt_utf8<wchar_t> where sizeof(wchar_t) == 4;
  • UTF-16 ↔ UCS-2 with codecvt_utf16<char16_t>, and codecvt_utf16<wchar_t> where sizeof(wchar_t) == 2;
  • UTF-16 ↔ UTF-32 with codecvt_utf16<char32_t>, and codecvt_utf16<wchar_t> where sizeof(wchar_t) == 4;
  • UTF-8 ↔ UTF-16 with codecvt_utf8_utf16<char16_t>, codecvt<char16_t, char, mbstate_t>, and codecvt_utf8_utf16<wchar_t> where sizeof(wchar_t) == 2;
  • narrow ↔ wide with codecvt<wchar_t, char, mbstate_t>;
  • no-op with codecvt<char, char, mbstate_t>.

Several of these are useful, but there is a lot of awkward stuff here.

First off: holy high surrogate! That naming scheme is messy.

Then, there's a lot of UCS-2 support. UCS-2 is an encoding from Unicode 1.0 that was superseded in 1996 because it only supports the basic multilingual plane. Why the committee thought it desirable to focus on an encoding that was superseded over 20 years ago, I don't know‡. It's not like support for more encodings is bad or anything, but UCS-2 shows up too often here.

I would say that char16_t is obviously meant for storing UTF-16 code units. However, this is one part of the standard that thinks otherwise. codecvt_utf8<char16_t> has nothing to do with UTF-16. For example, wstring_convert<codecvt_utf8<char16_t>, char16_t>().to_bytes(u"\U0001F34C") will compile fine, but will fail unconditionally: the input will be treated as the UCS-2 string u"\xD83C\xDF4C", which cannot be converted to UTF-8 because UTF-8 cannot encode any value in the range 0xD800-0xDFFF.
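
A small sketch of the contrast, under the assumption that your standard library follows the behavior described above (the helper names are mine, not the standard's). The facet that actually treats char16_t as UTF-16 is codecvt_utf8_utf16<char16_t>; the UCS-2 facet is wrapped in a try/catch so you can check on your own implementation whether it rejects a surrogate pair.

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: treats the input as UTF-16, so a surrogate pair
// is combined into one code point before being encoded as UTF-8.
std::string utf16_to_utf8(const std::u16string& text) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(text);
}

// Hypothetical helper: treats the input as UCS-2 and reports whether
// the conversion succeeded. to_bytes signals failure by throwing
// std::range_error when no error string was supplied.
bool ucs2_facet_accepts(const std::u16string& text) {
    std::wstring_convert<std::codecvt_utf8<char16_t>, char16_t> conv;
    try {
        conv.to_bytes(text);
        return true;
    } catch (const std::range_error&) {
        return false;
    }
}
```

Calling utf16_to_utf8(u"\U0001F34C") should give the four UTF-8 bytes F0 9F 8D 8C, while ucs2_facet_accepts(u"\U0001F34C") is expected to return false on implementations that behave as the answer describes.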

Still on the UCS-2 front, there is no way to read from a UTF-16 byte stream into a UTF-16 string with these facets. If you have a sequence of UTF-16 bytes you can't deserialize it into a string of char16_t. This is surprising, because it is more or less an identity conversion. Even more surprising, though, is the fact that there is support for deserializing from a UTF-16 stream into a UCS-2 string with codecvt_utf16<char16_t>, which is actually a lossy conversion.

The UTF-16-as-bytes support is quite good, though: it supports detecting endianness from a BOM, or selecting it explicitly in code. It also supports producing output with and without a BOM.

Some other interesting conversion possibilities are absent. There is no way to deserialize from a UTF-16 byte stream or string into a UTF-8 string, since UTF-8 is never supported as the deserialized form.

And here the narrow/wide world is completely separate from the UTF/UCS world. There are no conversions between the old-style narrow/wide encodings and any Unicode encodings.

Input/output library

The I/O library can be used to read and write text in Unicode encodings using the wstring_convert and wbuffer_convert facilities described above. I don't think there's much else that would need to be supported by this part of the standard library.

Regular expressions library

I have expounded upon problems with C++ regexes and Unicode on Stack Overflow before. I will not repeat all those points here, but merely state that C++ regexes don't have level 1 Unicode support, which is the bare minimum to make them usable without resorting to using UTF-32 everywhere.

That's it?

Yes, that's it. That's the existing functionality. There's lots of Unicode functionality that is nowhere to be seen, like normalization or text segmentation algorithms.
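
To see why the lack of normalization matters, consider this minimal sketch (the variable and function names are made up for illustration). The same user-perceived character "é" can be encoded as one precomposed code point or as a base letter plus a combining mark, and the standard library has no way to tell that the two are canonically equivalent; it can only compare code units.

```cpp
#include <string>

// "é" as the single precomposed code point U+00E9, encoded in UTF-8.
const std::string precomposed = "\xC3\xA9";

// "é" as U+0065 followed by U+0301 COMBINING ACUTE ACCENT, encoded in UTF-8.
const std::string decomposed = "e\xCC\x81";

// Hypothetical helper: plain std::string comparison, which compares the
// stored code units and knows nothing about canonical equivalence.
bool same_code_units(const std::string& a, const std::string& b) {
    return a == b;
}
```

Here same_code_units(precomposed, decomposed) is false, even though both strings render as the same text. Making them compare equal requires normalizing both to the same form first, which is what libraries such as ICU provide.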

U+1F4A9. Is there any way to get some better Unicode support in C++?

The usual suspects: ICU and Boost.Locale.



† A byte string is, unsurprisingly, a string of bytes, i.e., char objects. However, unlike a wide string literal, which is always an array of wchar_t objects, a "wide string" in this context is not necessarily a string of wchar_t objects. In fact, the standard never explicitly defines what a "wide string" means, so we're left to guess the meaning from usage. Since the standard terminology is sloppy and confusing, I use my own, in the name of clarity.

Encodings like UTF-16 can be stored as sequences of char16_t, which then have no endianness; or they can be stored as sequences of bytes, which have endianness (each consecutive pair of bytes can represent a different char16_t value depending on endianness). The standard supports both of these forms. A sequence of char16_t is more useful for internal manipulation in the program. A sequence of bytes is the way to exchange such strings with the external world. The terms I'll use instead of "byte" and "wide" are thus "serialized" and "deserialized".
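
The point about endianness can be sketched in a few lines (the function names are hypothetical and only for illustration): the same pair of bytes deserializes to a different char16_t value depending on which byte order you assume.

```cpp
// Hypothetical helper: interpret two serialized bytes as one UTF-16
// code unit in big-endian order (most significant byte first).
char16_t from_big_endian(unsigned char first, unsigned char second) {
    return static_cast<char16_t>((first << 8) | second);
}

// Hypothetical helper: interpret the same two bytes in little-endian
// order (least significant byte first).
char16_t from_little_endian(unsigned char first, unsigned char second) {
    return static_cast<char16_t>((second << 8) | first);
}
```

For the byte pair D8 3C, the big-endian reading gives 0xD83C (the high surrogate of U+1F34C), while the little-endian reading gives 0x3CD8, an unrelated code unit. Once the value is stored in a char16_t, the question of byte order no longer arises.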

‡ If you are about to say "but Windows!" hold your horses. All versions of Windows since Windows 2000 use UTF-16.

☦ Yes, I know about the großes Eszett (ẞ), but even if you were to change all German locales overnight to have ß uppercase to ẞ, there are still plenty of other cases where this would fail. Try uppercasing U+FB00 LATIN SMALL LIGATURE FF. There is no LATIN CAPITAL LIGATURE FF; it just uppercases to two Fs. Or U+01F0 LATIN SMALL LETTER J WITH CARON; there's no precomposed capital; it just uppercases to a capital J and a combining caron.

Answer by Matthieu M.

Unicode is not supported by the Standard Library (for any reasonable meaning of supported).

std::string is no better than std::vector<char>: it is completely oblivious to Unicode (or any other representation/encoding) and simply treats its content as a blob of bytes.

If you only need to store and concatenate blobs, it works pretty well; but as soon as you wish for Unicode functionality (number of code points, number of graphemes, etc.) you are out of luck.

The only comprehensive library I know of for this is ICU. The C++ interface was derived from the Java one though, so it's far from being idiomatic.

Answer by uckelman

You can safely store UTF-8 in a std::string (or in a char[] or char*, for that matter), because a Unicode NUL (U+0000) is a null byte in UTF-8 and this is the sole way a null byte can occur in UTF-8. Hence, your UTF-8 strings will be properly terminated according to all of the C and C++ string functions, and you can sling them around with C++ iostreams (including std::cout and std::cerr, so long as your locale is UTF-8).

What you cannot do with std::string for UTF-8 is get the length in code points. std::string::size() will tell you the string length in bytes, which is only equal to the number of code points when you're within the ASCII subset of UTF-8.
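
If all you need is the code point count, a hand-written loop is one option. The following is only a sketch under the assumption that the string already holds well-formed UTF-8 (it does no validation), and count_code_points is a name I made up. It relies on the fact that every code point has exactly one byte that is not a continuation byte of the form 10xxxxxx.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a std::string that is
// assumed to contain well-formed UTF-8. Malformed input is not detected.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        // Continuation bytes look like 10xxxxxx; every other byte
        // starts a new code point.
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the seven-byte string "a\xC3\xA9\xF0\x9F\x8D\x8C" (a, U+00E9, U+1F34C), size() reports 7 while this function reports 3. Note that this counts code points, not user-perceived characters; grapheme counting needs a library such as ICU.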

If you need to operate on UTF-8 strings at the code point level (i.e. not just store and print them) or if you're dealing with UTF-16, which is likely to have many internal null bytes, you need to look into the wide character string types.
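
The null-byte difference is easy to demonstrate. In this sketch (the names are mine, chosen for illustration) the same C function, std::strlen, sees all of a UTF-8 byte string but stops after the first character of text serialized as UTF-16 bytes.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// U+65E5 U+672C serialized as UTF-8: six bytes, none of them zero.
const std::string utf8_bytes = "\xE6\x97\xA5\xE6\x9C\xAC";

// "Hi" serialized as UTF-16LE: every other byte is zero.
// The explicit length keeps the embedded zero bytes in the std::string.
const std::string utf16le_bytes("H\0i\0", 4);

// Hypothetical helper: length as seen by null-terminated C string functions.
std::size_t c_string_length(const std::string& s) {
    return std::strlen(s.c_str());
}
```

c_string_length(utf8_bytes) agrees with utf8_bytes.size() (both 6), whereas c_string_length(utf16le_bytes) is 1 even though the string holds 4 bytes. That is why UTF-16 data belongs in a type with 16-bit code units, such as std::u16string, rather than in char-based C strings.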

Answer by Some programmer dude

C++11 has a couple of new string literal types for Unicode.

Unfortunately, the support in the standard library for non-uniform encodings (like UTF-8) is still bad. For example, there is no nice way to get the length (in code points) of a UTF-8 string.

Answer by Jakob Riedle

However, there is a pretty useful library called tiny-utf8, which is basically a drop-in replacement for std::string/std::wstring. It aims to fill the gap of the still missing utf8-string container class.

This might be the most comfortable way of 'dealing' with UTF-8 strings (that is, without Unicode normalization and similar stuff). You comfortably operate on code points, while your string stays encoded in run-length-encoded chars.