Strings and character encoding in C++

Notice: this page is a Chinese-English parallel translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Stack Overflow original: http://stackoverflow.com/questions/3950588/

Time: 2020-08-28 14:07:23  Source: igfitidea

Tags: c++, string, unicode, utf-8, character-encoding

Asked by nassar

I read a few posts about best practices for strings and character encoding in C++, but I am struggling a bit with finding a general purpose approach that seems to me reasonably simple and correct. Could I ask for comments on the following? I'm inclined to use UTF-8 and UTF-32, and to define something like:

typedef std::string string8;
typedef std::basic_string<uint32_t> string32;

The string8 class would be used for UTF-8, and having a separate type is just a reminder of the encoding. An alternative would be for string8 to be a subclass of std::string and to remove the methods that aren't quite right for UTF-8.

The string32 class would be used for UTF-32 when a fixed character size is desired.

The UTF-8 CPP functions, utf8::utf8to32() and utf8::utf32to8(), or even simpler wrapper functions, would be used to convert between the two.

Accepted answer by Matthieu M.

If you plan on just passing strings around and never inspecting them, you can use plain std::string, though it's a poor man's solution.

The issue is that most frameworks, even the standard, have stupidly (I think) enforced an encoding in memory. I say stupid because encoding should only matter at the interface, and those encodings are not suited to in-memory manipulation of the data.

Furthermore, encoding is easy (it's a simple mapping from code point to bytes and back), while the main difficulty is actually manipulating the data.
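
To illustrate how mechanical the code point to bytes step is, here is a minimal sketch of a UTF-8 encoder. The name encode_utf8 is hypothetical (it is not from the answer or from any library), and returning an empty string for invalid input is just one possible convention.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
// Returns an empty string for values that are not Unicode scalar values
// (surrogates, or anything above U+10FFFF).
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogate range
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, encode_utf8(0x20AC) (the euro sign) yields the three bytes E2 82 AC. Nothing here helps with manipulating the text afterwards, which is the hard part.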

With an 8-bit or 16-bit encoding you run the risk of cutting a character in the middle, because neither std::string nor std::wstring is aware of what a Unicode character is. Worse, even with a 32-bit encoding, there is the risk of separating a character from the diacritics that apply to it, which is also stupid.
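
The cutting problem can be sketched as follows. The helper truncate_utf8 is a hypothetical name, and it assumes its input is already valid UTF-8; it only backs up to a code point boundary and knows nothing about combining diacritics.

```cpp
#include <cstddef>
#include <string>

// True if the byte is a UTF-8 continuation byte (bit pattern 10xxxxxx).
bool is_continuation(unsigned char byte) {
    return (byte & 0xC0) == 0x80;
}

// Hypothetical helper: keep at most max_bytes bytes of a UTF-8 string
// without splitting a multi-byte sequence. Assumes valid UTF-8 input.
std::string truncate_utf8(const std::string& s, std::size_t max_bytes) {
    if (s.size() <= max_bytes) return s;
    std::size_t end = max_bytes;
    // If the cut would land inside a sequence, back up to its lead byte.
    while (end > 0 && is_continuation(static_cast<unsigned char>(s[end]))) {
        --end;
    }
    return s.substr(0, end);
}
```

With s = "a\xC3\xA9" (the letter a followed by U+00E9), s.substr(0, 2) leaves a dangling lead byte, while truncate_utf8(s, 2) returns just "a". The diacritic problem remains even in UTF-32: std::u32string(U"e\u0301") has size 2 although a reader sees a single character, and splitting it correctly needs Unicode segmentation rules rather than byte or code point arithmetic.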

The support of Unicode in C++ is therefore extremely subpar, as far as the standard is concerned.

If you really wish to manipulate Unicode strings, you need a Unicode-aware container. The usual way is to use the ICU library, though its interface is really C-ish. However, you'll get everything you need to actually work in Unicode with multiple languages.

Answer by skimobear

The traits approach described here might be helpful. It's an old but useful technique.

Answer by cytrinox

The standard does not specify which character encoding must be used for string, wstring, etc. The common way is to use Unicode in wide strings. Which types and encodings you should use depends on your requirements.

If you only need to pass data from A to B, choose std::string with UTF-8 encoding (don't introduce a new type, just use std::string). If you must work with strings (extract, concat, sort, ...), choose std::wstring with UCS2/UTF-16 (BMP only) as the encoding on Windows and UCS4/UTF-32 on Linux. The benefit is the fixed size: each character occupies 2 bytes (or 4 for UCS4), while length() on a std::string holding UTF-8 returns the number of bytes, not the number of characters.
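
The length() mismatch can be shown with a small sketch. The function count_code_points is a hypothetical helper, not part of the standard library, and it assumes the string holds valid UTF-8. It counts code points, which is still not the same as user-perceived characters when combining marks are involved.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a UTF-8 string by counting
// every byte that is not a continuation byte (10xxxxxx).
// Assumes the input is valid UTF-8.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the string "na\xC3\xAFve" (the word naive spelled with U+00EF), length() reports 6 bytes while count_code_points reports 5.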

For conversion, you can check whether sizeof(std::wstring::value_type) is 2 or 4 to choose UCS2 or UCS4. I'm using the ICU library, but there may be simple wrapper libraries.
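
A minimal sketch of that size check, done at compile time. The enum WideEncoding and the function wide_encoding are hypothetical names. Mapping a 2-byte wchar_t to UTF-16 and a 4-byte wchar_t to UTF-32 is the assumption this answer makes; the C++ standard itself does not guarantee either encoding.

```cpp
#include <string>

// Hypothetical tag for the encoding we assume a wide string uses.
enum class WideEncoding { Utf16, Utf32 };

// Pick the assumed encoding from the size of the wide character type.
// Assumption: 2 bytes means UTF-16 (typical on Windows) and 4 bytes
// means UTF-32 (typical on Linux); the standard does not promise this.
constexpr WideEncoding wide_encoding() {
    static_assert(sizeof(std::wstring::value_type) == 2 ||
                      sizeof(std::wstring::value_type) == 4,
                  "unexpected wchar_t size");
    return sizeof(std::wstring::value_type) == 2 ? WideEncoding::Utf16
                                                 : WideEncoding::Utf32;
}
```

Conversion code can then branch on wide_encoding() once, instead of scattering sizeof checks throughout.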

Deriving from std::string is not recommended because basic_string is not designed for inheritance (it lacks virtual members, etc.). If you really, really, really need your own type, like std::basic_string<my_char_type>, write a custom specialization for it.
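
If a distinct type is still wanted, one common alternative to inheritance is composition: wrap a std::string and expose only the operations that make sense for UTF-8. This is a sketch of that swapped-in technique, not something the answer proposes; the class Utf8String and its member names are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical wrapper: holds UTF-8 bytes in a std::string member
// instead of deriving from std::string. Only byte-safe operations
// are exposed; there is no operator[] or substr to misuse.
class Utf8String {
public:
    Utf8String() = default;
    explicit Utf8String(std::string bytes) : bytes_(std::move(bytes)) {}

    // Read-only access to the raw bytes, e.g. for I/O.
    const std::string& bytes() const { return bytes_; }

    // Size in bytes, named explicitly so it is not mistaken for a
    // character count.
    std::size_t size_in_bytes() const { return bytes_.size(); }

    // Concatenating two valid UTF-8 strings yields valid UTF-8.
    void append(const Utf8String& other) { bytes_ += other.bytes_; }

private:
    std::string bytes_;
};
```

The wrapper does not validate its input; a real version would likely check the bytes in the constructor.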

The new C++0x standard defines wstring_convert<> and wbuffer_convert<> to convert with a std::codecvt from a narrow charset to a wide charset (for example, UTF-8 to UCS2). Visual Studio 2010 has already implemented this, AFAIK.
