Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, attribute it to the original authors (not me), and cite the original source: StackOverflow, http://stackoverflow.com/questions/6072342/

Time: 2020-08-28 19:26:26  Source: igfitidea

How to use utf8 character arrays in c++?

Tags: c++, utf-8

Asked by sekmet64

Is it possible to make char* strings work with UTF-8 encoding in C++ (VC2010)?

For example if my source file is saved in utf8 and I write something like this:

const char* c = "a?áé??";

Is it possible to make this UTF-8 encoded? And if so, how can I use

char* c2 = new char[strlen("a?áé??")];

for dynamic allocation, given that characters can have variable length?
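
On the allocation part of the question: strlen counts bytes rather than characters, so a multi-byte UTF-8 sequence is already included in its result; the buffer only needs one extra byte for the terminating NUL. A minimal sketch, assuming the string really is UTF-8. The helper names duplicate_bytes and count_code_points are hypothetical and not taken from any answer here:

```cpp
#include <cstring>

// Hypothetical helper: copies a NUL-terminated string byte for byte.
// strlen returns the number of bytes, so multi-byte UTF-8 characters
// are already accounted for; only the terminator needs the extra +1.
char* duplicate_bytes(const char* src) {
    const std::size_t n = std::strlen(src);
    char* dst = new char[n + 1];
    std::memcpy(dst, src, n + 1);  // copies the terminating NUL as well
    return dst;                    // caller releases it with delete[]
}

// Hypothetical helper: counts code points in a valid UTF-8 string by
// skipping continuation bytes, which always have the form 10xxxxxx.
std::size_t count_code_points(const char* s) {
    std::size_t count = 0;
    for (; *s != '\0'; ++s) {
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the bytes "a\xC3\xA1\xC3\xA9" (a, U+00E1, U+00E9), strlen reports 5 while count_code_points reports 3. In new code a std::string would usually replace the manual new[]/delete[] pair.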

Accepted answer by James Kanze

The encoding of narrow character string literals is implementation-defined, so you'd really have to read the documentation (if you can find it). A quick experiment shows that both VC++ (VC8, anyway) and g++ (4.4.2, anyway) actually just copy the bytes from the source file; the string literal will be in whatever encoding your editor saved it in. (This is clearly in violation of the standard, but it seems to be common practice.)

C++11 has UTF-8 string literals, which allow you to write u8"text" and be assured that "text" is encoded in UTF-8. But I don't really expect it to work reliably: the problem is that in order to do this, the compiler has to know what encoding your source file has. In all probability, compiler writers will continue to ignore the issue, just copying the bytes from the source file, and achieve conformance simply by documenting that the source file must be in UTF-8 for these features to work.
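
One way to sidestep the source-encoding concern is to combine the u8 prefix with a universal character name, so the source file stays pure ASCII. This is only a sketch, assuming a compiler with C++11 support (VC2010 itself does not accept u8); the function name a_acute_utf8 is made up for illustration:

```cpp
#include <cstddef>
#include <vector>

// Returns the bytes of the UTF-8 literal for U+00E1 (a with acute),
// without the trailing NUL. Because the character is spelled as \u00E1,
// the result does not depend on how the source file was saved.
std::vector<unsigned char> a_acute_utf8() {
    // const char[3] before C++20, const char8_t[3] from C++20 on;
    // binding by reference keeps the array type so sizeof works.
    const auto& lit = u8"\u00E1";
    std::vector<unsigned char> bytes;
    for (std::size_t i = 0; i + 1 < sizeof(lit); ++i) {
        bytes.push_back(static_cast<unsigned char>(lit[i]));
    }
    return bytes;
}
```

On a conforming compiler the vector should hold the two bytes 0xC3 0xA1, the UTF-8 encoding of U+00E1.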

Answer by Klaim

If the text you want to put in the string is in your source code, make sure your source code file is in UTF-8.

If that doesn't work, try using \u1234, with 1234 being a code point value.
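
A \u escape in a plain narrow literal is still converted to the compiler's execution character set, which may or may not be UTF-8. The sketch below compares such a literal against explicitly spelled UTF-8 bytes at run time; the names kAAcuteUtf8, kAAcuteNarrow and narrow_literals_are_utf8 are hypothetical, and the hex-escape spelling is an addition of this sketch, not something the answer proposes:

```cpp
#include <cstring>

// U+00E1 written as explicit UTF-8 bytes: independent of both the
// source encoding and the execution character set.
const char* const kAAcuteUtf8 = "\xC3\xA1";

// The same character as a universal character name in a narrow literal:
// the compiler encodes it in its execution character set.
const char* const kAAcuteNarrow = "\u00E1";

// True when the compiler encoded the narrow literal as UTF-8.
bool narrow_literals_are_utf8() {
    return std::strcmp(kAAcuteNarrow, kAAcuteUtf8) == 0;
}
```

Calling narrow_literals_are_utf8() once at startup shows which behaviour a given compiler and set of options actually produce.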

You can also try using UTF8-CPP.

Take a look at this answer: Using Unicode in C++ source code

Answer by vladasimovic

It is possible: save the file as UTF-8 without a BOM signature.

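// Save as UTF-8 without a BOM signature
#include <stdio.h>
#include <string.h>   // strlen, strcpy
#include <windows.h>  // SetConsoleOutputCP
int main(){
    SetConsoleOutputCP(65001);  // 65001 is the UTF-8 code page (CP_UTF8)
    const char *c1 = "a?áé??";
    char *c2 = new char[strlen(c1) + 1];  // strlen counts bytes; +1 for the terminating NUL
    strcpy(c2, c1);
    printf("%s\n", c1);
    printf("%s\n", c2);
    delete[] c2;
    return 0;
}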

Result:

 D:\Debug>program
a?áé??
a?áé??

Redirecting the program's output to a file really does produce a UTF-8 encoded file.
This is a compiler-independent answer (compiled on Windows).
(A similar question.)

Answer by yasouser

See this MSDN article on converting between string types (it should give you examples of how to use them). The string types covered include char *, wchar_t*, _bstr_t, CComBSTR, CString, basic_string, and System.String:

How to: Convert Between Various String Types

Answer by Zoner

There is a hotfix for Visual Studio 2010 SP1 which can help: http://support.microsoft.com/kb/980263.

The hotfix adds a pragma that overrides Visual Studio's choice of character encoding for the char type:

#pragma execution_character_set("utf-8")

Without the pragma, char*-based literals are typically encoded in the default code page (typically 1252).
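
To see what the pragma actually changes, it can help to dump the bytes of a literal. A small sketch: the pragma is guarded so that other compilers skip it, and the helper name to_hex is hypothetical:

```cpp
#include <string>

#ifdef _MSC_VER
// Only has an effect on Visual C++ versions that support this pragma
// (for VS2010 SP1, that means with the hotfix installed).
#pragma execution_character_set("utf-8")
#endif

// Hypothetical helper: renders each byte of a NUL-terminated string as
// two lowercase hex digits, separated by single spaces.
std::string to_hex(const char* s) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (; *s != '\0'; ++s) {
        const unsigned char b = static_cast<unsigned char>(*s);
        if (!out.empty()) out += ' ';
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}
```

Passing a literal that contains U+00E1 should yield "c3 a1" when narrow literals are UTF-8, and "e1" when they are in code page 1252; the exact outcome depends on the compiler and its settings.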

This should all be superseded eventually by the new string literal prefixes specified by C++0x (u8, u, and U for UTF-8, UTF-16, and UTF-32 respectively), which ideally will be supported in the next major version of Visual Studio after 2010.