Original question: http://stackoverflow.com/questions/8513249/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Handling UTF-8 in C++
Asked by Lanbo
To find out if C++ is the right language for a project of mine, I wanna test its UTF-8 capabilities. Going by the references I found, I built this example:
#include <string>
#include <iostream>
using namespace std;

int main() {
    wstring str;
    while(getline(wcin, str)) {
        wcout << str << endl;
        if(str.empty()) break;
    }
    return 0;
}
But when I type in a non-ASCII UTF-8 character, it misbehaves:
$ > ./utf8
Hello
Hello
für
f
$ >
Not only does it not print the ü, but it also quits immediately. gdb told me there was no crash, but a normal exit, yet I find that hard to believe.
Accepted answer by robert petranovic
Don't use wstring on Linux.
Take a look at the first answer to the question quoted below. I'm sure it answers your question.
- When I should use std::wstring over std::string?
On Linux? Almost never (§).
On Windows? Almost always (§).
Answer by vitakot
The language itself has nothing to do with Unicode or any other character encoding; that is tied to the operating system. Windows uses UTF-16 for Unicode support, which implies using wide chars (16-bit wide chars): wchar_t or std::wstring. The Unicode variants of the Windows API functions that operate on strings require wide-char input.
But Unix-based systems, e.g. Mac OS X or Linux, use UTF-8. Of course, it is only a matter of how you interpret the bytes in the array, so you can even store a UTF-16 string in a plain C array or a std::string container. This is why you rarely see wstrings in cross-platform code; instead, all strings are handled as UTF-8 and re-encoded to UTF-16 when necessary (on Windows).
There are several ways to handle this somewhat confusing topic. Personally I do it as described above: I strictly use UTF-8 throughout the application, re-encode strings when interacting with the Windows API, and use them directly on Mac OS X. For the re-encoding on Windows I use these great conversion helpers:
C++ UTF-8 Conversion Helpers (on MSDN, available under the Apache License, Version 2.0).
You can also use the cross-platform Qt string class (QString), which provides conversion functions between UTF-8, UTF-16 and other encodings (ANSI, Latin...).
So the answer above is correct: on Unix always use UTF-8 (std::string, char), and on Windows use UTF-16 (std::wstring, wchar_t).
Answer by nick
Remember that on startup of the main program, the "C" locale is selected by default. You probably don't want this if you handle UTF-8. Calling setlocale(LC_CTYPE, "") replaces this default with whatever is defined in the environment (presumably a UTF-8 locale).