如何在 C++ 中使用 Unicode?
声明:本页面是 Stack Overflow 热门问题的中英对照翻译,遵循 CC BY-SA 4.0 协议。如果您需要使用它,必须同样遵循 CC BY-SA 许可:注明原文地址和作者信息,并将其归于原作者(不是我):Stack Overflow
原文地址: http://stackoverflow.com/questions/3010739/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me):
StackOverFlow
How to use Unicode in C++?
提问by Dox
Assuming a very simple program that:
假设一个非常简单的程序:
- ask a name.
- store the name in a variable.
- display the variable content on the screen.
- 问一个名字。
- 将名称存储在变量中。
- 在屏幕上显示变量内容。
It's so simple that it's the first thing one learns.
它如此简单,是人们学习的第一件事。
But my problem is that I don't know how to do the same thing if I enter the name using japanese characters.
但我的问题是,如果我使用日语字符输入名称,我不知道如何做同样的事情。
So, if you know how to do this in C++, please show me an example (that I can compile and test)
所以,如果你知道如何在 C++ 中做到这一点,请给我看一个例子(我可以编译和测试)
Thanks.
谢谢。
user362981: Thanks for your help. I compiled the code that you wrote without problem, then the console window appears and I cannot enter any Japanese characters in it (using an IME). Also, if I change a word in your code ("hello") to one that contains Japanese characters, it will not display them either.
user362981:感谢您的帮助。我编译你写的代码没有问题,然后控制台窗口出现了,但我无法在其中(使用 IME)输入任何日语字符。此外,如果我把你代码中的一个单词("hello")改成包含日语字符的单词,它也无法显示这些字符。
Svisstack : Also thanks for your help. But when I compile your code I get the following error:
Svisstack:也感谢您的帮助。但是当我编译你的代码时,我收到以下错误:
warning: deprecated conversion from string constant to 'wchar_t*'
error: too few arguments to function 'int swprintf(wchar_t*, const wchar_t*, ...)'
error: at this point in file
warning: deprecated conversion from string constant to 'wchar_t*'
回答by Thanatos
You're going to get a lot of answers about wide characters. Wide characters, specifically wchar_t, do not equal Unicode. You can use them (with some pitfalls) to store Unicode, just as you can an unsigned char. wchar_t is extremely system-dependent. To quote the Unicode Standard, version 5.2, chapter 5:
你会得到很多关于宽字符的答案。宽字符,特别是 wchar_t,并不等于 Unicode。你可以用它们(有一些陷阱)来存储 Unicode,就像你可以用 unsigned char 一样。wchar_t 非常依赖于系统。引用 Unicode 标准 5.2 版第 5 章:
With the wchar_t wide character type, ANSI/ISO C provides for inclusion of fixed-width, wide characters. ANSI/ISO C leaves the semantics of the wide character set to the specific implementation but requires that the characters from the portable C execution set correspond to their wide character equivalents by zero extension.
对于 wchar_t 宽字符类型,ANSI/ISO C 提供了对定宽宽字符的支持。ANSI/ISO C 将宽字符集的语义留给具体实现,但要求可移植 C 执行集中的字符通过零扩展对应到其宽字符等价物。
and that
以及
The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be Unicode characters in some compilers.
wchar_t 的宽度是特定于编译器的,可以小到 8 位。因此,需要在任何 C 或 C++ 编译器之间移植的程序不应使用 wchar_t 来存储 Unicode 文本。wchar_t 类型用于存储编译器定义的宽字符,在某些编译器中这可能是 Unicode 字符。
So, it's implementation-defined. Here are two implementations: on Linux, wchar_t is 4 bytes wide, and represents text in the UTF-32 encoding (regardless of the current locale). (Either BE or LE depending on your system, whichever is native.) Windows, however, has a 2-byte-wide wchar_t, and represents UTF-16 code units with it. Completely different.
所以,它是实现定义的。这里有两个实现:在 Linux 上,wchar_t 是 4 字节宽,以 UTF-32 编码表示文本(与当前语言环境无关)。(BE 还是 LE 取决于你的系统,以本机字节序为准。)然而,Windows 的 wchar_t 是 2 字节宽,用它表示 UTF-16 代码单元。完全不同。
A better path: learn about locales, as you'll need to know that. For example, because I have my environment set up to use UTF-8 (Unicode), the following program will use Unicode:
更好的途径:了解语言环境(locale),因为你需要了解它。例如,由于我的环境设置为使用 UTF-8(Unicode),下面的程序就会使用 Unicode:
#include <clocale>
#include <iostream>
#include <string>

int main()
{
    std::setlocale(LC_ALL, "");  // adopt the user's locale, e.g. en_US.UTF-8
    std::cout << "What's your name? ";
    std::string name;
    std::getline(std::cin, name);
    std::cout << "Hello there, " << name << "." << std::endl;
    return 0;
}
...
...
$ ./uni_test
What's your name? 佐藤 幹夫
Hello there, 佐藤 幹夫.
$ echo $LANG
en_US.UTF-8
But there's nothing Unicode about it. It merely reads in characters, which come in as UTF-8 because I have my environment set that way. I could just as easily say "heck, I'm part Czech, let's use ISO-8859-2": Suddenly, the program is getting input in ISO-8859-2, but since it's just regurgitating it, it doesn't matter, the program will still perform correctly.
但它本身与 Unicode 毫无关系。它只是读入字符,这些字符以 UTF-8 的形式进来,因为我的环境就是这样设置的。我完全可以说"嘿,我有捷克血统,咱们用 ISO-8859-2 吧":于是程序突然收到的是 ISO-8859-2 的输入,但因为它只是原样吐回去,所以没有关系,程序仍然会正确运行。
Now, if that example had read in my name, and then tried to write it out into an XML file, and stupidly wrote <?xml version="1.0" encoding="UTF-8" ?> at the top, it would be right when my terminal was in UTF-8, but wrong when my terminal was in ISO-8859-2. In the latter case, it would need to convert it before serializing it to the XML file. (Or, just write ISO-8859-2 as the encoding for the XML file.)
现在,如果那个例子读入了我的名字,然后试图把它写到一个 XML 文件中,并且愚蠢地在文件开头写上 <?xml version="1.0" encoding="UTF-8" ?>,那么当我的终端是 UTF-8 时它是对的,但当我的终端是 ISO-8859-2 时它就是错的。在后一种情况下,需要在将其序列化到 XML 文件之前先做转换。(或者,直接把 ISO-8859-2 写成该 XML 文件的编码。)
On many POSIX systems, the current locale is typically UTF-8, because it provides several advantages to the user, but this isn't guaranteed. Just outputting UTF-8 to stdout will usually be correct, but not always. Say I am using ISO-8859-2: if you mindlessly output an ISO-8859-1 "è" (0xE8) to my terminal, I'll see a "č" (0xE8). Likewise, if you output a UTF-8 "è" (0xC3 0xA8), I'll see (in ISO-8859-2) "Ă¨" (0xC3 0xA8). This barfing of incorrect characters has been called mojibake.
在许多 POSIX 系统上,当前的语言环境通常是 UTF-8,因为它为用户带来若干好处,但这并没有保证。直接向 stdout 输出 UTF-8 通常是正确的,但并非总是如此。假设我用的是 ISO-8859-2:如果你不加思考地把一个 ISO-8859-1 的"è"(0xE8)输出到我的终端,我会看到"č"(0xE8)。同样,如果你输出 UTF-8 的"è"(0xC3 0xA8),我会看到(ISO-8859-2 下的)"Ă¨"(0xC3 0xA8)。这种吐出错误字符的现象被称为 mojibake(乱码)。
Often, you're just shuffling data around, and it doesn't matter much. This typically comes into play when you need to serialize data. (Many internet protocols use UTF-8 or UTF-16, for example: if you got data from an ISO-8859-2 terminal, or a text file encoded in Windows-1252, then you have to convert it, or you'll be sending Mojibake.)
通常,你只是把数据搬来搬去,这无关紧要。这通常在你需要序列化数据时才成为问题。(许多互联网协议使用 UTF-8 或 UTF-16,例如:如果你的数据来自 ISO-8859-2 终端,或来自以 Windows-1252 编码的文本文件,你就必须先转换,否则你发送出去的就是 mojibake。)
Sadly, this is about the state of Unicode support in both C and C++. You have to remember: these languages are really system-agnostic, and don't bind to any particular way of doing it. That includes character sets. There are tons of libraries out there, however, for dealing with Unicode and other character sets.
可悲的是,C 和 C++ 中的 Unicode 支持现状大致就是如此。你必须记住:这些语言实际上与系统无关,不绑定任何特定的做法,字符集也包括在内。不过,有大量的库可以用来处理 Unicode 和其他字符集。
In the end, it's not all that complicated really: know what encoding your data is in, and know what encoding your output should be in. If they're not the same, you need to do a conversion. This applies whether you're using std::cout or std::wcout. In my examples, stdin or std::cin and stdout/std::cout were sometimes in UTF-8, sometimes ISO-8859-2.
最后,其实并没有那么复杂:知道你的数据是什么编码,知道你的输出应该是什么编码。如果两者不同,你就需要做一次转换。无论你用的是 std::cout 还是 std::wcout,这一点都适用。在我的例子中,stdin 或 std::cin 与 stdout/std::cout 有时是 UTF-8,有时是 ISO-8859-2。
回答by EvanED
Try replacing cout with wcout, cin with wcin, and string with wstring. Depending on your platform, this may work:
尝试用 wcout 替换 cout,用 wcin 替换 cin,用 wstring 替换 string。根据您的平台,这可能有效:
#include <iostream>
#include <string>
int main() {
std::wstring name;
std::wcout << L"Enter your name: ";
std::wcin >> name;
std::wcout << L"Hello, " << name << std::endl;
}
There are other ways, but this is sort of the "minimal change" answer.
还有其他方法,但这是“最小变化”的答案。
回答by Svisstack
#include <stdio.h>
#include <wchar.h>

int main()
{
    wchar_t name[256];
    wprintf(L"Type a name: ");
    wscanf(L"%255ls", name);                /* %ls reads a wide string; 255 guards the buffer */
    wprintf(L"Typed name is: %ls\n", name); /* %ls, not %s, for wchar_t* outside MSVC */
    return 0;
}
回答by Nick Bastin
You can do simple things with the generic wide character support in your OS of choice, but generally C++ doesn't have good built-in support for unicode, so you'll be better off in the long run looking into something like ICU.
您可以在您选择的操作系统中使用通用宽字符支持做一些简单的事情,但通常 C++ 没有对 unicode 的良好内置支持,因此从长远来看,您会更好地研究ICU 之类的东西。
回答by zar
Pre-requisite: http://www.joelonsoftware.com/articles/Unicode.html
先决条件:http://www.joelonsoftware.com/articles/Unicode.html
The above article is a must-read which explains what Unicode is, but a few lingering questions remain. Yes, UNICODE has a unique code point for every character in every language, and furthermore these can be encoded and stored in memory potentially differently from the actual code point values. This way we can save memory, for example by using the UTF-8 encoding, which is great if the supported language is just English, since the memory representation is essentially the same as ASCII – provided, of course, that we know the encoding. In theory, if we know the encoding, we can store these longer UNICODE characters however we like and read them back. But the real world is a little different.
上面的文章是必读的,它解释了 Unicode 是什么,但仍然留有一些问题。是的,UNICODE 为每种语言中的每个字符都分配了唯一的代码点,而且它们在内存中的编码和存储方式可能与实际代码点的值不同。这样我们就可以节省内存,例如使用 UTF-8 编码:如果只需要支持英语,那么内存表示基本上与 ASCII 相同——当然前提是知道编码本身。理论上,只要知道编码,我们可以按任意方式存储这些更长的 UNICODE 字符并读回。但现实世界有点不同。
How do you store a UNICODE character/string in a C++ program? Which encoding do you use? The answer is that you don't use any encoding: you directly store the UNICODE code points in a unicode character string, just like you store ASCII characters in an ASCII string. The question is what character size you should use, since UNICODE characters have no fixed size. The simple answer is that you choose a character size wide enough to hold the highest character code point (language) that you want to support.
如何在 C++ 程序中存储 UNICODE 字符/字符串?使用哪种编码?答案是不使用任何编码,而是直接把 UNICODE 代码点存储在 unicode 字符串中,就像把 ASCII 字符存储在 ASCII 字符串中一样。问题在于应该使用多大的字符,因为 UNICODE 字符没有固定大小。简单的答案是:选择一个足够宽的字符大小,以容纳你想支持的最高字符代码点(语言)。
The theory that a UNICODE character can take 2 bytes or more still holds true, and this can create some confusion. Shouldn't we be storing code points in 3 or 4 bytes, which is what it really takes to represent all unicode characters? Why does Visual C++ store unicode in wchar_t then, which is only 2 bytes, clearly not enough to store every UNICODE code point?
UNICODE 字符可以占用 2 个或更多字节的说法仍然成立,而这可能造成一些困惑。我们难道不应该用 3 或 4 个字节来存储代码点吗?那才真正足以表示所有 unicode 字符。那么为什么 Visual C++ 把 unicode 存储在只有 2 个字节的 wchar_t 中?这显然不足以存储每一个 UNICODE 代码点。
The reason we store a UNICODE character code point in 2 bytes in Visual C++ is actually exactly the same reason we were storing an ASCII (=English) character in one byte. At that time, we were thinking of only English, so one byte was enough. Now we are thinking of most international languages out there, but not all, so we use 2 bytes, which is enough. Yes, it's true this representation will not allow us to represent those code points which take 3 bytes or more, but we don't care about those yet, because those folks haven't even bought a computer yet. Yes, we are not using 3 or 4 bytes because we are still stingy with memory – why store an extra 0 (zero) byte with every character when we are never going to use it (that language)? Again, these are exactly the same reasons why ASCII stored each character in one byte: why store a character in 2 or more bytes when English can be represented in one byte, with room to spare for those extra special characters!
我们在 Visual C++ 中用 2 个字节存储 UNICODE 字符代码点的原因,实际上与我们把 ASCII(=英文)字符存储在一个字节中的原因完全相同。当时我们只考虑英文,一个字节就够了。现在我们考虑的是大多数国际语言,但不是全部,所以用 2 个字节就足够了。是的,这种表示确实无法表示那些需要 3 个字节或更多的代码点,但我们暂时不关心它们,因为那些人甚至还没有买电脑。是的,我们不用 3 或 4 个字节,因为我们仍然对内存很吝啬——既然永远用不到(那种语言),为什么要给每个字符都存一个多余的 0(零)字节呢?同样,这与 ASCII 把每个字符存储在一个字节中的原因完全相同:既然英文可以用一个字节表示,而且还能给那些额外的特殊字符留出空间,为什么要用 2 个或更多字节来存储一个字符呢!
In theory, 2 bytes are not enough to represent every Unicode code point, but it is enough to hold anything that we may ever care about for now. A true UNICODE string representation could store each character in 4 bytes, but we just don't care about those languages.
理论上,2 个字节不足以表示每一个 Unicode 代码点,但足以容纳我们目前可能关心的任何内容。真正的 UNICODE 字符串表示可以把每个字符存储在 4 个字节中,但我们并不关心那些语言。
Imagine 1000 years from now, when we find friendly aliens in abundance and want to communicate with them, incorporating their countless languages. A single unicode character size will grow further, perhaps to 8 bytes, to accommodate all their code points. That doesn't mean we should start using 8 bytes for each unicode character now. Memory is a limited resource; we allocate what we need.
想象一下 1000 年后,我们发现了大量友好的外星人,并希望用他们无数的语言与之交流。单个 unicode 字符的大小可能会进一步增长,比如到 8 个字节,以容纳他们所有的代码点。这并不意味着我们现在就应该为每个 unicode 字符使用 8 个字节。内存是有限的资源,我们按需分配。
Can I handle UNICODE string as C Style string?
我可以将 UNICODE 字符串作为 C 样式字符串处理吗?
In C++, an ASCII string can still be handled the C way, and that's fairly common: grab it by its char* pointer, to which C functions can be applied. However, applying current C-style string functions to a UNICODE string will not make any sense, because it could contain single NULL bytes, which would terminate a C string.
在 C++ 中,ASCII 字符串仍然可以按 C 的方式处理,而且这很常见:通过它的 char * 指针来操作,并对其应用 C 函数。但是,把现有的 C 风格字符串函数用在 UNICODE 字符串上没有任何意义,因为它里面可能含有单个 NULL 字节,而这会终止一个 C 字符串。
A UNICODE string is no longer a plain buffer of text – well, it is, but it is now more complicated than a stream of single-byte characters terminated by a NULL byte. This buffer could be handled through its pointer even in C, but that requires UNICODE-compatible calls, or a C library that can read and write those strings and perform operations on them.
UNICODE 字符串不再是一个普通的文本缓冲区——好吧,它仍然是,但它比以 NULL 字节结尾的单字节字符流更复杂。即使在 C 中,这个缓冲区也可以通过指针来处理,但这需要与 UNICODE 兼容的调用,或者一个能够读写这些字符串并执行操作的 C 库。
This is made easier in C++ with a specialized class that represents a UNICODE string. This class handles the complexity of the unicode string buffer and provides an easy interface. It also decides whether each character of the unicode string is 2 bytes or more – these are implementation details. Today it may use wchar_t (2 bytes), but tomorrow it may use 4 bytes for each character to support more (less known) languages. This is why it is always better to use TCHAR than a fixed size, since it maps to the right size when the implementation changes.
这在 C++ 中可以通过一个表示 UNICODE 字符串的专用类变得更容易。这个类处理 unicode 字符串缓冲区的复杂性,并提供简单的接口。它还决定 unicode 字符串的每个字符是 2 个字节还是更多——这些是实现细节。今天它可能使用 wchar_t(2 个字节),但明天它可能为每个字符使用 4 个字节,以支持更多(鲜为人知的)语言。这就是为什么使用 TCHAR 总比使用固定大小更好:当实现改变时,它会映射到正确的大小。
How do I index a UNICODE string?
如何索引 UNICODE 字符串?
It is also worth noting, particularly with C-style handling of strings, that an index is used to traverse a string or find a substring in it. In an ASCII string this index corresponds directly to the position of a character in the string, but it has no such meaning in a UNICODE string and should be avoided.
还值得注意的是,特别是在 C 风格的字符串处理中,通常使用索引来遍历字符串或在字符串中查找子串。在 ASCII 字符串中,这个索引直接对应字符在字符串中的位置,但它在 UNICODE 字符串中没有这种意义,应当避免使用。
What happens to the string-terminating NULL byte?
字符串的终止 NULL 字节会怎样?
Are UNICODE strings still terminated by a NULL byte? Is a single NULL byte enough to terminate the string? This is an implementation question, but a NULL character is still one unicode code point, and like every other code point, it must still be the same size as any other (especially when there is no encoding). So the NULL character must be two bytes as well if the unicode string implementation is based on wchar_t. All UNICODE code points are represented by the same size, whether it's the null character or any other.
UNICODE 字符串是否仍以 NULL 字节结尾?单个 NULL 字节是否足以终止字符串?这是一个实现问题,但 NULL 字符仍然是一个 unicode 代码点,和所有其他代码点一样,它的大小也必须与其他代码点相同(特别是在没有编码的情况下)。因此,如果 unicode 字符串的实现基于 wchar_t,那么 NULL 字符也必须是两个字节。所有 UNICODE 代码点都用相同的大小表示,无论它是 NULL 字符还是其他字符。
Does Visual C++ Debugger shows UNICODE text?
Visual C++ 调试器是否显示 UNICODE 文本?
Yes, if text buffer is type LPWSTR or any other type that supports UNICODE, Visual Studio 2005 and up support displaying the international text in debugger watch window (provided fonts and language packs are installed of course).
是的,如果文本缓冲区是 LPWSTR 类型或任何其他支持 UNICODE 的类型,Visual Studio 2005 和更高版本支持在调试器监视窗口中显示国际文本(当然,前提是安装了字体和语言包)。
Summary:
概括:
C++ doesn't use any encoding to store unicode characters; it directly stores the UNICODE code point for each character in a string. It must pick a character size large enough to hold the largest character of the desired languages (loosely speaking), and that character size is fixed and used for all characters in the string.
C++ 不使用任何编码来存储 unicode 字符,而是直接存储字符串中每个字符的 UNICODE 代码点。它必须选择足够大的字符大小,以容纳所需语言中最大的字符(粗略地说),并且该字符大小是固定的,用于字符串中的所有字符。
Right now, 2 bytes are sufficient to represent most of the languages that we care about; this is why 2 bytes are used to represent a code point. In the future, if a new friendly space colony is discovered and we want to communicate with them, we will have to assign new unicode code points to their language and use a larger character size to store those strings.
目前,2 个字节足以表示我们关心的大多数语言,这就是用 2 个字节表示代码点的原因。将来如果发现了一个想与之交流的新的友好太空殖民地,我们将不得不为他们的语言分配新的 unicode 代码点,并使用更大的字符大小来存储这些字符串。