Difference between MBCS and UTF-8 on Windows

Notice: this page is a Chinese-English side-by-side translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/3298569/

Time: 2020-09-09 07:28:21  Source: igfitidea

Tags: windows, unicode, character-encoding, mbcs

Asked by Naveen

I am reading about character sets and encodings on Windows. I noticed that there are two compiler flags in the Visual Studio compiler (for C++) called MBCS and UNICODE. What is the difference between them? What I am not getting is how UTF-8 is conceptually different from an MBCS encoding. Also, I found the following quote in MSDN:

Unicode is a 16-bit character encoding

This negates whatever I read about Unicode. I thought Unicode could be encoded with different encodings such as UTF-8 and UTF-16. Can somebody shed some more light on this confusion?

Answered by dan04

I noticed that there are two compiler flags in the Visual Studio compiler (for C++) called MBCS and UNICODE. What is the difference between them?

Many functions in the Windows API come in two versions: one that takes char parameters (in a locale-specific code page) and one that takes wchar_t parameters (in UTF-16).

int MessageBoxA(HWND hWnd, const char* lpText, const char* lpCaption, unsigned int uType);
int MessageBoxW(HWND hWnd, const wchar_t* lpText, const wchar_t* lpCaption, unsigned int uType);

Each of these function pairs also has a macro without the suffix, which depends on whether the UNICODE macro is defined.

#ifdef UNICODE
   #define MessageBox MessageBoxW
#else
   #define MessageBox MessageBoxA
#endif

In order to make this work, the TCHAR type is defined to abstract away the character type used by the API functions.

#ifdef UNICODE
    typedef wchar_t TCHAR;
#else
    typedef char TCHAR;
#endif

This, however, was a bad idea. You should always explicitly specify the character type.
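
To make the objection concrete, here is a minimal portable sketch of the same generic-text pattern. The names DEMO_UNICODE, DemoChar, DEMO_TEXT and demo_strlen are hypothetical stand-ins for UNICODE, TCHAR, TEXT() and _tcslen; they are not part of any Windows header. The point is only that the first function silently changes character type with a build flag, while the second states its type at the call site.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical stand-ins for UNICODE, TCHAR, TEXT() and _tcslen.
#ifdef DEMO_UNICODE
typedef wchar_t DemoChar;
#define DEMO_TEXT(s) L##s
#define demo_strlen std::wcslen
#else
typedef char DemoChar;
#define DEMO_TEXT(s) s
#define demo_strlen std::strlen
#endif

// Generic-text style: this line compiles to char or wchar_t code
// depending on whether DEMO_UNICODE is defined.
inline std::size_t caption_length() {
    const DemoChar* caption = DEMO_TEXT("Hello");
    return demo_strlen(caption);
}

// Explicit style: the character type is visible where it is used.
inline std::size_t wide_caption_length() {
    const wchar_t* caption = L"Hello";
    return std::wcslen(caption);
}
```

Both functions return the same length for this ASCII text; the difference is that only the second one keeps its meaning when the build configuration changes.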

What I am not getting is how UTF-8 is conceptually different from an MBCS encoding.

MBCS stands for "multi-byte character set". For the literal-minded, it seems that UTF-8 would qualify.

But in Windows, "MBCS" only refers to character encodings that can be used with the "A" versions of the Windows API functions. This includes code pages 932 (Shift_JIS), 936 (GBK), 949 (KS_C_5601-1987), and 950 (Big5), but NOT UTF-8.

To use UTF-8, you have to convert the string to UTF-16 using MultiByteToWideChar, call the "W" version of the function, and call WideCharToMultiByte on the output. This is essentially what the "A" functions actually do, which makes me wonder why Windows doesn't just support UTF-8.
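
As a rough illustration of the UTF-8 to UTF-16 step, here is a portable sketch of what that conversion does to the bytes. On Windows itself you would call MultiByteToWideChar with the CP_UTF8 code page rather than write this by hand. The helper name utf8_to_utf16 is hypothetical, and the sketch assumes well-formed, non-truncated input and performs no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: decodes well-formed UTF-8 and re-encodes it as UTF-16.
// Malformed input is NOT detected in this sketch.
inline std::u16string utf8_to_utf16(const std::string& utf8) {
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp;
        std::size_t len;
        if (lead < 0x80)      { cp = lead;        len = 1; }  // 0xxxxxxx
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; }  // 110xxxxx
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; }  // 1110xxxx
        else                  { cp = lead & 0x07; len = 4; }  // 11110xxx
        assert(i + len <= utf8.size());  // assumption: input is not truncated
        for (std::size_t k = 1; k < len; ++k) {
            // Each continuation byte (10xxxxxx) contributes six payload bits.
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        }
        i += len;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above U+FFFF become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

The resulting UTF-16 string is what a "W" function expects; going back to UTF-8 for the output is the mirror image of this loop.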

This inability to support the most common character encoding makes the "A" version of the Windows API useless. Therefore, you should always use the "W" functions.

Unicode is a 16-bit character encoding

This negates whatever I read about Unicode.

MSDN is wrong. Unicode is a 21-bit coded character set that has several encodings, the most common being UTF-8, UTF-16, and UTF-32. (There are other Unicode encodings as well, such as GB18030, UTF-7, and UTF-EBCDIC.)

Whenever Microsoft refers to "Unicode", they really mean UTF-16 (or UCS-2). This is for historical reasons. Windows NT was an early adopter of Unicode, back when 16 bits was thought to be enough for everyone, and UTF-8 was only used on Plan 9. So UCS-2 was Unicode.

Answered by Jichao

_MBCS and _UNICODE are macros that determine which version of the TCHAR.H routines to call. For example, if you use _tcsclen to count the length of a string, the preprocessor maps _tcsclen to a different function depending on the two macros _MBCS and _UNICODE.

_UNICODE & _MBCS Not Defined: strlen  
_MBCS Defined: _mbslen  
_UNICODE Defined: wcslen  

To explain the difference between these string-length functions, consider the following example.
Suppose you have a computer running the Simplified Chinese edition of Windows, which uses GBK (code page 936), and you compile and run a GBK-encoded source file.

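/* Cast to int so the size_t results match the %d format specifier. */
printf("%d\n", (int)_mbslen((const unsigned char*)"I爱你M"));
printf("%d\n", (int)strlen("I爱你M"));
/* Deliberately misreads the GBK bytes as UTF-16 code units. */
printf("%d\n", (int)wcslen((const wchar_t*)"I爱你M"));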

The result would be 4 6 3.

Here is the hexadecimal representation of I爱你M in GBK.

GBK:             49 B0 AE C4 E3 4D 00                

_mbslen knows this string is encoded in GBK, so it can interpret the string correctly and gets the right result of 4 characters: 49 as I, B0 AE as 爱, C4 E3 as 你, and 4D as M.

strlen only knows about the 0x00 terminator and counts the bytes before it, so it gets 6.

wcslen treats this byte array as UTF-16LE and counts every two bytes as one unit, so it gets 3 units: 49 B0, AE C4, E3 4D.

As @xiaokaoy pointed out, the only valid terminator for wcslen is 00 00. Thus the result is not guaranteed to be 3 if the byte following the string's single 00 terminator is not also 00.
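
The three counts can be reproduced without Windows by scanning the bytes by hand. This is only a sketch: the function names are hypothetical, the GBK rule is simplified to "a byte in 0x81..0xFE starts a two-byte character", and the buffer is given an explicit 00 00 terminator so that the 16-bit scan is well defined (which is exactly the caveat raised above). Byte pairs are used instead of wchar_t because wchar_t is only 16 bits wide on Windows.

```cpp
#include <cassert>
#include <cstddef>

// GBK bytes of "I爱你M", followed by a two-byte 00 00 terminator.
static const unsigned char kGbkBytes[] = {0x49, 0xB0, 0xAE, 0xC4,
                                          0xE3, 0x4D, 0x00, 0x00};

// strlen-style: count bytes up to the first 0x00.
inline std::size_t count_bytes(const unsigned char* s) {
    std::size_t n = 0;
    while (s[n] != 0x00) ++n;
    return n;
}

// _mbslen-style for GBK (simplified): a lead byte in 0x81..0xFE
// starts a two-byte character, anything else is a single byte.
inline std::size_t count_gbk_chars(const unsigned char* s) {
    std::size_t n = 0;
    while (*s != 0x00) {
        const bool is_lead = (*s >= 0x81 && *s <= 0xFE);
        if (is_lead) {
            assert(s[1] != 0x00);  // assumption: no truncated character
            s += 2;
        } else {
            s += 1;
        }
        ++n;
    }
    return n;
}

// wcslen-style: treat each byte pair as one 16-bit unit and stop
// only at a 00 00 pair.
inline std::size_t count_16bit_units(const unsigned char* s) {
    std::size_t n = 0;
    while (!(s[0] == 0x00 && s[1] == 0x00)) {
        s += 2;
        ++n;
    }
    return n;
}
```

Called on kGbkBytes, the three functions give the 6, 4 and 3 discussed in the answer.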

Answered by stakx - no longer contributing

MBCS means Multi-Byte Character Set and describes any character set where a character is encoded into (possibly) more than 1 byte.

The ANSI / ASCII character sets are not multi-byte.

UTF-8, however, is a multi-byte encoding. It encodes any Unicode character as a sequence of 1, 2, 3, or 4 octets (bytes).

However, UTF-8 is only one out of several possible concrete encodings of the Unicode character set. Notably, UTF-16 is another, and happens to be the encoding used by Windows / .NET (IIRC). Here's the difference between UTF-8 and UTF-16:

  • UTF-8 encodes any Unicode character as a sequence of 1, 2, 3, or 4 bytes.

  • UTF-16 encodes most Unicode characters as 2 bytes, and some as 4 bytes.
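
The size rules in the two bullets above can be written down directly. This is a small sketch with hypothetical function names; it takes a Unicode code point as a plain integer and ignores the fact that the surrogate range U+D800 to U+DFFF is not valid on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of bytes UTF-8 needs for one code point.
inline std::size_t utf8_bytes(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);     // upper limit of the Unicode code space
    if (cp < 0x80) return 1;    // ASCII range
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3; // rest of the Basic Multilingual Plane
    return 4;                   // supplementary planes
}

// Number of bytes UTF-16 needs for one code point.
inline std::size_t utf16_bytes(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    return cp < 0x10000 ? 2 : 4;  // one 16-bit unit, or a surrogate pair
}
```

For example, U+7231 (爱) takes 3 bytes in UTF-8 but 2 in UTF-16, while a code point such as U+1F600 takes 4 bytes in both.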

It is therefore not correct that Unicode is a 16-bit character encoding. It's rather something like a 21-bit coded character set, as it encompasses code points from U+0000 up to U+10FFFF.

Answered by Chris

As a footnote to the other answers, MSDN has a document, "Generic-Text Mappings in TCHAR.H", with handy tables summarizing how the preprocessor directives _UNICODE and _MBCS change the definition of different C/C++ types.

As to the phrasing "Unicode" and "Multi-Byte Character Set", people have already described what the effects are. I just want to emphasize that both of those are Microsoft-speak for some very specific things. (That is, they mean something less general and more particular to Windows than one might expect coming from a non-Microsoft-specific understanding of text internationalization.) Those exact phrases show up in Microsoft technical documents and tend to get their own separate sections or subsections, e.g. in "Text and Strings in Visual C++".