Why doesn't Windows allow UTF-8 as the "ANSI" code page?
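Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow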
Original URL: http://stackoverflow.com/questions/2995111/
Why isn't UTF-8 allowed as the "ANSI" code page?
Asked by dan04
The Windows _setmbcp function allows any valid code page...
(except UTF-7 and UTF-8, which are not supported)
OK, not supporting UTF-7 makes sense: Characters have non-unique representations and that introduces complexity and security risks.
But why not UTF-8?
As I understand it, the "ANSI" versions of the Windows API functions convert their arguments to UTF-16, call the equivalent "W" function, and convert any strings in the output to "ANSI". This is what I've been doing manually. So why can't Windows do it for me?
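The manual round trip described in the question starts with a UTF-8 to UTF-16 conversion. On Windows this step is normally done by calling MultiByteToWideChar() with CP_UTF8. The sketch below is a portable, hand-rolled stand-in for that step, so it can be read and compiled without the Windows headers. The function name utf8_to_utf16 and its error convention are assumptions made for illustration, not part of any Windows or CRT API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode a NUL-terminated UTF-8 string into UTF-16
 * code units. Returns the number of code units written (excluding the
 * terminating 0), or (size_t)-1 on malformed input or a too-small buffer. */
size_t utf8_to_utf16(const char *src, uint16_t *dst, size_t dst_cap)
{
    /* Smallest code point that legitimately needs 0..3 continuation bytes. */
    static const uint32_t min_cp[4] = { 0, 0x80, 0x800, 0x10000 };
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;

    while (*s) {
        uint32_t cp;
        int extra;  /* number of continuation bytes expected */

        if (*s < 0x80)                { cp = *s;        extra = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; }
        else return (size_t)-1;       /* stray continuation or invalid lead */
        s++;

        for (int i = 0; i < extra; i++, s++) {
            /* A NUL here also fails this check, so we never read past the end. */
            if ((*s & 0xC0) != 0x80) return (size_t)-1;
            cp = (cp << 6) | (uint32_t)(*s & 0x3F);
        }

        /* Reject overlong forms, surrogates, and values beyond U+10FFFF. */
        if (cp < min_cp[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;

        if (cp < 0x10000) {
            if (n + 1 >= dst_cap) return (size_t)-1;  /* keep room for the 0 */
            dst[n++] = (uint16_t)cp;
        } else {
            if (n + 2 >= dst_cap) return (size_t)-1;
            cp -= 0x10000;                            /* encode as a surrogate pair */
            dst[n++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }

    if (dst_cap == 0) return (size_t)-1;
    dst[n] = 0;
    return n;
}
```

In the manual approach, the resulting UTF-16 buffer would then be passed to the "W" variant of the API, and any strings that come back would be converted in the opposite direction.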
Accepted answer by Dean Harding
The "ANSI" codepage is basically legacy: Windows 9X era. All modern software should be Unicode (that is, UTF-16) based anyway.
Basically, when the Ansi code page stuff was originally designed, UTF-8 wasn't even invented and so support for multi-byte encodings was rather haphazard (i.e. most Ansi code pages are single byte, with the exception of some East Asian code pages which are one-or-two byte). Adding support for "proper" multi-byte encodings was probably deemed not worth the effort when all new development should be done in UTF-16 anyway.
Answer by Remy Lebeau
_setmbcp() is a VC++ RTL function, not a Win32 API function. It only affects how the RTL interprets strings. It has no effect whatsoever on Win32 API A functions. When they call their W counterparts internally, the A functions always use MultiByteToWideChar() and WideCharToMultiByte() specifying codepage 0 (CP_ACP) to use the system default Ansi codepage for the conversions.
Answer by jamesdlin
Michael Kaplan, an internationalization expert from Microsoft, tried to answer this on his blog.
Basically his explanation is that even though the "ANSI" versions of Windows API functions are meant to handle different code pages, historically there was an implicit expectation that character encodings would require at most two bytes per code point. UTF-8 doesn't meet that expectation, and changing all of those functions now would require a massive amount of testing.
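To make the "at most two bytes" expectation concrete, here is a minimal sketch of how the length of a UTF-8 sequence follows from its first byte. The function name utf8_sequence_length is a hypothetical helper, not a Windows or CRT function, and it only inspects the bit pattern of the lead byte (it does not reject overlong or out-of-range sequences).

```c
#include <stddef.h>

/* Hypothetical helper: number of bytes in the UTF-8 sequence introduced by
 * lead byte b, judged from its high bits only. Returns 0 for a byte that
 * cannot start a sequence (a continuation byte or 0xF8 and above). */
size_t utf8_sequence_length(unsigned char b)
{
    if (b < 0x80)           return 1;  /* 0xxxxxxx: U+0000..U+007F    */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx: U+0080..U+07FF    */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx: U+0800..U+FFFF    */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx: U+10000..U+10FFFF */
    return 0;
}
```

For example, the euro sign U+20AC is encoded as the three bytes E2 82 AC, and any code point above U+FFFF takes four bytes, so code written around a one-or-two-byte assumption would mishandle such characters.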