C#: Why does .NET use the UTF-16 encoding for string, but UTF-8 as the default for saving files?
Disclaimer: this page is a translation of a popular StackOverFlow question, published under the CC BY-SA 4.0 license. If you use it, you must likewise comply with the CC BY-SA license, include the original link and author information, and attribute it to the original authors (not me): StackOverFlow
Original question: http://stackoverflow.com/questions/14942092/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me):
StackOverFlow
Why does .NET use the UTF-16 encoding for string, but UTF-8 as the default for saving files?
Asked by Royi Namir
Essentially, string uses the UTF-16 character encoding form
But when saving via StreamWriter:

This constructor creates a StreamWriter with UTF-8 encoding without a Byte-Order Mark (BOM).
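For instance, a minimal sketch (the file name is just an illustration) showing that this StreamWriter overload produces UTF-8 bytes with no BOM:

```csharp
using System;
using System.IO;

class StreamWriterDefault
{
    static void Main()
    {
        // This overload defaults to UTF-8 without a byte-order mark.
        using (var writer = new StreamWriter("demo.txt"))
        {
            writer.Write("héllo");
        }

        // 'é' (U+00E9) becomes the two bytes C3 A9 in UTF-8,
        // and there is no EF BB BF BOM at the start of the file.
        byte[] bytes = File.ReadAllBytes("demo.txt");
        Console.WriteLine(BitConverter.ToString(bytes)); // 68-C3-A9-6C-6C-6F
    }
}
```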
I've seen this sample (broken link removed):
And it looks like utf8 is smaller for some strings, while utf-16 is smaller for some other strings.
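You can measure this directly with Encoding.GetByteCount; a quick sketch (the sample strings are my own):

```csharp
using System;
using System.Text;

class SizeComparison
{
    static void Compare(string s)
    {
        int utf8 = Encoding.UTF8.GetByteCount(s);
        int utf16 = Encoding.Unicode.GetByteCount(s); // Encoding.Unicode is UTF-16LE

        Console.WriteLine($"\"{s}\": utf-8 = {utf8} bytes, utf-16 = {utf16} bytes");
    }

    static void Main()
    {
        Compare("Hello, world"); // ASCII text: 1 byte/char in UTF-8, 2 in UTF-16
        Compare("你好,世界");     // CJK text: 3 bytes/char in UTF-8, 2 in UTF-16
    }
}
```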
- So why does .NET use utf16 as the default encoding for string, while using utf8 for saving files?
Thank you.
P.S. I've already read the famous article.
Accepted answer by Jon Skeet
If you're happy ignoring surrogate pairs (or equivalently, the possibility of your app needing characters outside the Basic Multilingual Plane), UTF-16 has some nice properties, basically due to always requiring two bytes per code unit and representing all BMP characters in a single code unit each.
Consider the primitive type char. If we use UTF-8 as the in-memory representation and want to cope with all Unicode characters, how big should that be? It could be up to 4 bytes... which means we'd always have to allocate 4 bytes. At that point we might as well use UTF-32!

Of course, we could use UTF-32 as the char representation, but UTF-8 in the string representation, converting as we go.
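To make the size trade-off concrete, a small sketch of my own showing that char is a fixed two-byte UTF-16 code unit, which keeps indexing cheap:

```csharp
using System;

class CharIsUtf16
{
    static void Main()
    {
        // A char is always exactly one UTF-16 code unit: two bytes.
        Console.WriteLine(sizeof(char)); // 2

        // Every BMP character fits in a single char, so string
        // indexing is a constant-time array lookup.
        string s = "héllo";
        Console.WriteLine(s[1]); // é
    }
}
```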
The two disadvantages of UTF-16 are:
- The number of code units per Unicode character is variable, because not all characters are in the BMP. Until emoji became popular, this didn't affect many apps in day-to-day use. These days, certainly for messaging apps and the like, developers using UTF-16 really need to know about surrogate pairs (see the sketch after this list).
- For plain ASCII (which a lot of text is, at least in the west) it takes twice the space of the equivalent UTF-8 encoded text.
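Here is that sketch, a minimal illustration of the surrogate-pair point (the emoji choice is arbitrary):

```csharp
using System;

class SurrogatePairs
{
    static void Main()
    {
        string s = "🙂"; // U+1F642, outside the BMP

        // One user-perceived character, but two UTF-16 code units.
        Console.WriteLine(s.Length);                   // 2
        Console.WriteLine(char.IsSurrogatePair(s, 0)); // True

        // Recombining the pair recovers the real code point.
        Console.WriteLine(char.ConvertToUtf32(s, 0).ToString("X")); // 1F642
    }
}
```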
(As a side note, I believe Windows uses UTF-16 for Unicode data, and it makes sense for .NET to follow suit for interop reasons. That just pushes the question on one step though.)
Given the problems of surrogate pairs, I suspect that if a language/platform were being designed from scratch with no interop requirements (but basing its text handling in Unicode), UTF-16 wouldn't be the best choice. Either UTF-8 (if you want memory efficiency and don't mind some processing complexity in terms of getting to the nth character) or UTF-32 (the other way round) would be a better choice. (Even getting to the nth character has "issues" due to things like different normalization forms. Text is hard...)
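To give a taste of the "getting to the nth character" problem, a sketch (the sample string is mine) contrasting UTF-16 code units with user-perceived characters and normalization forms:

```csharp
using System;
using System.Globalization;
using System.Text;

class NthCharacter
{
    static void Main()
    {
        // "é" written as 'e' followed by U+0301 (combining acute accent).
        string s = "cafe\u0301";

        Console.WriteLine(s.Length); // 5 UTF-16 code units
        Console.WriteLine(new StringInfo(s).LengthInTextElements); // 4 user-perceived characters

        // Normalizing to NFC composes 'e' + U+0301 into a single 'é'.
        Console.WriteLine(s.Normalize(NormalizationForm.FormC).Length); // 4
    }
}
```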
Answered by Hans Passant
As with many "why was this chosen" questions, this was determined by history. Windows became a Unicode operating system at its core in 1993. Back then, Unicode still only had a code space of 65,535 codepoints; the 16-bit encoding of that space is these days called UCS-2. It wasn't until 1996 that Unicode acquired the supplementary planes, extending the coding space to over a million codepoints, and surrogate pairs to fit them into a 16-bit encoding, thus setting the utf-16 standard.
.NET strings are utf-16 because that's an excellent fit with the operating system encoding; no conversion is required.
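As a sketch of what "no conversion" means in practice: the wide (W) Win32 APIs consume UTF-16 directly, so the marshaler can pass a .NET string's code units straight through. MessageBoxW is a real Win32 function; the snippet itself is just my illustration:

```csharp
using System;
using System.Runtime.InteropServices;

class Interop
{
    // CharSet.Unicode tells the marshaler to pass the string's
    // UTF-16 code units as-is: no transcoding on the way in.
    [DllImport("user32.dll", CharSet = CharSet.Unicode)]
    static extern int MessageBoxW(IntPtr hWnd, string text, string caption, uint type);

    static void Main()
    {
        MessageBoxW(IntPtr.Zero, "Hello, UTF-16 world", "Demo", 0);
    }
}
```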
The history of utf-8 is murkier. It definitely postdates Windows NT: UTF-8 was designed in late 1992 and first presented in January 1993, and the current standard, RFC 3629, dates from November 2003. It took a while to gain a foothold; the Internet was instrumental.
Answered by user2457603
UTF-8 is the default for text storage and transfer because it is a relatively compact form for most languages (some languages are more compact in UTF-16 than in UTF-8). Each specific language has an even more efficient encoding of its own (a legacy code page, for example).
UTF-16 is used for in-memory strings because it is faster per character to parse and maps directly to the Unicode character class and other tables. All string functions in Windows use UTF-16 and have for years.
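For example (my own sketch): because every BMP character is a single char, classification is a direct table lookup with no decoding step:

```csharp
using System;

class CharClassification
{
    static void Main()
    {
        // One char == one BMP code point, so category lookup
        // indexes the Unicode tables directly.
        Console.WriteLine(char.GetUnicodeCategory('A'));  // UppercaseLetter
        Console.WriteLine(char.GetUnicodeCategory('好')); // OtherLetter
        Console.WriteLine(char.IsLetter('é'));            // True
    }
}
```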