C#: Why does .NET use the UTF-16 encoding for string, but UTF-8 as the default for saving files?
Disclaimer: this page is a translation of a popular StackOverFlow question, published under the CC BY-SA 4.0 license. If you use it, you must likewise comply with the CC BY-SA license, include the original link and author information, and attribute it to the original authors (not me): StackOverFlow
Original question: http://stackoverflow.com/questions/14942092/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me):
StackOverFlow
Why does .NET use the UTF-16 encoding for string, but UTF-8 as the default for saving files?
Asked by Royi Namir
Essentially, string uses the UTF-16 character encoding form
But when saving via StreamWriter:

This constructor creates a StreamWriter with UTF-8 encoding without a Byte-Order Mark (BOM).
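For instance, a minimal sketch (the file name is just an illustration) showing that this StreamWriter overload produces UTF-8 bytes with no BOM:

```csharp
using System;
using System.IO;

class StreamWriterDefault
{
    static void Main()
    {
        // This overload defaults to UTF-8 without a byte-order mark.
        using (var writer = new StreamWriter("demo.txt"))
        {
            writer.Write("héllo");
        }

        // 'é' (U+00E9) becomes the two bytes C3 A9 in UTF-8,
        // and there is no EF BB BF BOM at the start of the file.
        byte[] bytes = File.ReadAllBytes("demo.txt");
        Console.WriteLine(BitConverter.ToString(bytes)); // 68-C3-A9-6C-6C-6F
    }
}
```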
I've seen this sample (broken link removed):
And it looks like utf8 is smaller for some strings, while utf-16 is smaller for some other strings.
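You can measure this directly with Encoding.GetByteCount; a quick sketch (the sample strings are my own):

```csharp
using System;
using System.Text;

class SizeComparison
{
    static void Compare(string s)
    {
        int utf8 = Encoding.UTF8.GetByteCount(s);
        int utf16 = Encoding.Unicode.GetByteCount(s); // Encoding.Unicode is UTF-16LE

        Console.WriteLine($"\"{s}\": utf-8 = {utf8} bytes, utf-16 = {utf16} bytes");
    }

    static void Main()
    {
        Compare("Hello, world"); // ASCII text: 1 byte/char in UTF-8, 2 in UTF-16
        Compare("你好,世界");     // CJK text: 3 bytes/char in UTF-8, 2 in UTF-16
    }
}
```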
- So why does .NET use utf16 as the default encoding for string, while using utf8 for saving files?
Thank you.
P.S. I've already read the famous article.
Accepted answer by Jon Skeet
If you're happy ignoring surrogate pairs (or equivalently, the possibility of your app needing characters outside the Basic Multilingual Plane), UTF-16 has some nice properties, basically due to always requiring two bytes per code unit and representing all BMP characters in a single code unit each.
Consider the primitive type char. If we use UTF-8 as the in-memory representation and want to cope with all Unicode characters, how big should that be? It could be up to 4 bytes... which means we'd always have to allocate 4 bytes. At that point we might as well use UTF-32!

Of course, we could use UTF-32 as the char representation, but UTF-8 in the string representation, converting as we go.
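To make the size trade-off concrete, a small sketch of my own showing that char is a fixed two-byte UTF-16 code unit, which keeps indexing cheap:

```csharp
using System;

class CharIsUtf16
{
    static void Main()
    {
        // A char is always exactly one UTF-16 code unit: two bytes.
        Console.WriteLine(sizeof(char)); // 2

        // Every BMP character fits in a single char, so string
        // indexing is a constant-time array lookup.
        string s = "héllo";
        Console.WriteLine(s[1]); // é
    }
}
```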
The two disadvantages of UTF-16 are:
- The number of code units per Unicode character is variable, because not all characters are in the BMP. Until emoji became popular, this didn't affect many apps in day-to-day use. These days, certainly for messaging apps and the like, developers using UTF-16 really need to know about surrogate pairs (see the sketch after this list).
- For plain ASCII (which a lot of text is, at least in the west) it takes twice the space of the equivalent UTF-8 encoded text.
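Here is that sketch, a minimal illustration of the surrogate-pair point (the emoji choice is arbitrary):

```csharp
using System;

class SurrogatePairs
{
    static void Main()
    {
        string s = "🙂"; // U+1F642, outside the BMP

        // One user-perceived character, but two UTF-16 code units.
        Console.WriteLine(s.Length);                   // 2
        Console.WriteLine(char.IsSurrogatePair(s, 0)); // True

        // Recombining the pair recovers the real code point.
        Console.WriteLine(char.ConvertToUtf32(s, 0).ToString("X")); // 1F642
    }
}
```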
(As a side note, I believe Windows uses UTF-16 for Unicode data, and it makes sense for .NET to follow suit for interop reasons. That just pushes the question on one step though.)
Given the problems of surrogate pairs, I suspect that if a language/platform were being designed from scratch with no interop requirements (but basing its text handling in Unicode), UTF-16 wouldn't be the best choice. Either UTF-8 (if you want memory efficiency and don't mind some processing complexity in terms of getting to the nth character) or UTF-32 (the other way round) would be a better choice. (Even getting to the nth character has "issues" due to things like different normalization forms. Text is hard...)
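To give a taste of the "getting to the nth character" problem, a sketch (the sample string is mine) contrasting UTF-16 code units with user-perceived characters and normalization forms:

```csharp
using System;
using System.Globalization;
using System.Text;

class NthCharacter
{
    static void Main()
    {
        // "é" written as 'e' followed by U+0301 (combining acute accent).
        string s = "cafe\u0301";

        Console.WriteLine(s.Length); // 5 UTF-16 code units
        Console.WriteLine(new StringInfo(s).LengthInTextElements); // 4 user-perceived characters

        // Normalizing to NFC composes 'e' + U+0301 into a single 'é'.
        Console.WriteLine(s.Normalize(NormalizationForm.FormC).Length); // 4
    }
}
```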
Answered by Hans Passant
As with many "why was this chosen" questions, this was determined by history. Windows became a Unicode operating system at its core in 1993. Back then, Unicode still only had a code space of 65,535 codepoints; the 16-bit encoding of that space is these days called UCS-2. It wasn't until 1996 that Unicode acquired the supplementary planes, extending the coding space to over a million codepoints, and surrogate pairs to fit them into a 16-bit encoding, thus setting the utf-16 standard.
.NET strings are utf-16 because that's an excellent fit with the operating system encoding; no conversion is required.
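As a sketch of what "no conversion" means in practice: the wide (W) Win32 APIs consume UTF-16 directly, so the marshaler can pass a .NET string's code units straight through. MessageBoxW is a real Win32 function; the snippet itself is just my illustration:

```csharp
using System;
using System.Runtime.InteropServices;

class Interop
{
    // CharSet.Unicode tells the marshaler to pass the string's
    // UTF-16 code units as-is: no transcoding on the way in.
    [DllImport("user32.dll", CharSet = CharSet.Unicode)]
    static extern int MessageBoxW(IntPtr hWnd, string text, string caption, uint type);

    static void Main()
    {
        MessageBoxW(IntPtr.Zero, "Hello, UTF-16 world", "Demo", 0);
    }
}
```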
The history of utf-8 is murkier. It definitely postdates Windows NT: UTF-8 was designed in late 1992 and first presented in January 1993, and the current standard, RFC 3629, dates from November 2003. It took a while to gain a foothold; the Internet was instrumental.
Answered by user2457603
UTF-8 is the default for text storage and transfer because it is a relatively compact form for most languages (some languages are more compact in UTF-16 than in UTF-8). Each specific language has an even more efficient encoding of its own (a legacy code page, for example).
UTF-16 is used for in-memory strings because it is faster per character to parse and maps directly to the Unicode character class and other tables. All string functions in Windows use UTF-16 and have for years.
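For example (my own sketch): because every BMP character is a single char, classification is a direct table lookup with no decoding step:

```csharp
using System;

class CharClassification
{
    static void Main()
    {
        // One char == one BMP code point, so category lookup
        // indexes the Unicode tables directly.
        Console.WriteLine(char.GetUnicodeCategory('A'));  // UppercaseLetter
        Console.WriteLine(char.GetUnicodeCategory('好')); // OtherLetter
        Console.WriteLine(char.IsLetter('é'));            // True
    }
}
```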