What does .NET's String.Normalize do?

Question

Asked by GeReV

The MSDN article on String.Normalize states simply:

Returns a new string whose binary representation is in a particular Unicode normalization form.

It also sometimes refers to a "Unicode normalization form C."

I'm just wondering, what does that mean? How is this function useful in real-life situations?

Answer 1

Accepted answer by Oded

It makes sure that Unicode strings can be compared for equality, even when the same text is represented by different sequences of code points.

From Unicode Standard Annex #15:

Essentially, the Unicode Normalization Algorithm puts all combining marks in a specified order, and uses rules for decomposition and composition to transform each string into one of the Unicode Normalization Forms. A binary comparison of the transformed strings will then determine equivalence.
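
As a rough illustration of the comparison described above, here is a minimal sketch. The class name and the CanonicallyEqual helper are made up for this example and are not part of .NET; only string.Normalize, NormalizationForm and string.Equals are real framework members. It assumes both inputs are valid Unicode strings.

```csharp
using System;
using System.Text;

public static class NormalizationEqualityDemo
{
    // Hypothetical helper: true when both strings are canonically equivalent.
    public static bool CanonicallyEqual(string left, string right)
    {
        // Bring both sides to the same normalization form (FormC here),
        // then compare the resulting UTF-16 code units ordinally.
        return string.Equals(
            left.Normalize(NormalizationForm.FormC),
            right.Normalize(NormalizationForm.FormC),
            StringComparison.Ordinal);
    }

    public static void Main()
    {
        string precomposed = "\u00E0";  // "à" as a single code point
        string decomposed = "a\u0300";  // "a" followed by a combining grave accent

        // The raw strings differ, so an ordinal comparison reports False.
        Console.WriteLine(string.Equals(precomposed, decomposed, StringComparison.Ordinal));

        // After normalization both are the same sequence, so this reports True.
        Console.WriteLine(CanonicallyEqual(precomposed, decomposed));
    }
}
```

Either form would work for the comparison as long as both sides use the same one; FormC is picked here only because it is the default of the parameterless Normalize().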

Answer 2

Answered by Hans Ke?ing

One difference between form C and form D is how letters with accents are represented: form C uses a single letter-with-accent codepoint, while form D separates that into a letter and an accent.

For instance, an "à" can be codepoint 224 ("Latin small letter A with grave"), or codepoint 97 ("Latin small letter A") followed by codepoint 768 ("Combining grave accent").
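
To see the two representations side by side, the following sketch prints the code units of "à" after normalizing to form C and to form D. The class name and the Describe helper are invented for this illustration; the combining grave accent is U+0300 (decimal 768) and the precomposed letter is U+00E0 (decimal 224).

```csharp
using System;
using System.Linq;
using System.Text;

public static class CodePointDemo
{
    // Hypothetical helper: lists the UTF-16 code units of a string as U+XXXX.
    public static string Describe(string text)
    {
        return string.Join(" ", text.Select(c => "U+" + ((int)c).ToString("X4")));
    }

    public static void Main()
    {
        string composed = "\u00E0".Normalize(NormalizationForm.FormC);
        string decomposed = "\u00E0".Normalize(NormalizationForm.FormD);

        Console.WriteLine(Describe(composed));    // U+00E0
        Console.WriteLine(Describe(decomposed));  // U+0061 U+0300
    }
}
```

Note that Describe works on UTF-16 code units, which is enough for this example but would show surrogate pairs separately for characters outside the Basic Multilingual Plane.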

A side-effect is that this makes it possible to easily create a "remove accents" method.

using System.Globalization;
using System.Linq;
using System.Text;

public static string RemoveAccents(string input)
{
    // Normalizing to FormD splits each accented letter into a base letter
    // followed by its combining marks. The filter then drops those marks
    // (and any other non-spacing marks).
    return new string(input
        .Normalize(NormalizationForm.FormD)
        .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
        .ToArray());
}

Answer 3

Answered by devio

In Unicode, a (composed) character can be represented either by a single code point of its own, or by a sequence of code points consisting of the base character and its accents.

Wikipedia lists as an example the Vietnamese ế (U+1EBF) and its decomposed sequence U+0065 (e) U+0302 (circumflex accent) U+0301 (acute accent).

string.Normalize() converts a string into any of the four Unicode normalization forms (C, D, KC and KD).
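
The sketch below loops over the four values of the System.Text.NormalizationForm enum and reports how many UTF-16 code units a string occupies in each form. The class name and the NormalizedLength helper are hypothetical. The "ﬁ" ligature (U+FB01) is an extra sample chosen here, not taken from the answer, to show where the compatibility forms KC and KD differ from C and D.

```csharp
using System;
using System.Text;

public static class FourFormsDemo
{
    // Hypothetical helper: length in UTF-16 code units after normalization.
    public static int NormalizedLength(string text, NormalizationForm form)
    {
        return text.Normalize(form).Length;
    }

    public static void Main()
    {
        string vietnameseE = "\u1EBF";  // ế: e with circumflex and acute
        string ligature = "\uFB01";     // ﬁ ligature, a compatibility character

        foreach (NormalizationForm form in Enum.GetValues(typeof(NormalizationForm)))
        {
            Console.WriteLine(
                $"{form}: {NormalizedLength(vietnameseE, form)} / {NormalizedLength(ligature, form)}");
        }
    }
}
```

The canonical forms (C and D) leave the ligature alone, while the compatibility forms (KC and KD) replace it with the two letters "f" and "i". The ế stays one code unit in C and KC and expands to three in D and KD.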

Answer 4

Answered by Adam Houldsworth

This link has a good explanation:

http://unicode.org/reports/tr15/#Norm_Forms

From what I can surmise, it's so you can compare two Unicode strings for equality.
