What does .NET's String.Normalize do?

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/3288114/

Date: 2020-09-03 14:32:01  Source: igfitidea

.net string

Asked by GeReV

The MSDN article on String.Normalize states simply:

Returns a new string whose binary representation is in a particular Unicode normalization form.

It also sometimes refers to "Unicode normalization form C."

I'm just wondering, what does that mean? How is this function useful in real-life situations?

Accepted answer by Oded

It makes sure that Unicode strings can be compared for equality (even if they use different code point sequences to represent the same characters).

From Unicode Standard Annex #15:

Essentially, the Unicode Normalization Algorithm puts all combining marks in a specified order, and uses rules for decomposition and composition to transform each string into one of the Unicode Normalization Forms. A binary comparison of the transformed strings will then determine equivalence.
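
As a rough illustration of the comparison described above, the following sketch builds "é" in two ways and compares the strings before and after normalization. It is written as C# top-level statements (C# 9 or later), and the variable names are made up for this example.

```csharp
using System;
using System.Text;

// Two spellings of "é": the precomposed U+00E9, and "e" followed by
// U+0301 (combining acute accent).
string composed = "\u00E9";
string decomposed = "e\u0301";

// An ordinal comparison looks at the UTF-16 code units, so the strings differ.
bool equalBefore = string.Equals(composed, decomposed, StringComparison.Ordinal);

// After normalizing both to the same form, the code units match.
bool equalAfter = string.Equals(
    composed.Normalize(NormalizationForm.FormC),
    decomposed.Normalize(NormalizationForm.FormC),
    StringComparison.Ordinal);

Console.WriteLine($"Before: {equalBefore}, after: {equalAfter}");
// Before: False, after: True
```

Any of the normalization forms would work here, as long as both strings are normalized to the same one.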

Answer by Hans Kesting

One difference between form C and form D is how letters with accents are represented: form C uses a single letter-with-accent codepoint, while form D separates that into a letter and an accent.

For instance, an "à" can be codepoint 224 ("Latin small letter A with grave"), or codepoint 97 ("Latin small letter A") followed by codepoint 768 ("Combining grave accent").
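
To see the two representations directly, a small sketch can print the code units of each form. The `CodeUnits` helper is hypothetical, introduced only for this illustration, and the snippet assumes C# top-level statements.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper: lists the UTF-16 code units of a string as decimal numbers.
static string CodeUnits(string s) => string.Join(" ", s.Select(c => (int)c));

string precomposed = "\u00E0"; // "à" as a single code point

// Form D decomposes the letter; form C composes it again.
string formD = precomposed.Normalize(NormalizationForm.FormD);
string formC = formD.Normalize(NormalizationForm.FormC);

Console.WriteLine(CodeUnits(formD)); // 97 768
Console.WriteLine(CodeUnits(formC)); // 224
```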

A side-effect is that this makes it possible to easily create a "remove accents" method.

using System.Globalization;
using System.Linq;

public static string RemoveAccents(string input)
{
    // Normalizing to FormD splits each accented letter into a base letter
    // followed by its combining accents.
    // The Where filter then removes those accents (and any other
    // non-spacing marks).
    return new string(input
        .Normalize(System.Text.NormalizationForm.FormD)
        .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
        .ToArray());
}
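
A possible usage sketch follows, assuming C# top-level statements; the method is repeated as a local function so the snippet compiles on its own, and the sample strings are my own. One caveat it shows: a letter such as "ø" is a character in its own right with no decomposition, so it passes through unchanged.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

// Same logic as the RemoveAccents method above, as a local function.
static string RemoveAccents(string input) =>
    new string(input
        .Normalize(NormalizationForm.FormD)
        .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
        .ToArray());

string dessert = RemoveAccents("cr\u00E8me br\u00FBl\u00E9e"); // input: "crème brûlée"
string city = RemoveAccents("K\u00F8benhavn");                  // input: "København"

Console.WriteLine(dessert); // creme brulee
Console.WriteLine(city);    // "ø" (U+00F8) does not decompose, so it is kept
```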

Answer by devio

In Unicode, a (composed) character can be represented either by a single code point or by a sequence of code points consisting of the base character and its accents.

Wikipedia lists as an example the Vietnamese ế (U+1EBF) and its decomposed sequence U+0065 (e) U+0302 (circumflex accent) U+0301 (acute accent).

string.Normalize() converts a string between the four Unicode normalization forms (C, D, KC, and KD).
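
The effect of each form can be checked with a short loop over the `NormalizationForm` enum. This is only a sketch, assuming C# top-level statements; the "ﬁ" ligature (U+FB01) is my own addition, chosen because only the compatibility forms (KC and KD) change it.

```csharp
using System;
using System.Text;

string vietnamese = "\u1EBF"; // "ế" as a single code point
string ligature = "\uFB01";   // "ﬁ" ligature, a compatibility character

// Print how many UTF-16 code units each string has in each form.
foreach (NormalizationForm form in Enum.GetValues(typeof(NormalizationForm)))
{
    int vietnameseLength = vietnamese.Normalize(form).Length;
    int ligatureLength = ligature.Normalize(form).Length;
    Console.WriteLine($"{form}: U+1EBF -> {vietnameseLength}, U+FB01 -> {ligatureLength}");
}
// FormC: U+1EBF -> 1, U+FB01 -> 1
// FormD: U+1EBF -> 3, U+FB01 -> 1
// FormKC: U+1EBF -> 1, U+FB01 -> 2
// FormKD: U+1EBF -> 3, U+FB01 -> 2
```

Forms C and KC compose characters where possible, forms D and KD decompose them, and the K variants additionally replace compatibility characters such as the ligature with their plain equivalents ("fi").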

Answer by Adam Houldsworth

This link has a good explanation:

http://unicode.org/reports/tr15/#Norm_Forms

From what I can surmise, it's so you can compare two Unicode strings for equality.