C# and UTF-16 characters

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/697055/

Date: 2020-08-04 13:52:01  Source: igfitidea

c# unicode

Asked by Dutow

Is it possible in C# to use UTF-32 characters not in Plane 0 as a char?

string s = "𝄞"; // valid (𝄞, U+1D11E, stands in here for any character outside Plane 0)
char c = '𝄞'; // generates a compiler error ("Too many characters in character literal")

And in s it is represented by two characters, not one.

Edit: I mean, is there a character and string type with full Unicode support, UTF-32 or UTF-8 per character? For example, if I want a for loop over the UTF-32 characters (possibly outside Plane 0) in a string.

Accepted answer by Emperor XLII

The string class represents a UTF-16 encoded block of text, and each char in a string represents a UTF-16 code value.
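
To make the "UTF-16 code value" point concrete, here is a minimal sketch that prints the individual char values of a string holding one character outside Plane 0. The class and method names (CodeUnitDemo, DescribeCodeUnits) are made up for illustration, and U+1D11E is just a sample character.

```csharp
using System;

public static class CodeUnitDemo
{
    // Returns the UTF-16 code values of a string as hex, separated by spaces.
    public static string DescribeCodeUnits(string text)
    {
        string[] parts = new string[text.Length];
        for (int i = 0; i < text.Length; ++i)
        {
            parts[i] = ((int)text[i]).ToString("X4");
        }
        return string.Join(" ", parts);
    }

    public static void Main()
    {
        string s = "\U0001D11E"; // sample character outside Plane 0 (assumption)
        Console.WriteLine(s.Length);             // 2
        Console.WriteLine(DescribeCodeUnits(s)); // D834 DD1E
    }
}
```

A single character on screen therefore occupies two char slots in the string.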

Although there is no BCL type that represents a single Unicode code point, there is support for Unicode characters beyond Plane 0 in the form of method overloads taking a string and an index instead of just a char. For example, the static GetUnicodeCategory(char) method on the System.Globalization.CharUnicodeInfo class has a corresponding GetUnicodeCategory(string,int) method that will recognize a simple character or a surrogate pair starting at the specified index.
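
As a rough illustration of the two overloads, the sketch below compares them on a string holding a single character beyond Plane 0. The wrapper CategoryAt and the class name CategoryDemo are hypothetical, and U+10400 is only a sample input.

```csharp
using System;
using System.Globalization;

public static class CategoryDemo
{
    // Category of the character (simple char or surrogate pair) starting at index.
    public static UnicodeCategory CategoryAt(string text, int index)
    {
        return CharUnicodeInfo.GetUnicodeCategory(text, index);
    }

    public static void Main()
    {
        string s = "\U00010400"; // sample uppercase letter outside Plane 0 (assumption)

        // The char overload sees only the first half of the surrogate pair.
        Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory(s[0])); // Surrogate

        // The (string, int) overload reads the whole pair.
        Console.WriteLine(CategoryAt(s, 0)); // UppercaseLetter
    }
}
```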

To iterate through the text elements in a string, you can use the methods on the System.Globalization.StringInfo class. Here, a "text element" corresponds to a single character as displayed on screen. This means that simple characters ("a"), combining characters ("a\u0304\u0308" = "ā̈"), and surrogate pairs ("\uD950\uDF21", which encodes the single code point U+64321) will all be treated as a single text element.

Specifically, the static GetTextElementEnumerator method will allow you to enumerate over each text element in a string (see the linked MSDN page for a code example).
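
A minimal sketch of such an enumeration follows, using the three sample strings from this answer. The helper TextElements and the class TextElementDemo are illustrative names, not part of the BCL.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElementDemo
{
    // Collects the text elements of a string, one list entry per element.
    public static List<string> TextElements(string text)
    {
        var elements = new List<string>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(text);
        while (enumerator.MoveNext())
        {
            elements.Add(enumerator.GetTextElement());
        }
        return elements;
    }

    public static void Main()
    {
        // Simple character, combining sequence, surrogate pair.
        string s = "a" + "a\u0304\u0308" + "\uD950\uDF21";
        foreach (string element in TextElements(s))
        {
            // Prints how many chars each text element occupies.
            Console.WriteLine(element.Length);
        }
    }
}
```

For this input the string has six chars but should enumerate as three text elements, occupying one, three, and two chars respectively.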

Answered by Joachim Sauer

I only know this problem from Java and checked the documentation on char before answering, and indeed the behavior is pretty much the same in .NET/C# and Java.

It seems that indeed a char is defined to be 16 bits and definitely can't hold anything outside of Plane 0. Only String/string is capable of handling those characters. In a char array it will be represented as two surrogate characters.
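
The two-surrogate representation can be checked with a short sketch like the one below. The names SurrogateDemo and ToUtf16 are hypothetical, and U+1D11E is only a sample code point.

```csharp
using System;

public static class SurrogateDemo
{
    // Converts one code point to its UTF-16 chars:
    // one char inside Plane 0, two surrogate chars outside it.
    public static char[] ToUtf16(int codePoint)
    {
        return char.ConvertFromUtf32(codePoint).ToCharArray();
    }

    public static void Main()
    {
        char[] units = ToUtf16(0x1D11E); // sample code point outside Plane 0
        Console.WriteLine(units.Length);                   // 2
        Console.WriteLine(char.IsHighSurrogate(units[0])); // True
        Console.WriteLine(char.IsLowSurrogate(units[1]));  // True

        // The pair converts back to the original code point.
        Console.WriteLine(char.ConvertToUtf32(units[0], units[1]) == 0x1D11E); // True
    }
}
```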

Answered by Joachim Sauer

C#'s System.String supports UTF-32 characters (code points beyond Plane 0) just fine, but you can't get whole characters by iterating through the string as if it were an array of System.Char, or by using IEnumerable.

For example:

// Iterating through a string with NO UTF-32 support:
// each 16-bit char is tested on its own
for (int i = 0; i < sample.Length; ++i)
{
    if (Char.IsDigit(sample[i]))
    {
        Console.WriteLine("IsDigit");
    }
    else if (Char.IsLetter(sample[i]))
    {
        Console.WriteLine("IsLetter");
    }
}

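// Iterating through a string WITH UTF-32 support:
// the (string, int) overloads recognize a surrogate pair starting at index i
for (int i = 0; i < sample.Length; ++i)
{
    if (Char.IsDigit(sample, i))
    {
        Console.WriteLine("IsDigit");
    }
    else if (Char.IsLetter(sample, i))
    {
        Console.WriteLine("IsLetter");
    }

    // Skip the low surrogate so that a pair is processed only once.
    // IsSurrogatePair (rather than IsSurrogate) avoids skipping the
    // next character after a lone, unpaired surrogate.
    if (Char.IsSurrogatePair(sample, i))
    {
        ++i;
    }
}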

Note the subtle difference in the Char.IsDigit and Char.IsLetter calls: the second loop passes the string and an index rather than a single char. Also note that String.Length is always the number of 16-bit "characters", not the number of "characters" in the UTF-32 sense.
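
To see the difference between String.Length and the UTF-32 count, one option is a small iterator that yields whole code points. This is only a sketch: CodePointDemo and CodePoints are made-up names, and yielding a lone surrogate as its own value is a simplifying assumption.

```csharp
using System;
using System.Collections.Generic;

public static class CodePointDemo
{
    // Yields each code point of text as an int, consuming surrogate pairs whole.
    // A lone (unpaired) surrogate is yielded as its own 16-bit value.
    public static IEnumerable<int> CodePoints(string text)
    {
        for (int i = 0; i < text.Length; ++i)
        {
            if (char.IsSurrogatePair(text, i))
            {
                yield return char.ConvertToUtf32(text, i);
                ++i; // skip the low surrogate
            }
            else
            {
                yield return text[i];
            }
        }
    }

    public static void Main()
    {
        string sample = "a1\U0001D11E"; // two Plane 0 chars plus one character outside Plane 0
        int count = 0;
        foreach (int codePoint in CodePoints(sample))
        {
            Console.WriteLine("U+" + codePoint.ToString("X4"));
            ++count;
        }
        Console.WriteLine(sample.Length); // 4 (16-bit chars)
        Console.WriteLine(count);         // 3 (code points)
    }
}
```

On .NET Core 3.0 and later, string.EnumerateRunes() and the System.Text.Rune type provide a built-in way to do similar enumeration.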

Off topic, but UTF-32 support is completely unnecessary for an application to handle international languages, unless you have a specific business case for an obscure historical/technical language.
