C# and UTF-16 characters

Question

Asked by Dutow

Is it possible in C# to use UTF-32 characters not in Plane 0 as a char?

string s = "𝄞"; // valid (U+1D11E, a character outside Plane 0)
char c = '𝄞'; // generates a compiler error ("Too many characters in character literal")

And in s it is represented by two characters, not one.

Edit: I mean, is there a character and string type with full Unicode support, UTF-32 or UTF-8 per character? For example, if I want a for loop over the UTF-32 (possibly non-Plane 0) characters in a string.

Answer 1

Accepted answer by Emperor XLII

The string class represents a UTF-16 encoded block of text, and each char in a string represents a UTF-16 code unit.

Although there is no BCL type that represents a single Unicode code point, there is support for Unicode characters beyond Plane 0 in the form of method overloads taking a string and an index instead of just a char. For example, the static GetUnicodeCategory(char) method on the System.Globalization.CharUnicodeInfo class has a corresponding GetUnicodeCategory(string, int) method that will recognize a simple character or a surrogate pair starting at the specified index.
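
As a minimal sketch of the difference between the two overloads, the following compares them on a digit outside Plane 0. The class name CategoryDemo and the choice of U+1D7D8 (MATHEMATICAL DOUBLE-STRUCK DIGIT ZERO) are illustrative assumptions, not part of the original answer.

```csharp
using System;
using System.Globalization;

static class CategoryDemo
{
    // Hypothetical sample: U+1D7D8, a decimal digit outside Plane 0,
    // stored in the string as a surrogate pair (two char values).
    public static readonly string Sample = char.ConvertFromUtf32(0x1D7D8);

    // The char overload only sees the first UTF-16 code unit (a high surrogate).
    public static UnicodeCategory ByChar() =>
        CharUnicodeInfo.GetUnicodeCategory(Sample[0]);

    // The (string, int) overload reads the whole surrogate pair at the index.
    public static UnicodeCategory ByIndex() =>
        CharUnicodeInfo.GetUnicodeCategory(Sample, 0);

    static void Main()
    {
        Console.WriteLine(ByChar());
        Console.WriteLine(ByIndex());
    }
}
```

The char overload can only report that it was handed half of a pair, while the string overload can classify the actual code point.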

To iterate through the text elements in a string, you can use the methods on the System.Globalization.StringInfo class. Here, a "text element" corresponds to a single character as displayed on screen. This means that simple characters ("a"), combining characters ("a\u0304\u0308" = "ā̈"), and surrogate pairs ("\uD950\uDF21", which encodes the code point U+64321) will all be treated as a single text element.

Specifically, the GetTextElementEnumerator static method will allow you to enumerate over each text element in a string (see the linked MSDN page for a code example).
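
For readers without the linked page at hand, here is a rough sketch of such an enumeration, using the three sample strings from the paragraph above. The class name TextElementDemo and the helper Split are hypothetical names introduced for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class TextElementDemo
{
    // Hypothetical helper: collects each text element of 'text' as its own string.
    public static List<string> Split(string text)
    {
        var elements = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(text);
        while (e.MoveNext())
        {
            elements.Add(e.GetTextElement());
        }
        return elements;
    }

    static void Main()
    {
        // A simple character, a base letter with two combining marks,
        // and a surrogate pair, concatenated into one string.
        string text = "a" + "a\u0304\u0308" + "\uD950\uDF21";

        foreach (string element in Split(text))
        {
            // Length is measured in UTF-16 code units, so it varies per element.
            Console.WriteLine("{0} UTF-16 code unit(s)", element.Length);
        }
    }
}
```

Each string returned by GetTextElement may contain several char values, which is why the result is a list of strings rather than a list of char.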

Answer 2

Answer by Joachim Sauer

I only know this problem from Java and checked the documentation on char before answering, and indeed the behavior is pretty much the same in .NET/C# and Java.

It seems that a char is indeed defined to be 16 bits and definitely can't hold anything outside of Plane 0. Only String/string is capable of handling those characters. In a char array, such a character is represented as two surrogate characters.
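
A small sketch of what that looks like in practice, assuming U+1D11E (MUSICAL SYMBOL G CLEF) as the example character; the class name SurrogateDemo and the helper Units are hypothetical.

```csharp
using System;

static class SurrogateDemo
{
    // Hypothetical helper: the UTF-16 code units that encode one code point.
    public static char[] Units(int codePoint) =>
        char.ConvertFromUtf32(codePoint).ToCharArray();

    static void Main()
    {
        char[] units = Units(0x1D11E); // U+1D11E, outside Plane 0

        Console.WriteLine(units.Length);                    // number of char values used
        Console.WriteLine(char.IsHighSurrogate(units[0]));  // first half of the pair
        Console.WriteLine(char.IsLowSurrogate(units[1]));   // second half of the pair

        // Combining both halves recovers the original code point.
        int codePoint = char.ConvertToUtf32(units[0], units[1]);
        Console.WriteLine("U+{0:X}", codePoint);
    }
}
```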

Answer 3

Answer by Joachim Sauer

C# System.String handles characters outside Plane 0 just fine (it stores them as surrogate pairs), but you can't iterate through the string as if it were an array of System.Char, or use IEnumerable, and expect one element per character.

For example:

// Iterating char by char: NO support for characters outside Plane 0
for (int i = 0; i < sample.Length; ++i)
{
    if (Char.IsDigit(sample[i]))
    {
        Console.WriteLine("IsDigit");
    }
    else if (Char.IsLetter(sample[i]))
    {
        Console.WriteLine("IsLetter");
    }
}

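// Iterating with the (string, index) overloads: characters outside Plane 0 ARE supported
for (int i = 0; i < sample.Length; ++i)
{
    if (Char.IsDigit(sample, i))
    {
        Console.WriteLine("IsDigit");
    }
    else if (Char.IsLetter(sample, i))
    {
        Console.WriteLine("IsLetter");
    }

    // A surrogate pair occupies two char positions, so skip its second half.
    // IsSurrogatePair (rather than IsSurrogate) avoids skipping the char
    // that follows an unpaired surrogate.
    if (Char.IsSurrogatePair(sample, i))
    {
        ++i;
    }
}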

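Note the subtle difference in the Char.IsDigit and Char.IsLetter calls: the second loop passes the string and an index rather than a single char. Also note that String.Length is always the number of 16-bit "characters" (UTF-16 code units), not the number of "characters" in the UTF-32 sense.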
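
To make the Length point concrete, and to get something close to the "for loop over UTF-32 characters" the question asks about, one possible sketch is shown below. The class name CodePointDemo and its helpers CodePoints and Count are hypothetical names for this illustration, not BCL APIs; only char.IsSurrogatePair and char.ConvertToUtf32 come from the framework.

```csharp
using System;
using System.Collections.Generic;

static class CodePointDemo
{
    // Hypothetical helper: yields each code point of 'text' as an int (UTF-32 value).
    public static IEnumerable<int> CodePoints(string text)
    {
        for (int i = 0; i < text.Length; ++i)
        {
            if (char.IsSurrogatePair(text, i))
            {
                yield return char.ConvertToUtf32(text, i);
                ++i; // skip the low surrogate that was just consumed
            }
            else
            {
                yield return text[i]; // Plane 0 character (or an unpaired surrogate)
            }
        }
    }

    // Hypothetical helper: number of code points, as opposed to String.Length.
    public static int Count(string text)
    {
        int n = 0;
        foreach (int cp in CodePoints(text)) ++n;
        return n;
    }

    static void Main()
    {
        // 'A', U+1D11E (outside Plane 0), '9'
        string sample = "A" + char.ConvertFromUtf32(0x1D11E) + "9";

        Console.WriteLine("Length: {0}", sample.Length);
        Console.WriteLine("Code points: {0}", Count(sample));
        foreach (int cp in CodePoints(sample))
        {
            Console.WriteLine("U+{0:X4}", cp);
        }
    }
}
```

Returning int values sidesteps the absence of a BCL code point type mentioned in the accepted answer; unpaired surrogates are passed through unchanged rather than rejected, which is a design choice of this sketch.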

Off topic, but UTF-32 support is completely unnecessary for an application to handle international languages, unless you have a specific business case for an obscure historical/technical language.
