C# and UTF-16 characters
Original question: http://stackoverflow.com/questions/697055/
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverflow
Asked by Dutow
Is it possible in C# to use a UTF-32 character that is not in Plane 0 as a char?
string s = "<character outside Plane 0>"; // valid
char c = '<character outside Plane 0>'; // generates a compiler error ("Too many characters in character literal")
And in s it is represented by two chars, not one.
Edit: I mean, is there a character and string type with full Unicode support, UTF-32 or UTF-8 per character? For example, if I want a for loop over the UTF-32 characters (possibly outside Plane 0) in a string.
Accepted answer by Emperor XLII
The string class represents a UTF-16 encoded block of text, and each char in a string represents a UTF-16 code value.
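A minimal sketch of what "each char is a UTF-16 code value" means in practice. The code point U+1D11E is an arbitrary choice for illustration and does not come from the original answer; the class name is hypothetical.

```csharp
using System;

static class Utf16UnitsSketch
{
    static void Main()
    {
        // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside Plane 0.
        // It is an arbitrary example, not taken from the original post.
        string s = "\U0001D11E";

        // Length counts UTF-16 code values, so one code point reports 2.
        Console.WriteLine(s.Length);                    // 2

        // Each char is one half of the surrogate pair.
        Console.WriteLine(char.IsHighSurrogate(s[0]));  // True
        Console.WriteLine(char.IsLowSurrogate(s[1]));   // True

        // The two code values, printed in hexadecimal.
        Console.WriteLine("{0:X4} {1:X4}", (int)s[0], (int)s[1]); // D834 DD1E
    }
}
```

Indexing the string therefore never yields the whole character here, only one of its two halves.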
Although there is no BCL type that represents a single Unicode code point, there is support for Unicode characters beyond Plane 0 in the form of method overloads taking a string and an index instead of just a char. For example, the static GetUnicodeCategory(char) method on the System.Globalization.CharUnicodeInfo class has a corresponding GetUnicodeCategory(string, int) method that will recognize a simple character or a surrogate pair starting at the specified index.
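The difference between the two overloads can be sketched as follows. The test character U+10400 is an assumption chosen for illustration (any letter outside Plane 0 would do), and the class name is hypothetical.

```csharp
using System;
using System.Globalization;

static class CategorySketch
{
    static void Main()
    {
        // U+10400 (DESERET CAPITAL LETTER LONG I): an arbitrary letter outside Plane 0.
        string s = "\U00010400";

        // The char overload sees only the high surrogate.
        Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory(s[0]));  // Surrogate

        // The (string, int) overload reads the whole pair starting at index 0.
        Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory(s, 0));  // UppercaseLetter

        // System.Char follows the same pattern.
        Console.WriteLine(char.IsLetter(s[0]));  // False
        Console.WriteLine(char.IsLetter(s, 0));  // True
    }
}
```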
To iterate through the text elements in a string, you can use the methods on the System.Globalization.StringInfo class. Here, a "text element" corresponds to a single character as displayed on screen. This means that simple characters ("a"), combining characters ("a\u0304\u0308" = "ā̈"), and surrogate pairs ("\uD950\uDF21", the code point U+64321) will all be treated as a single text element.
Specifically, the GetTextElementEnumerator static method will allow you to enumerate over each text element in a string (see the linked MSDN page for a code example).
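Since the linked MSDN example is not reproduced here, the following is a minimal sketch of the enumeration, using the three sample strings from the text above. It is an illustration under those assumptions, not the MSDN code; the class name is hypothetical.

```csharp
using System;
using System.Globalization;

static class TextElementSketch
{
    static void Main()
    {
        // The three examples from the text: a simple character, a base character
        // with two combining marks, and a surrogate pair.
        string text = "a" + "a\u0304\u0308" + "\uD950\uDF21";

        Console.WriteLine(text.Length); // 6 UTF-16 code values

        TextElementEnumerator elements = StringInfo.GetTextElementEnumerator(text);
        while (elements.MoveNext())
        {
            string element = elements.GetTextElement();
            // ElementIndex is the position of the element's first char in the string.
            Console.WriteLine("index {0}: {1} char(s)", elements.ElementIndex, element.Length);
        }

        // The element count is also available directly.
        Console.WriteLine(new StringInfo(text).LengthInTextElements); // 3
    }
}
```

The loop visits three elements of 1, 3, and 2 chars respectively, even though the string holds six chars.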
Answer by Joachim Sauer
I only know this problem from Java and checked the documentation on char before answering, and indeed the behavior is pretty much the same in .NET/C# and Java.
It seems that a char is indeed defined to be 16 bits wide and definitely can't hold anything outside of Plane 0. Only String/string is capable of handling those characters. In a char array, such a character will be represented as two surrogate characters.
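To illustrate the char array representation, here is a short sketch that converts a code point to its surrogate pair and back. The code point 0x1F600 is an arbitrary value outside Plane 0 chosen for this example, and the class name is hypothetical.

```csharp
using System;

static class SurrogatePairSketch
{
    static void Main()
    {
        // 0x1F600 is an arbitrary code point outside Plane 0.
        int codePoint = 0x1F600;

        // A single char cannot hold it, so the conversion returns a string.
        string s = char.ConvertFromUtf32(codePoint);
        char[] units = s.ToCharArray();

        Console.WriteLine(units.Length);                              // 2
        Console.WriteLine(char.IsSurrogatePair(units[0], units[1]));  // True

        // Recombine the two surrogates into the original code point.
        int roundTrip = char.ConvertToUtf32(units[0], units[1]);
        Console.WriteLine(roundTrip == codePoint);                    // True
    }
}
```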
Answer by Joachim Sauer
C#'s System.String supports UTF-32 characters just fine, but you can't iterate through the string as if it were an array of System.Char, or use IEnumerable, and expect to get whole characters.
For example:
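// Hypothetical sample (not in the original post): a letter, a digit,
// and U+1D7D8, a digit outside Plane 0.
string sample = "a1\U0001D7D8";

// iterating through a string, NO UTF-32 SUPPORT:
// each char is tested on its own, so surrogate halves are never recognized
for (int i = 0; i < sample.Length; ++i)
{
    if (Char.IsDigit(sample[i]))
    {
        Console.WriteLine("IsDigit");
    }
    else if (Char.IsLetter(sample[i]))
    {
        Console.WriteLine("IsLetter");
    }
}

// iterating through a string WITH UTF-32 SUPPORT:
// the (string, int) overloads read a whole surrogate pair at index i
for (int i = 0; i < sample.Length; ++i)
{
    if (Char.IsDigit(sample, i))
    {
        Console.WriteLine("IsDigit");
    }
    else if (Char.IsLetter(sample, i))
    {
        Console.WriteLine("IsLetter");
    }
    // Skip the low surrogate only when a complete pair starts here,
    // so an unpaired surrogate does not swallow the next char.
    if (Char.IsSurrogatePair(sample, i))
    {
        ++i;
    }
}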
Note the subtle difference in the Char.IsDigit and Char.IsLetter calls, and that String.Length is always the number of 16-bit "characters", not the number of "characters" in the UTF-32 sense.
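Because String.Length counts 16-bit values, counting or looping over characters in the UTF-32 sense needs a small helper. The sketch below is one possible way to do it; the CodePoints helper and the sample string are hypothetical names introduced for illustration, built only on Char.IsSurrogatePair and Char.ConvertToUtf32.

```csharp
using System;
using System.Collections.Generic;

static class CodePointSketch
{
    // Yields each code point in the string as an int (its UTF-32 value).
    // An unpaired surrogate is yielded as its raw 16-bit value instead of throwing.
    public static IEnumerable<int> CodePoints(string s)
    {
        for (int i = 0; i < s.Length; ++i)
        {
            if (char.IsSurrogatePair(s, i))
            {
                yield return char.ConvertToUtf32(s, i);
                ++i; // skip the low surrogate that was just consumed
            }
            else
            {
                yield return s[i];
            }
        }
    }

    static void Main()
    {
        // Hypothetical sample: 'a', U+1D7D8 (outside Plane 0), 'b'.
        string sample = "a\U0001D7D8b";

        int count = 0;
        foreach (int codePoint in CodePoints(sample))
        {
            Console.WriteLine("U+{0:X4}", codePoint);
            ++count;
        }

        // Prints "4 chars, 3 code points".
        Console.WriteLine("{0} chars, {1} code points", sample.Length, count);
    }
}
```

Returning int rather than char is deliberate in this sketch: a char cannot hold a value above 0xFFFF, so the UTF-32 value has to travel in a wider type.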
Off topic, but UTF-32 support is completely unnecessary for an application to handle international languages, unless you have a specific business case for an obscure historical/technical language.