What are the best practices for handling Unicode strings in C#?
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Original source: http://stackoverflow.com/questions/144397/
Asked by Vijesh VP
Can somebody please point out the important aspects I should be aware of when handling Unicode strings in C#?
Answer by Chris Wenham
C# (and .NET in general) handles Unicode strings transparently, and you won't have to do anything special unless your application needs to read or write files with specific encodings. In those cases, you can convert managed strings to byte arrays in the encoding of your choice by using the Encoding classes in the System.Text namespace.
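As a minimal sketch of the conversion described above, the following uses the standard System.Text.Encoding class. The class name EncodingRoundTrip and its method names are hypothetical, chosen only for illustration.

```csharp
using System;
using System.Text;

// Hypothetical helper class, named for illustration only.
public static class EncodingRoundTrip
{
    // Encode a managed (UTF-16) string into UTF-8 bytes.
    public static byte[] ToUtf8(string text)
    {
        return Encoding.UTF8.GetBytes(text);
    }

    // Decode UTF-8 bytes back into a managed string.
    public static string FromUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // Encode with any other encoding, e.g. Encoding.ASCII or Encoding.Unicode.
    public static byte[] ToBytes(string text, Encoding encoding)
    {
        return encoding.GetBytes(text);
    }
}
```

For example, EncodingRoundTrip.ToUtf8("\u00E9") yields two bytes (0xC3 0xA9), while encoding the same string with Encoding.ASCII substitutes a '?' because ASCII cannot represent that character.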
Answer by JacquesB
Only think about encoding when reading and writing streams. Use TextReader and TextWriter implementations (such as StreamReader and StreamWriter) to read and write text in different encodings. Always use UTF-8 if you have a choice.
Don't get confused by languages and cultures - that's a completely separate issue from Unicode.
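To illustrate the point about choosing the encoding at the stream boundary, here is a small sketch using StreamWriter and StreamReader over an in-memory stream. The class and method names (StreamText, WriteText, ReadText) are assumptions made for this example; a real application would typically wrap a FileStream or NetworkStream instead.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper class, named for illustration only.
public static class StreamText
{
    // Write text to a stream in the given encoding and return the raw bytes.
    public static byte[] WriteText(string text, Encoding encoding)
    {
        using (var stream = new MemoryStream())
        {
            // Disposing the writer flushes it (and closes the stream).
            using (var writer = new StreamWriter(stream, encoding))
            {
                writer.Write(text);
            }
            // ToArray still works after the MemoryStream has been closed.
            return stream.ToArray();
        }
    }

    // Read all text from raw bytes, decoding with the given encoding.
    public static string ReadText(byte[] data, Encoding encoding)
    {
        using (var reader = new StreamReader(new MemoryStream(data), encoding))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Passing new UTF8Encoding(false) writes UTF-8 without a byte order mark, which some non-.NET consumers prefer; Encoding.UTF8 writes one.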
Answer by Matt Howells
.NET has relatively good i18n support. You don't really need to think about Unicode that much, as all .NET strings and built-in string functions do the right thing with Unicode. The only thing to bear in mind is that most formatting functions, for example DateTime.ToString(), default to the current thread's culture, which in turn defaults to the Windows culture. You can specify a different culture for formatting either on the current thread or on each method call.
The only time Unicode is an issue is when encoding/decoding strings to and from bytes.
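The per-call option mentioned above can be sketched as follows. The class and method names are hypothetical; the calls themselves (ToString overloads that take a format string and an IFormatProvider, and CultureInfo.InvariantCulture) are standard .NET.

```csharp
using System;
using System.Globalization;

// Hypothetical helper class, named for illustration only.
public static class CultureFormatting
{
    // Format a number with two decimals using an explicit culture
    // instead of whatever culture the current thread happens to have.
    public static string FormatAmount(double value, CultureInfo culture)
    {
        return value.ToString("N2", culture);
    }

    // Produce a culture-independent timestamp, e.g. for logs or file names.
    public static string ToSortableTimestamp(DateTime time)
    {
        return time.ToString("yyyy-MM-dd'T'HH:mm:ss", CultureInfo.InvariantCulture);
    }
}
```

Calling FormatAmount with CultureInfo.InvariantCulture and with a culture such as new CultureInfo("de-DE") generally gives different group and decimal separators for the same value, which is why data meant for machines is usually formatted with the invariant culture.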
Answer by Aaron
Keep in mind that C# strings are sequences of Char values, which are UTF-16 code units. They are not Unicode code points. Some Unicode code points require two Chars (a surrogate pair), and you should not split a string between those two Chars.
In addition, Unicode code points may combine to form a single language 'character': for instance, a 'u' followed by a combining umlaut. So you can't split strings between arbitrary code points either.
Basically, it's a mess of issues, where any given issue may in practice only affect languages you don't know.
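One way to avoid both kinds of bad split is to cut only on text element boundaries, which System.Globalization.StringInfo can report. The sketch below is one possible approach, not the only one; the class and method names are hypothetical.

```csharp
using System;
using System.Globalization;

// Hypothetical helper class, named for illustration only.
public static class TextElements
{
    // Returns at most maxChars UTF-16 code units of input, cutting only on
    // text element boundaries so that surrogate pairs and base+combining
    // sequences are never split.
    public static string SafeTruncate(string input, int maxChars)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));
        if (maxChars < 0) throw new ArgumentOutOfRangeException(nameof(maxChars));
        if (input.Length <= maxChars) return input;

        // Start index (in Chars) of every text element in the string.
        int[] starts = StringInfo.ParseCombiningCharacters(input);

        // Find the last element boundary that does not exceed maxChars.
        int cut = 0;
        foreach (int start in starts)
        {
            if (start > maxChars) break;
            cut = start;
        }
        return input.Substring(0, cut);
    }
}
```

Note that "\U0001D11E" (a musical G clef) has Length 2 even though it is a single code point, and "u\u0308" is two code points that StringInfo treats as one text element. The result of SafeTruncate may therefore be shorter than maxChars when the next element does not fit.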
Answer by Pat
As mentioned, .NET strings handle Unicode transparently. Besides file I/O, the other consideration is the database layer. SQL Server, for instance, distinguishes between VARCHAR (non-Unicode) and NVARCHAR (which handles Unicode). You also need to pay attention to the types of stored procedure parameters.
Answer by nedruod
System.String already handles Unicode internally, so you are covered there. Best practice is to use UTF-8 (System.Text.Encoding.UTF8, or the System.Text.UTF8Encoding class) when reading and writing files. It's about more than files, however: anything that streams data out, including network connections, depends on the encoding. If you're using WCF, it defaults to UTF-8 for most of the bindings (in fact, most don't allow ASCII at all).
UTF-8 is a good choice because, while it supports the entire Unicode character set, it encodes every ASCII character as the same single byte that ASCII uses. Thus naive applications that don't support Unicode have some chance of reading and writing your application's data. Those applications will only begin to fail when you start using characters outside the ASCII range.
System.Text.Encoding.Unicode writes UTF-16, which uses a minimum of two bytes per character, making it both larger for ASCII-heavy text and incompatible with ASCII. And System.Text.Encoding.UTF32, as you can guess, is larger still. I'm not sure of the real-world use cases for UTF-16 and UTF-32, but perhaps they perform better when you have large numbers of extended characters. That's just a theory, but if it is true, then Japanese/Chinese developers making a product that will be used primarily in those languages might find UTF-16/32 a better choice.
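The size trade-off discussed above is easy to measure with Encoding.GetByteCount. The following is a small sketch; the class and method names are hypothetical, and the counts exclude any byte order mark.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper class, named for illustration only.
public static class EncodingSizes
{
    // Number of bytes needed to store the text in each encoding.
    public static int Utf8Bytes(string text) { return Encoding.UTF8.GetByteCount(text); }
    public static int Utf16Bytes(string text) { return Encoding.Unicode.GetByteCount(text); }
    public static int Utf32Bytes(string text) { return Encoding.UTF32.GetByteCount(text); }

    // True when the UTF-8 bytes are identical to the ASCII bytes,
    // which holds exactly when the text contains only ASCII characters.
    public static bool IsByteCompatibleWithAscii(string text)
    {
        return Encoding.UTF8.GetBytes(text).SequenceEqual(Encoding.ASCII.GetBytes(text));
    }
}
```

For "hello" the counts are 5, 10, and 20 bytes for UTF-8, UTF-16, and UTF-32. For the three-character string "\u65E5\u672C\u8A9E" (Japanese for "Japanese language") they are 9, 6, and 12 bytes, so UTF-16 is the most compact of the three for this kind of text and UTF-32 is the largest in both cases.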
Answer by pradeeptp
More details can be found in this thread:
http://discuss.joelonsoftware.com/default.asp?dotnet.12.189999.12