Original question: http://stackoverflow.com/questions/172133/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
UTF8 vs. UTF16 vs. char* vs. what? Someone explain this mess to me!
Asked by dicroce
I've managed to mostly ignore all this multi-byte character stuff, but now I need to do some UI work and I know my ignorance in this area is going to catch up with me! Can anyone explain in a few paragraphs or less just what I need to know so that I can localize my applications? What types should I be using? (I use both .NET and C/C++, and I need this answer for both Unix and Windows.)
Answered by Dylan Beattie
Check out Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
EDIT 20140523: Also, watch Characters, Symbols and the Unicode Miracle by Tom Scott on YouTube. It's just under ten minutes, and a wonderful explanation of the brilliant 'hack' that is UTF-8.
Answered by Brian R. Bondy
A character encoding consists of a sequence of codes that each look up a symbol from a given character set. Please see this good article on Wikipedia on character encoding.
UTF8 (UCS) uses 1 to 4 bytes for each symbol. Wikipedia gives a good rundown of how the multi-byte encoding works:
- The most significant bit of a single-byte character is always 0.
- The most significant bits of the first byte of a multi-byte sequence determine the length of the sequence. These most significant bits are 110 for two-byte sequences; 1110 for three-byte sequences, and so on.
- The remaining bytes in a multi-byte sequence have 10 as their two most significant bits.
- A UTF-8 stream contains neither the byte FE nor FF. This makes sure that a UTF-8 stream never looks like a UTF-16 stream starting with U+FEFF (Byte-order mark)
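The bit patterns in the list above can be turned into a small encoder. This is only a sketch to make the rules concrete: utf8_encode is a hypothetical helper name, it does no validation (surrogates and out-of-range values are not rejected), and in real code you would use the built-in encoder it is checked against here.

```python
def utf8_encode(code_point):
    """Encode one code point using the UTF-8 lead/continuation byte patterns.

    Illustrative only: no validation of surrogates or of the upper limit.
    """
    if code_point < 0x80:
        # Single byte: most significant bit is 0.
        return bytes([code_point])
    if code_point < 0x800:
        # Two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Compare against Python's own encoder and show the bits of each byte.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")
    print("U+%04X" % ord(ch), " ".join(format(b, "08b") for b in encoded))
```

Every continuation byte printed starts with 10, and none of the bytes is FE or FF, which matches the last two rules in the list.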
The page also shows you a great comparison between the advantages and disadvantages of each character encoding type.
UTF16 (UCS) uses 2 to 4 bytes for each symbol.
UTF32 (UCS) always uses 4 bytes for each symbol.
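To see the sizes of UTF-8, UTF-16 and UTF-32 side by side, here is a quick check using Python's built-in codecs. The sample characters and the encoded_sizes helper are my own choices for illustration.

```python
samples = {
    "A": "\u0041",
    "e-acute": "\u00e9",
    "euro": "\u20ac",
    "emoji": "\U0001F600",
}


def encoded_sizes(ch):
    """Return the byte length of ch in UTF-8, UTF-16 and UTF-32."""
    # The "-le" variants are used so that no byte-order mark is prepended.
    return tuple(len(ch.encode(enc))
                 for enc in ("utf-8", "utf-16-le", "utf-32-le"))


for name, ch in samples.items():
    print(name, encoded_sizes(ch))
```

Only the emoji, which lies above U+FFFF, needs 4 bytes in UTF-16. UTF-32 is 4 bytes for all of them.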
char just means a byte of data and is not an actual encoding. It is not analogous to UTF8/UTF16/ASCII. A char* pointer can refer to any type of data and any encoding.
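The same point shown in Python, where a bytes object plays roughly the role of a char buffer (that analogy is mine, not part of the original answer): the bytes carry no encoding, so the text you get out depends entirely on which encoding you assume.

```python
raw = bytes([0xC3, 0xA9])  # two bytes, with no encoding attached

# Interpreted as UTF-8, the two bytes form one character (U+00E9).
as_utf8 = raw.decode("utf-8")

# Interpreted as Latin-1, each byte is its own character (U+00C3, U+00A9).
as_latin1 = raw.decode("latin-1")

print(len(as_utf8), len(as_latin1))
```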
STL:
Neither std::wstring nor std::string in the STL is designed for variable-length character encodings like UTF-8 and UTF-16.
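To illustrate what goes wrong with a container that only counts fixed-size units, here is a rough sketch using a Python bytes object as a stand-in for a byte-oriented string. This is an analogy, not a demonstration of the STL itself.

```python
text = "na\u00efve"          # "naive" with a diaeresis: 5 code points
data = text.encode("utf-8")  # 6 bytes, because U+00EF takes two bytes

# The "length" depends on whether you count code points or bytes.
print(len(text), len(data))

# Cutting at a byte offset can split a multi-byte sequence in half.
truncated = data[:3]         # b"na\xc3", a lead byte with no continuation
try:
    truncated.decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False

print(split_ok)
```

Length, indexing and truncation on the raw units all ignore character boundaries, so they need an encoding-aware layer on top.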
How to implement:
Take a look at the iconv library. iconv is a powerful character encoding conversion library used by such projects as libxml (the XML C parser of GNOME).
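The kind of conversion such a library performs can be sketched with Python's built-in codecs. Note that this does not call iconv at all, and convert is a hypothetical helper name; it only shows the general idea of decoding to code points and re-encoding.

```python
def convert(data, from_encoding, to_encoding):
    """Re-encode bytes by decoding to code points, then encoding again."""
    return data.decode(from_encoding).encode(to_encoding)


latin1_bytes = b"caf\xe9"  # "cafe" with an acute accent, in Latin-1
utf8_bytes = convert(latin1_bytes, "latin-1", "utf-8")

print(utf8_bytes)
```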
Other great resources on character encoding:
- tbray.org's Characters vs. Bytes
- IANA character sets
- www.cs.tut.fi's A tutorial on code issues
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) (first mentioned by @Dylan Beattie)
Answered by mmalc
Received wisdom suggests that Spolsky's article misses a couple of important points.
This article is recommended as being more complete: The Unicode® Standard: A Technical Introduction
This article is also a good introduction: Unicode Basics
The latter in particular gives an overview of the character encoding forms and schemes for Unicode.
Answered by John Nilsson
The various UTF standards are ways to encode "code points". A code point is an index into the Unicode character set.
Another encoding is UCS-2, which is always 16 bits wide and thus doesn't support the full Unicode range.
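A single 16-bit unit can only index code points up to U+FFFF. UTF-16 reaches the rest by combining two units into a surrogate pair, which a fixed 16-bit encoding cannot do. A minimal sketch of that arithmetic follows; utf16_code_units is a hypothetical helper name and it performs no validation.

```python
def utf16_code_units(code_point):
    """Return the UTF-16 code units (as integers) for one code point."""
    if code_point < 0x10000:
        # Fits in one 16-bit unit; this is all a 16-bit-only encoding has.
        return [code_point]
    # Above U+FFFF: split the 20-bit offset into a high and low surrogate.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]


print([hex(u) for u in utf16_code_units(0x20AC)])   # one unit
print([hex(u) for u in utf16_code_units(0x1F600)])  # two units
```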
It is also good to know that one code point isn't necessarily equal to one character. For example, a character such as å can be represented either as a single code point or as two code points, one for the a and one for the ring.
Comparing two Unicode strings thus requires normalization to get the canonical representation before comparison.
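As a small illustration, here is a sketch using Python's standard unicodedata module. The equal_normalized helper is a hypothetical name, and NFC is just one reasonable choice of normalization form.

```python
import unicodedata

composed = "\u00e5"     # a-with-ring as a single code point
decomposed = "a\u030a"  # "a" followed by COMBINING RING ABOVE


def equal_normalized(left, right, form="NFC"):
    """Compare two strings after bringing both to the same normal form."""
    return (unicodedata.normalize(form, left)
            == unicodedata.normalize(form, right))


print(composed == decomposed)                  # raw comparison
print(equal_normalized(composed, decomposed))  # normalized comparison
```

The two strings render identically but differ code point by code point, so only the normalized comparison treats them as equal.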
Answered by John Nilsson
There is also the issue of fonts. There are two ways to handle them. Either you use a gigantic font with glyphs for all the Unicode characters you need (I think recent versions of Windows come with one or two such fonts), or you use some library capable of combining glyphs from various fonts dedicated to subsets of the Unicode standard.