What's the difference between a character, a code point, a glyph and a grapheme?

Notice: this page is derived from a popular StackOverflow question, originally presented as a side-by-side Chinese and English translation, and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/27331819/

Time: 2020-09-08 16:20:03  Source: igfitidea

What's the difference between a character, a code point, a glyph and a grapheme?

Tags: string, unicode, terminology

Asked by Mark Amery

Trying to understand the subtleties of modern Unicode is making my head hurt. In particular, the distinction between code points, characters, glyphs and graphemes - concepts which in the simplest case, when dealing with English text using ASCII characters, all have a one-to-one relationship with each other - is causing me trouble.

Seeing how these terms get used in documents like Matthias Bynens' "JavaScript has a unicode problem" or Wikipedia's piece on Han unification, I've gathered that these concepts are not the same thing and that it's dangerous to conflate them, but I'm kind of struggling to grasp what each term means.

The Unicode Consortium offers a glossary to explain this stuff, but it's full of "definitions" like this:

Abstract Character. A unit of information used for the organization, control, or representation of textual data. ...

...

Character. ... (2) Synonym for abstract character. (3) The basic unit of encoding for the Unicode character encoding. ...

...

Glyph. (1) An abstract form that represents one or more glyph images. (2) A synonym for glyph image. In displaying Unicode character data, one or more glyphs may be selected to depict a particular character.

...

Grapheme. (1) A minimally distinctive unit of writing in the context of a particular writing system. ...

Most of these definitions possess the quality of sounding very academic and formal, but lack the quality of meaning anything, or else defer the problem of definition to yet another glossary entry or section of the standard.

So I seek the arcane wisdom of those more learned than I. How exactly do each of these concepts differ from each other, and in what circumstances would they not have a one-to-one relationship with each other?

Answered by Kerrek SB

  • Character is an overloaded term that can mean many things.

  • A code point is the atomic unit of information. Text is a sequence of code points. Each code point is a number which is given meaning by the Unicode standard.

  • A code unit is the unit of storage of a part of an encoded code point. In UTF-8 this means 8 bits, in UTF-16 this means 16 bits. A single code unit may represent a full code point, or part of a code point. For example, the snowman glyph (☃) is a single code point but 3 UTF-8 code units, and 1 UTF-16 code unit.

  • A grapheme is a sequence of one or more code points that are displayed as a single, graphical unit that a reader recognizes as a single element of the writing system. For example, both a and ä are graphemes, but they may consist of multiple code points (e.g. ä may be two code points, one for the base character a followed by one for the diaeresis; but there's also an alternative, legacy, single code point representing this grapheme). Some code points are never part of any grapheme (e.g. the zero-width non-joiner, or directional overrides).

  • A glyph is an image, usually stored in a font (which is a collection of glyphs), used to represent graphemes or parts thereof. Fonts may compose multiple glyphs into a single representation, for example, if the above ä is a single code point, a font may choose to render that as two separate, spatially overlaid glyphs. For OTF, the font's GSUB and GPOS tables contain substitution and positioning information to make this work. A font may contain multiple alternative glyphs for the same grapheme, too.
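The code point versus code unit counts in the list above can be checked directly. The following is a minimal Python sketch, not part of the original answer; the helper names `utf8_units`, `utf16_units` and `describe` are made up for illustration. It relies on the fact that a Python 3 `str` is a sequence of code points, so `len()` counts code points rather than code units.

```python
import unicodedata


def utf8_units(text: str) -> int:
    """Number of UTF-8 code units (one byte each)."""
    return len(text.encode("utf-8"))


def utf16_units(text: str) -> int:
    """Number of UTF-16 code units (two bytes each, no byte order mark)."""
    return len(text.encode("utf-16-le")) // 2


def describe(text: str) -> dict:
    """Summarize a string at the code point and code unit levels."""
    return {
        "code_points": len(text),  # a Python str is a sequence of code points
        "utf8_units": utf8_units(text),
        "utf16_units": utf16_units(text),
        "names": [unicodedata.name(ch) for ch in text],
    }


snowman = "\u2603"       # SNOWMAN: one code point
precomposed = "\u00e4"   # a with diaeresis as a single, legacy code point
decomposed = "a\u0308"   # base letter a followed by COMBINING DIAERESIS

for sample in (snowman, precomposed, decomposed):
    print(describe(sample))
```

Both `precomposed` and `decomposed` would normally be displayed as the same grapheme, yet they differ in every count this sketch reports.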

Answered by Poor Yorick

Outside the Unicode standard a character is an individual unit of text composed of one or more graphemes. What the Unicode standard defines as "characters" is actually a mix of graphemes and characters. Unicode provides rules for the interpretation of juxtaposed graphemes as individual characters.

A Unicode code point is a unique number assigned to each Unicode character (which is either a character or a grapheme).

Unfortunately, the Unicode rules allow some juxtaposed graphemes to be interpreted as other graphemes that already have their own code points (precomposed forms). This means that there is more than one way in Unicode to represent a character. Unicode normalization addresses this issue.
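As a rough illustration of the point about normalization (not something from the original answer), Python's standard `unicodedata.normalize` can show two different code point sequences for the same character being brought to a common form. The function name `same_text` is a hypothetical helper, and comparing under NFC is just one reasonable choice.

```python
import unicodedata

precomposed = "\u00e4"   # one code point: a with diaeresis
decomposed = "a\u0308"   # two code points: a + COMBINING DIAERESIS


def same_text(left: str, right: str) -> bool:
    """Compare two strings after normalizing both to NFC (composed form)."""
    return unicodedata.normalize("NFC", left) == unicodedata.normalize("NFC", right)


print(precomposed == decomposed)            # False: the code points differ
print(same_text(precomposed, decomposed))   # True: equal once normalized
print(len(unicodedata.normalize("NFD", precomposed)))  # 2: NFD splits it apart
```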

A glyph is the visual representation of a character. A font provides a set of glyphs for a certain set of characters (not Unicode characters). For every character, there is an infinite number of possible glyphs.

A Reply to Mark Amery

First, as I stated, there is an infinite number of possible glyphs for each character so no, a character is not "always represented by a single glyph". Unicode doesn't concern itself much with glyphs, and the things it defines in its code charts are certainly not glyphs. The problem is that neither are they all characters. So what are they?

Which is the greater entity, the grapheme or the character? What does one call those graphic elements in text that are not letters or punctuation? One term that springs quickly to mind is "grapheme". It's a word that precisely conjures up the idea of "a graphical unit in a text". I offer this definition: A grapheme is the smallest distinct component in a written text.

One could go the other way and say that graphemes are composed of characters, but then they would be called "Chinese graphemes", and all those bits and pieces Chinese graphemes are composed of would have to be called "characters" instead. However, that's all backwards. Graphemes are the distinct little bits and pieces. Characters are more developed. The phrase "glyphs are composable" would be better stated in the context of Unicode as "characters are composable".

Unicode defines characters but it also defines graphemes that are to be composed with other graphemes or characters. Those monstrosities you composed are a fine example of this. If they catch on maybe they'll get their own code points in a later version of Unicode ;)

There's a recursive element to all this. At higher levels, graphemes become characters become graphemes, but it's graphemes all the way down.

A Reply to T S

Chapter 1 of the standard states: "The Unicode character encoding treats alphabetic characters, ideographic characters, and symbols equivalently, which means they can be used in any mixture and with equal facility". Given this statement, we should be prepared for some conflation of terms in the standard. Sometimes the proper terminology only becomes clear in retrospect as a standard develops.

It often happens in formal definitions of a language that two fundamental things are defined in terms of each other. For example, in XML an element is defined as a starting tag possibly followed by content, followed by an ending tag. Content is defined in turn as either an element, character data, or a few other possible things. A pattern of self-referential definitions is also implicit in the Unicode standard:

A grapheme is a code point or a character.

A character is composed from a sequence of one or more graphemes.

When first confronted with these two definitions the reader might object to the first definition on the grounds that a code point is a character, but that's not always true. A sequence of two code points sometimes encodes a single code point under normalization, and that encoded code point represents the character, as illustrated in figure 2.7. Sequences of code points that encode other code points: this is getting a little tricky, and we haven't even reached the layer where character encoding schemes such as UTF-8 are used to encode code points into byte sequences.
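The layering described here can be traced with a short Python sketch. This is an illustration added for clarity, not the answerer's own code; `code_points` is a made-up helper name. It looks up the canonical decomposition that the Unicode Character Database records for a precomposed code point, composes the two-code-point sequence back into one, and only then drops down to the UTF-8 byte layer.

```python
import unicodedata


def code_points(text: str) -> list:
    """Render each code point in the usual U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


pair = "a\u0308"                              # two code points
single = unicodedata.normalize("NFC", pair)   # composed under normalization

print(code_points(pair))                      # ['U+0061', 'U+0308']
print(code_points(single))                    # ['U+00E4']

# The decomposition mapping stored for U+00E4 names the same two code points.
print(unicodedata.decomposition(single))      # 0061 0308

# A separate, lower layer: code points encoded into byte sequences.
print(single.encode("utf-8"))                 # b'\xc3\xa4'
print(pair.encode("utf-8"))                   # b'a\xcc\x88'
```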

In some contexts, for example a scholarly article on diacritics, an individual part of a character might show up in the text by itself. In that context, the individual character part could be considered a character, so it makes sense that the Unicode standard remain flexible as well.

As Mark Amery pointed out, a character can be composed into a more complex thing. That is, each character can serve as a grapheme if desired. The final result of all composition is a thing that "the user thinks of as a character". There doesn't seem to be any real resistance, either in the standard or in this discussion, to the idea that at the highest level there are these things in the text that the user thinks of as individual characters. To avoid overloading that term, we can use "grapheme" in all cases where we want to refer to parts used to compose a character.

At times the Unicode standard is all over the place with its terminology. For example, Chapter 3 defines UTF-8 as an "encoding form" whereas the glossary defines "encoding form" as something else, and UTF-8 as a "Character Encoding Scheme". Another example is "Grapheme_Base" and "Grapheme_Extend", which are acknowledged to be mistakes but that persist because purging them is a bit of a task. There is still work to be done to tighten up the terminology employed by the standard.

The Proposal for addition of COMBINING GRAPHEME JOINER got it wrong when it stated that "Graphemes are sequences of one or more encoded characters that correspond to what users think of as characters." It should instead read, "A sequence of one or more graphemes composes what the user thinks of as a character." Then it could use the term "grapheme sequence" distinctly from the term "character sequence". Both terms are useful. "grapheme sequence" neatly implies the process of building up a character from smaller pieces. "character sequence" means what we all typically intuit it to mean: "A sequence of things the user thinks of as characters."

Sometimes a programmer really does want to operate at the level of grapheme sequences, so mechanisms to inspect and manipulate those sequences should be available, but generally, when processing text, it is sufficient to operate on "character sequences" (what the user thinks of as a character) and let the system manage the lower-level details.
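To make the two levels of operation concrete, here is a hedged Python sketch of reversing a string both ways. The helper names `clusters`, `reverse_code_points` and `reverse_clusters` are invented for this example, and the grouping rule (attach every combining mark to the preceding code point) is a deliberate simplification: it is not the full segmentation algorithm of Unicode Standard Annex #29 and will not handle cases such as emoji joined by zero-width joiners.

```python
import unicodedata


def clusters(text: str) -> list:
    """Group each code point with the combining marks that follow it.

    Simplified assumption: a code point with a non-zero canonical
    combining class belongs to the previous cluster.
    """
    out = []
    for ch in text:
        if out and unicodedata.combining(ch) != 0:
            out[-1] += ch
        else:
            out.append(ch)
    return out


def reverse_code_points(text: str) -> str:
    """Reverse at the lower level: one code point at a time."""
    return text[::-1]


def reverse_clusters(text: str) -> str:
    """Reverse at the level of what the user thinks of as characters."""
    return "".join(reversed(clusters(text)))


word = "cafe\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

print(len(word), len(clusters(word)))   # 5 code points, 4 clusters
print(reverse_code_points(word))        # the accent ends up detached from its 'e'
print(reverse_clusters(word))           # the accent stays attached to its 'e'
```

Working on the clusters keeps each accent with its base letter, which is usually what text-processing code wants; the code point view stays available for the cases where the lower level matters.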

In every case covered so far in this discussion, it's cleaner to use "grapheme" to refer to the indivisible components and "character" to refer to the composed entity. This usage also better reflects the long-established meanings of both terms.
