Notice: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/5290182/

Date: 2020-09-09 01:02:07  Source: igfitidea

How many bytes does one Unicode character take?

Tags: string, language-agnostic, unicode, encoding

Asked by nan

I am a bit confused about encodings. As far as I know old ASCII characters took one byte per character. How many bytes does a Unicode character require?

I assume that one Unicode character can contain every possible character from any language - am I correct? So how many bytes does it need per character?

And what do UTF-7, UTF-6, UTF-16 etc. mean? Are they different versions of Unicode?

I read the Wikipedia article about Unicode, but it is quite difficult for me. I am looking forward to seeing a simple answer.

Accepted answer by Logan Capaldo

You won't see a simple answer because there isn't one.

First, Unicode doesn't contain "every character from every language", although it sure does try.

Unicode itself is a mapping: it defines codepoints, and a codepoint is a number that is usually associated with a character. I say usually because there are concepts like combining characters. You may be familiar with things like accents or umlauts. Those can be combined with another character, such as an a or a u, to create a new logical character. A character can therefore consist of one or more codepoints.

To be useful in computing systems we need to choose a representation for this information. Those are the various Unicode encodings, such as UTF-8, UTF-16LE, UTF-32 etc. They are distinguished largely by the size of their codeunits. UTF-32 is the simplest encoding: its codeunit is 32 bits, which means any individual codepoint fits comfortably into one codeunit. In the other encodings, a codepoint may need multiple codeunits, or a particular codepoint may not be representable at all (this is a problem with UCS-2, for instance).
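
To make the codeunit idea concrete, here is a minimal Python sketch that counts how many codeunits one character needs in each encoding. The helper name `code_units` is made up for this illustration, and the little-endian codecs are used only so that no byte order mark is added to the output.

```python
def code_units(text: str) -> dict:
    """Count the code units needed to encode `text` in three Unicode encodings."""
    return {
        "utf-8": len(text.encode("utf-8")),            # 8-bit code units
        "utf-16": len(text.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf-32": len(text.encode("utf-32-le")) // 4,  # 32-bit code units
    }

# U+0061 (ASCII), U+20AC (inside the BMP), U+1F680 (outside the BMP)
for ch in ("a", "\u20ac", "\U0001F680"):
    print(f"U+{ord(ch):04X}", code_units(ch))
```

Only UTF-32 reports exactly one codeunit for all three characters; UTF-8 and UTF-16 need more than one as the codepoint grows.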

Because of the flexibility of combining characters, even within a given encoding the number of bytes per character can vary depending on the character and the normalization form. Normalization is a protocol for dealing with characters that have more than one representation: you can say "an 'a' with an accent", which is 2 codepoints, one of which is a combining char, or "accented 'a'", which is one codepoint.
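
As a rough illustration of the normalization point, the following Python sketch uses the standard `unicodedata` module to build both representations of an accented "a" and compare their sizes in UTF-8. The choice of the acute accent (U+0301) is just an example.

```python
import unicodedata

# NFC composes: "a" + COMBINING ACUTE ACCENT becomes the single code point U+00E1
composed = unicodedata.normalize("NFC", "a\u0301")
# NFD decomposes: U+00E1 becomes U+0061 followed by U+0301
decomposed = unicodedata.normalize("NFD", "\u00e1")

for label, s in (("NFC", composed), ("NFD", decomposed)):
    print(label, len(s), "code point(s),", len(s.encode("utf-8")), "UTF-8 byte(s)")
```

Both strings render as the same character, yet the composed form takes 2 bytes in UTF-8 and the decomposed form takes 3.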

Answer by paul.ago

Strangely enough, nobody pointed out how to calculate how many bytes one Unicode character takes. Here is the rule for UTF-8 encoded strings:

Binary    Hex          Comments
0xxxxxxx  0x00..0x7F   Only byte of a 1-byte character encoding
10xxxxxx  0x80..0xBF   Continuation byte: one of 1-3 bytes following the first
110xxxxx  0xC0..0xDF   First byte of a 2-byte character encoding
1110xxxx  0xE0..0xEF   First byte of a 3-byte character encoding
11110xxx  0xF0..0xF7   First byte of a 4-byte character encoding

So the quick answer is: a UTF-8 character takes 1 to 4 bytes, and the first byte indicates how many bytes it'll take up.
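
The table translates almost directly into code. This is a minimal sketch, with `utf8_sequence_length` as a made-up name; it only classifies the first byte the way the table does and does not validate the bytes that follow (a strict decoder also rejects a few lead bytes, such as 0xC0, 0xC1 and 0xF5 to 0xF7, that this range check lets through).

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the length in bytes of the UTF-8 sequence that starts with first_byte."""
    if first_byte <= 0x7F:          # 0xxxxxxx
        return 1
    if 0xC0 <= first_byte <= 0xDF:  # 110xxxxx
        return 2
    if 0xE0 <= first_byte <= 0xEF:  # 1110xxxx
        return 3
    if 0xF0 <= first_byte <= 0xF7:  # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte; 0xF8 and above are not in the table
    raise ValueError(f"0x{first_byte:02X} cannot start a UTF-8 sequence")

for ch in ("a", "\u00a9", "\u20ac", "\U0001F680"):
    data = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", utf8_sequence_length(data[0]), "byte(s)")
```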

Answer by basic6

I know this question is old and already has an accepted answer, but I want to offer a few examples (hoping it'll be useful to someone).

As far as I know old ASCII characters took one byte per character.

Right. Actually, since ASCII is a 7-bit encoding, it supports 128 codes (95 of which are printable), so it only uses half of the 256 values a byte can hold.

How many bytes does a Unicode character require?

Unicode just maps characters to codepoints. It doesn't define how to encode them. A text file does not contain Unicode characters, but bytes/octets that may represent Unicode characters.

I assume that one Unicode character can contain every possible character from any language - am I correct?

No. But almost. So basically yes. But still no.

So how many bytes does it need per character?

Same as your 2nd question.

And what do UTF-7, UTF-6, UTF-16 etc mean? Are they some kind Unicode versions?

No, those are encodings. They define how bytes/octets should represent Unicode characters.

A couple of examples. If some of those cannot be displayed in your browser (probably because the font doesn't support them), go to http://codepoints.net/U+1F6AA (replace 1F6AA with the codepoint in hex) to see an image.

    • U+0061 LATIN SMALL LETTER A: a
      • No: 97
      • UTF-8: 61
      • UTF-16: 00 61
    • U+00A9 COPYRIGHT SIGN: ©
      • No: 169
      • UTF-8: C2 A9
      • UTF-16: 00 A9
    • U+00AE REGISTERED SIGN: ®
      • No: 174
      • UTF-8: C2 AE
      • UTF-16: 00 AE
    • U+1337 ETHIOPIC SYLLABLE PHWA: ጷ
      • No: 4919
      • UTF-8: E1 8C B7
      • UTF-16: 13 37
    • U+2014 EM DASH:
      • No: 8212
      • UTF-8: E2 80 94
      • UTF-16: 20 14
    • U+2030 PER MILLE SIGN: ‰
      • No: 8240
      • UTF-8: E2 80 B0
      • UTF-16: 20 30
    • U+20AC EURO SIGN: €
      • No: 8364
      • UTF-8: E2 82 AC
      • UTF-16: 20 AC
    • U+2122 TRADE MARK SIGN: ™
      • No: 8482
      • UTF-8: E2 84 A2
      • UTF-16: 21 22
    • U+2603 SNOWMAN: ☃
      • No: 9731
      • UTF-8: E2 98 83
      • UTF-16: 26 03
    • U+260E BLACK TELEPHONE: ☎
      • No: 9742
      • UTF-8: E2 98 8E
      • UTF-16: 26 0E
    • U+2614 UMBRELLA WITH RAIN DROPS: ☔
      • No: 9748
      • UTF-8: E2 98 94
      • UTF-16: 26 14
    • U+263A WHITE SMILING FACE: ☺
      • No: 9786
      • UTF-8: E2 98 BA
      • UTF-16: 26 3A
    • U+2691 BLACK FLAG: ⚑
      • No: 9873
      • UTF-8: E2 9A 91
      • UTF-16: 26 91
    • U+269B ATOM SYMBOL: ⚛
      • No: 9883
      • UTF-8: E2 9A 9B
      • UTF-16: 26 9B
    • U+2708 AIRPLANE: ✈
      • No: 9992
      • UTF-8: E2 9C 88
      • UTF-16: 27 08
    • U+271E SHADOWED WHITE LATIN CROSS: ✞
      • No: 10014
      • UTF-8: E2 9C 9E
      • UTF-16: 27 1E
    • U+3020 POSTAL MARK FACE: 〠
      • No: 12320
      • UTF-8: E3 80 A0
      • UTF-16: 30 20
    • U+8089 CJK UNIFIED IDEOGRAPH-8089: 肉
      • No: 32905
      • UTF-8: E8 82 89
      • UTF-16: 80 89
    • U+1F4A9 PILE OF POO: 💩
      • No: 128169
      • UTF-8: F0 9F 92 A9
      • UTF-16: D8 3D DC A9
    • U+1F680 ROCKET: 🚀
      • No: 128640
      • UTF-8: F0 9F 9A 80
      • UTF-16: D8 3D DE 80
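
If you want to check entries like these yourself, a short Python sketch can reproduce them. The function name `describe` is made up for this example, and it assumes the UTF-16 column in the list is big-endian without a byte order mark, which is what the bytes shown suggest.

```python
def describe(ch: str) -> str:
    """Format one character like the list entries (UTF-16 shown big-endian, no BOM)."""
    utf8 = ch.encode("utf-8").hex(" ").upper()
    utf16 = ch.encode("utf-16-be").hex(" ").upper()
    return f"U+{ord(ch):04X} No: {ord(ch)} UTF-8: {utf8} UTF-16: {utf16}"

for ch in ("a", "\u20ac", "\u2603", "\U0001F4A9"):
    print(describe(ch))
```

Note that `bytes.hex` only accepts a separator argument on Python 3.8 or newer.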

Okay I'm getting carried away...

Fun facts:

Answer by Zimbabao

Simply speaking, Unicode is a standard which assigns one number (called a code point) to every character in the world (it's still a work in progress).

Now you need to represent these code points using bytes; that's called character encoding. UTF-8, UTF-16, UTF-6 are ways of representing those characters.

UTF-8 is a multibyte character encoding. Characters take 1 to 4 bytes (the original design allowed up to 6 bytes, but RFC 3629 limits UTF-8 to 4 because Unicode ends at U+10FFFF).

In UTF-32, each character takes 4 bytes.

UTF-16 uses one 16-bit code unit for each character in the BMP (the Basic Multilingual Plane, which is enough for most practical purposes) and two 16-bit code units, a surrogate pair, for every other character. Java uses this encoding in its strings.

Answer by John

In UTF-8:

1 byte:       0 -     7F     (ASCII)
2 bytes:     80 -    7FF     (all European plus some Middle Eastern)
3 bytes:    800 -   FFFF     (multilingual plane incl. the top 1792 and private-use)
4 bytes:  10000 - 10FFFF

In UTF-16:

2 bytes:      0 -   FFFF     (multilingual plane; D800 - DFFF are reserved for surrogates, not characters)
4 bytes:  10000 - 10FFFF     (encoded as a surrogate pair)

In UTF-32:

4 bytes:      0 - 10FFFF

10FFFF is the last Unicode codepoint by definition, and it's defined that way because it's UTF-16's technical limit.

The 4-byte UTF-8 pattern has room for codepoints up to 1FFFFF, and the idea behind UTF-8's encoding also works for 5- and 6-byte sequences to cover codepoints up to 7FFFFFFF, i.e. half of what a 32-bit code unit can hold. The current standard (RFC 3629) stops at 4 bytes and 10FFFF.
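
For what it's worth, the size ranges for the three encodings can be written down as a small Python sketch and cross-checked against Python's own encoders. The name `encoded_size` is made up here, and the function treats surrogate values (D800 to DFFF) as not encodable, which is how standard UTF-8, UTF-16 and UTF-32 treat them.

```python
def encoded_size(cp: int) -> dict:
    """Bytes needed for code point cp in UTF-8, UTF-16 and UTF-32."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    utf8 = 1 if cp <= 0x7F else 2 if cp <= 0x7FF else 3 if cp <= 0xFFFF else 4
    utf16 = 2 if cp <= 0xFFFF else 4
    return {"utf-8": utf8, "utf-16": utf16, "utf-32": 4}

# Cross-check the boundaries against the built-in codecs (little-endian, no BOM)
for cp in (0x41, 0x7FF, 0x800, 0xE000, 0xFFFF, 0x10000, 0x10FFFF):
    ch = chr(cp)
    assert encoded_size(cp) == {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }
```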

Answer by 0xC0000022L

In Unicode the answer is not easily given. The problem, as you already pointed out, is the encodings.

Given any English sentence without diacritic characters, the answer for UTF-8 would be as many bytes as characters and for UTF-16 it would be number of characters times two.

The only encoding where (as of now) we can make a statement about the size is UTF-32. There it's always 32 bits per code point, even though I imagine that code points are prepared for a future UTF-64 :)

What makes it so difficult are at least two things:

  1. composed characters, where instead of using the character entity that is already accented/diacritic (à), a user decided to combine the accent and the base character (`A).
  2. multi-byte sequences. This is the method by which the UTF encodings encode more values than the number of bits that gives them their name would usually allow. E.g. UTF-8 designates certain lead bytes which on their own are invalid, but when followed by valid continuation bytes describe a character beyond the single-byte range of 0..127. See the Examples and Overlong Encodings sections in the Wikipedia article on UTF-8.
    • The excellent example given there is that the character € (code point U+20AC) can be written either as the three-byte sequence E2 82 AC or as the four-byte sequence F0 82 82 AC.
    • Only the three-byte form is valid UTF-8. The four-byte form is an overlong encoding, which the standard forbids and decoders must reject. Even so, this shows how complicated the answer is when talking about "Unicode" and not about a specific encoding of Unicode, such as UTF-8 or UTF-16.
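
A quick way to see how a strict decoder treats the two byte sequences for U+20AC is to feed both to Python's built-in UTF-8 codec. This is only a sketch of one decoder's behavior; it decodes the shortest form and rejects the padded four-byte (overlong) form.

```python
valid = bytes.fromhex("E2 82 AC")        # shortest form of U+20AC
overlong = bytes.fromhex("F0 82 82 AC")  # same bits padded into four bytes

print(valid.decode("utf-8"))  # the euro sign

try:
    overlong.decode("utf-8")
except UnicodeDecodeError as err:
    print("rejected:", err.reason)
```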

Answer by Nic Cottrell

There is a great tool for calculating the bytes of any string in UTF-8: http://mothereff.in/byte-counter

Update: @mathias has made the code public: https://github.com/mathiasbynens/mothereff.in/blob/master/byte-counter/eff.js
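
If you just need the number locally, the same count is a one-liner in most languages. Here is a minimal Python sketch; `utf8_byte_count` is a made-up helper name and is not taken from the linked tool's code.

```python
def utf8_byte_count(text: str) -> int:
    """Return how many bytes `text` occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))

for sample in ("abc", "na\u00efve", "\U0001F680"):
    print(repr(sample), utf8_byte_count(sample))
```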

Answer by Loduwijk

Well I just pulled up the Wikipedia page on it too, and in the intro portion I saw "Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8 (which uses one byte for any ASCII characters, which have the same code values in both UTF-8 and ASCII encoding, and up to four bytes for other characters), the now-obsolete UCS-2 (which uses two bytes for each character but cannot encode every character in the current Unicode standard)"

As this quote demonstrates, your problem is that you are assuming Unicode is a single way of encoding characters. There are actually multiple encodings of Unicode, and, again in that quote, one of them (UTF-8) even uses 1 byte per character for ASCII, just like what you are used to.

So your simple answer that you want is that it varies.

Answer by prewett

For UTF-16, the character needs four bytes (two code units) if its first code unit is in the range 0xD800 to 0xDBFF; such a sequence is called a "surrogate pair." More specifically, a surrogate pair has the form:

[0xD800 - 0xDBFF]  [0xDC00 - 0xDFFF]

where [...] indicates a two-byte code unit with the given range. Anything <= 0xD7FF is one code unit (two bytes), and so is anything from 0xE000 to 0xFFFF. A code unit in 0xD800 - 0xDFFF that is not part of such a pair is invalid.
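
To show where the two code units of a surrogate pair come from, here is a small Python sketch of the standard UTF-16 arithmetic. The function name `to_surrogate_pair` is made up for this example.

```python
def to_surrogate_pair(cp: int) -> tuple:
    """Split a code point in U+10000..U+10FFFF into its two UTF-16 code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = cp - 0x10000            # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits -> 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits -> 0xDC00..0xDFFF
    return high, low

high, low = to_surrogate_pair(0x1F680)  # U+1F680 ROCKET
print(f"{high:04X} {low:04X}")
```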

See http://unicodebook.readthedocs.io/unicode_encodings.html, section 7.5.

Answer by ma11hew28

Check out this Unicode code converter. For example, enter 0x2009, where 2009 is the Unicode number for thin space, in the "0x... notation" field, and click Convert. The hexadecimal number E2 80 89 (3 bytes) appears in the "UTF-8 code units" field.
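
The same conversion can be done without the web tool. This Python sketch takes the "0x..." notation and returns the UTF-8 bytes as hex; `utf8_code_units` is a made-up name and this is not the converter's actual code.

```python
def utf8_code_units(notation: str) -> str:
    """Convert a hex code point such as '0x2009' to its UTF-8 bytes as hex."""
    code_point = int(notation, 16)  # base 16 accepts an optional 0x prefix
    return chr(code_point).encode("utf-8").hex(" ").upper()

print(utf8_code_units("0x2009"))  # E2 80 89
```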
