Python str vs unicode types
Note: this page is a Chinese-English translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. If you use it, you must follow the CC BY-SA license, cite the original address and author information, and attribute it to the original authors (not me): StackOverFlow
原文地址: http://stackoverflow.com/questions/18034272/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Python str vs unicode types
Asked by Caumons
Working with Python 2.7, I'm wondering what real advantage there is in using the type unicode instead of str, as both of them seem to be able to hold Unicode strings. Is there any special reason apart from being able to set Unicode codes in unicode strings using the escape char \?
Executing a module with:
# -*- coding: utf-8 -*-
a = 'á'
ua = u'á'
print a, ua
Results in: á, á
EDIT:
More testing using Python shell:
>>> a = 'á'
>>> a
'\xc3\xa1'
>>> ua = u'á'
>>> ua
u'\xe1'
>>> ua.encode('utf8')
'\xc3\xa1'
>>> ua.encode('latin1')
'\xe1'
>>> ua
u'\xe1'
So, the unicode string seems to be encoded using latin1 instead of utf-8, and the raw string is encoded using utf-8? I'm even more confused now! :S
Accepted answer by Bakuriu
unicode is meant to handle text. Text is a sequence of code points which may be bigger than a single byte. Text can be encoded in a specific encoding to represent the text as raw bytes (e.g. utf-8, latin-1...).
Note that unicode is not encoded! The internal representation used by Python is an implementation detail, and you shouldn't care about it as long as it is able to represent the code points you want.
On the contrary, str in Python 2 is a plain sequence of bytes. It does not represent text!
You can think of unicode as a general representation of some text, which can be encoded in many different ways into a sequence of binary data represented via str.
Note: In Python 3, unicode was renamed to str and there is a new bytes type for a plain sequence of bytes.
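In Python 3 terms, the same distinction can be sketched as follows (what Python 2 called unicode is now str, and raw bytes are bytes):

```python
# Python 3: str holds code points (text), bytes holds raw bytes.
text = 'á'                   # str: a sequence of code points
raw = text.encode('utf-8')   # bytes: the UTF-8 encoding of that text

print(len(text))                    # 1 code point
print(len(raw))                     # 2 bytes: 0xC3 0xA1
print(raw.decode('utf-8') == text)  # decoding round-trips back to text
```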
Some differences that you can see:
>>> len(u'à') # a single code point
1
>>> len('à') # by default utf-8 -> takes two bytes
2
>>> len(u'à'.encode('utf-8'))
2
>>> len(u'à'.encode('latin1')) # in latin1 it takes one byte
1
>>> print u'à'.encode('utf-8') # terminal encoding is utf-8
à
>>> print u'à'.encode('latin1') # it cannot understand the latin1 byte
?
Note that using str you have lower-level control over the single bytes of a specific encoding representation, while using unicode you can only control at the code-point level. For example you can do:
>>> 'àèìòù'
'\xc3\xa0\xc3\xa8\xc3\xac\xc3\xb2\xc3\xb9'
>>> print 'àèìòù'.replace('\xa8', '')
à?ìòù
What before was valid UTF-8, isn't anymore. Using a unicode string you cannot operate in such a way that the resulting string isn't valid unicode text. You can remove a code point, replace a code point with a different code point etc. but you cannot mess with the internal representation.
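The same experiment in Python 3 syntax (a sketch: byte-level surgery can corrupt the encoding, while code-point-level surgery cannot):

```python
# Byte level: deleting one byte of a multi-byte sequence breaks UTF-8.
raw = 'àèìòù'.encode('utf-8')        # b'\xc3\xa0\xc3\xa8\xc3\xac\xc3\xb2\xc3\xb9'
broken = raw.replace(b'\xa8', b'')   # removes the second byte of 'è'
try:
    broken.decode('utf-8')
except UnicodeDecodeError:
    print('no longer valid UTF-8')

# Code-point level: any edit still yields valid text.
text = 'àèìòù'.replace('è', '')
print(text)                          # àìòù
```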
Answered by Martijn Pieters
Your terminal happens to be configured to UTF-8.
The fact that printing a works is a coincidence; you are writing raw UTF-8 bytes to the terminal. a is a value of length two, containing two bytes, hex values C3 and A1, while ua is a unicode value of length one, containing a codepoint U+00E1.
This difference in length is one major reason to use Unicode values; you cannot easily measure the number of text characters in a byte string; the len() of a byte string tells you how many bytes were used, not how many characters were encoded.
You can see the difference when you encode the unicode value to different output encodings:
>>> a = 'á'
>>> ua = u'á'
>>> ua.encode('utf8')
'\xc3\xa1'
>>> ua.encode('latin1')
'\xe1'
>>> a
'\xc3\xa1'
Note that the first 256 codepoints of the Unicode standard match the Latin 1 standard, so the U+00E1 codepoint is encoded to Latin 1 as a byte with hex value E1.
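This correspondence is easy to verify (a Python 3 sketch):

```python
# 'á' is code point U+00E1; Latin-1 encodes it as the single byte 0xE1.
print(hex(ord('á')))            # 0xe1
print('á'.encode('latin-1'))    # b'\xe1'

# In fact every code point below 256 encodes to that same byte value:
assert all(chr(cp).encode('latin-1')[0] == cp for cp in range(256))
```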
Furthermore, Python uses escape codes in representations of unicode and byte strings alike, and low code points that are not printable ASCII are represented using \x.. escape values as well. This is why a Unicode string with a code point between 128 and 255 looks just like the Latin 1 encoding. If you have a unicode string with codepoints beyond U+00FF, a different escape sequence, \u...., is used instead, with a four-digit hex value.
It looks like you don't yet fully understand what the difference is between Unicode and an encoding. Please do read the following articles before you continue:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
Answered by Ali Rasim Kocal
When you define a as unicode, the chars a and á are equal (each counts as one character). Otherwise á counts as two chars. Try len(a) and len(ua). In addition to that, you may need to mind the encoding when you work with other environments. For example if you use md5, you get different values for a and ua.
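The md5 point can be sketched with Python 3's hashlib (hashes operate on bytes, so the digest depends on which encoding you picked):

```python
import hashlib

text = 'á'
utf8_digest = hashlib.md5(text.encode('utf-8')).hexdigest()
latin1_digest = hashlib.md5(text.encode('latin-1')).hexdigest()

# Same text, different byte sequences, hence different hashes.
print(utf8_digest)
print(latin1_digest)
```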
Answered by weibeld
Unicode and encodings are completely different, unrelated things.
Unicode
Assigns a numeric ID to each character:
- 0x41 → A
- 0xE1 → á
- 0x414 → Д
So, Unicode assigns the number 0x41 to A, 0xE1 to á, and 0x414 to Д.
Even the little arrow → I used has its Unicode number, it's 0x2192. And even emojis have their Unicode numbers: 😂 is 0x1F602.
You can look up the Unicode numbers of all characters in this table. In particular, you can find the first three characters above here, the arrow here, and the emoji here.
These numbers assigned to all characters by Unicode are called code points.
The purpose of all this is to provide a means to unambiguously refer to each character. For example, if I'm talking about 😂, instead of saying "you know, this laughing emoji with tears", I can just say: Unicode code point 0x1F602. Easier, right?
Note that Unicode code points are usually formatted with a leading U+, then the hexadecimal numeric value padded to at least 4 digits. So, the above examples would be U+0041, U+00E1, U+0414, U+2192, U+1F602.
Unicode code points range from U+0000 to U+10FFFF. That is 1,114,112 numbers. 2048 of these numbers are used for surrogates, thus, there remain 1,112,064. This means, Unicode can assign a unique ID (code point) to 1,112,064 distinct characters. Not all of these code points are assigned to a character yet, and Unicode is extended continuously (for example, when new emojis are introduced).
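The arithmetic checks out (a quick sketch):

```python
total = 0x10FFFF + 1           # code points U+0000 through U+10FFFF
surrogates = 0xE000 - 0xD800   # U+D800..U+DFFF are reserved for surrogates

print(total)               # 1114112
print(surrogates)          # 2048
print(total - surrogates)  # 1112064 code points assignable to characters
```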
The important thing to remember is that all Unicode does is to assign a numerical ID, called code point, to each character for easy and unambiguous reference.
Encodings
Map characters to bit patterns.
These bit patterns are used to represent the characters in computer memory or on disk.
There are many different encodings that cover different subsets of characters. In the English-speaking world, the most common encodings are the following:
ASCII
Maps 128 characters (code points U+0000 to U+007F) to bit patterns of length 7.
Example:
- a → 1100001 (0x61)
You can see all the mappings in this table.
ISO 8859-1 (aka Latin-1)
Maps 191 characters (code points U+0020 to U+007E and U+00A0 to U+00FF) to bit patterns of length 8.
Example:
- a → 01100001 (0x61)
- á → 11100001 (0xE1)
You can see all the mappings in this table.
UTF-8
Maps 1,112,064 characters (all existing Unicode code points) to bit patterns of either length 8, 16, 24, or 32 bits (that is, 1, 2, 3, or 4 bytes).
Example:
- a → 01100001 (0x61)
- á → 11000011 10100001 (0xC3 0xA1)
- ≠ → 11100010 10001001 10100000 (0xE2 0x89 0xA0)
- 😂 → 11110000 10011111 10011000 10000010 (0xF0 0x9F 0x98 0x82)
The way UTF-8 encodes characters to bit strings is very well described here.
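These byte counts can be checked directly (a Python 3 sketch):

```python
# UTF-8 uses 1 to 4 bytes per code point, growing with its magnitude.
for ch in ('a', 'á', '≠', '\U0001F602'):   # the four examples above
    encoded = ch.encode('utf-8')
    print('U+%06X -> %d byte(s): %s' % (ord(ch), len(encoded), encoded.hex()))
```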
Unicode and Encodings
Looking at the above examples, it becomes clear how Unicode is useful.
For example, if I'm Latin-1 and I want to explain my encoding of á, I don't need to say:
"I encode that a with an aigu (or however you call that rising bar) as 11100001"
But I can just say:
"I encode U+00E1 as 11100001"
And if I'm UTF-8, I can say:
"Me, in turn, I encode U+00E1 as 11000011 10100001"
And it's unambiguously clear to everybody which character we mean.
Now to the often arising confusion
It's true that sometimes the bit pattern of an encoding, if you interpret it as a binary number, is the same as the Unicode code point of this character.
For example:
- ASCII encodes a as 1100001, which you can interpret as the hexadecimal number 0x61, and the Unicode code point of a is U+0061.
- Latin-1 encodes á as 11100001, which you can interpret as the hexadecimal number 0xE1, and the Unicode code point of á is U+00E1.
Of course, this has been arranged like this on purpose for convenience. But you should look at it as a pure coincidence. The bit pattern used to represent a character in memory is not tied in any way to the Unicode code point of this character.
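Both the coincidence and its limits can be demonstrated (a Python 3 sketch):

```python
# ASCII: the encoded byte value equals the code point...
assert ord('a') == 0x61 and 'a'.encode('ascii') == b'\x61'
# ...and likewise for Latin-1:
assert ord('á') == 0xE1 and 'á'.encode('latin-1') == b'\xe1'
# ...but the same character in UTF-8 shows the pattern is not a rule:
assert 'á'.encode('utf-8') == b'\xc3\xa1'   # two bytes, neither is 0xE1
print('all checks passed')
```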
Nobody even says that you have to interpret a bit string like 11100001 as a binary number. Just look at it as the sequence of bits that Latin-1 uses to encode the character á.
Back to your question
The encoding used by your Python interpreter is UTF-8.
Here's what's going on in your examples:
Example 1
The following encodes the character á in UTF-8. This results in the bit string 11000011 10100001, which is saved in the variable a.
>>> a = 'á'
When you look at the value of a, its content 11000011 10100001 is formatted as the hex number 0xC3 0xA1 and output as '\xc3\xa1':
>>> a
'\xc3\xa1'
Example 2
The following saves the Unicode code point of á, which is U+00E1, in the variable ua (we don't know which data format Python uses internally to represent the code point U+00E1 in memory, and it's unimportant to us):
>>> ua = u'á'
When you look at the value of ua, Python tells you that it contains the code point U+00E1:
>>> ua
u'\xe1'
Example 3
The following encodes Unicode code point U+00E1 (representing character á) with UTF-8, which results in the bit pattern 11000011 10100001. Again, for output this bit pattern is represented as the hex number 0xC3 0xA1:
>>> ua.encode('utf-8')
'\xc3\xa1'
Example 4
The following encodes Unicode code point U+00E1 (representing character á) with Latin-1, which results in the bit pattern 11100001. For output, this bit pattern is represented as the hex number 0xE1, which by coincidence is the same as the initial code point U+00E1:
>>> ua.encode('latin1')
'\xe1'
There's no relation between the Unicode object ua and the Latin-1 encoding. That the code point of á is U+00E1 and the Latin-1 encoding of á is 0xE1 (if you interpret the bit pattern of the encoding as a binary number) is a pure coincidence.
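In Python 3 the four examples above collapse into a few lines (a sketch; the u'...' prefix is optional there, since every str is text):

```python
ua = chr(0x00E1)                            # the text 'á' as one code point

assert ua.encode('utf-8') == b'\xc3\xa1'    # examples 1 and 3: UTF-8 bytes
assert ua.encode('latin-1') == b'\xe1'      # example 4: matches the code
                                            # point value only by coincidence
assert b'\xc3\xa1'.decode('utf-8') == ua    # decoding recovers the text
print('ok')
```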