Python: What is a unicode string?

Question

Asked by Stevanus Iskandar

What exactly is a unicode string?

What's the difference between a regular string and unicode string?

What is utf-8?

I'm trying to learn Python right now, and I keep hearing this buzzword. What does the code below do?

i18n Strings (Unicode)

> ustring = u'A unicode \u018e string \xf1'
> ustring
u'A unicode \u018e string \xf1'

## (ustring from above contains a unicode string)
> s = ustring.encode('utf-8')
> s
'A unicode \xc6\x8e string \xc3\xb1'  ## bytes of utf-8 encoding
> t = unicode(s, 'utf-8')             ## Convert bytes back to a unicode string
> t == ustring                      ## It's the same as the original, yay!
True

Files Unicode

import codecs

f = codecs.open('foo.txt', 'rU', 'utf-8')
for line in f:
    # here line is a *unicode* string
    pass

Answer 1

Accepted answer by tom

This answer is about Python 2. In Python 3, str is a Unicode string.

Python's str type is a collection of 8-bit characters. The English alphabet can be represented using these 8-bit characters, but symbols such as ± and Ω cannot.

Unicode is a standard for working with a wide range of characters. Each symbol has a codepoint (a number), and these codepoints can be encoded (converted to a sequence of bytes) using a variety of encodings.
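As a minimal sketch of the codepoint idea, the snippet below uses Python 3 (where, as noted above, str is already a Unicode string) rather than the Python 2 of the question. The variable names are illustrative only.

```python
# Python 3 sketch: every character corresponds to a codepoint (an integer).
omega = "\u03a9"             # GREEK CAPITAL LETTER OMEGA
code_point = ord(omega)      # character -> codepoint
print(hex(code_point))       # 0x3a9

# chr() goes the other way: codepoint -> character.
print(chr(0xF1) == "\xf1")   # True
```

Codepoints are conventionally written in hexadecimal, which is why the escapes in the question look like \u018e and \xf1.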

UTF-8 is one such encoding. The low codepoints are encoded using a single byte, and higher codepoints are encoded as sequences of bytes.
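To make the variable-length behaviour concrete, here is a small Python 3 sketch that counts how many bytes UTF-8 uses for a few sample characters. The sample list is my own choice, not part of the original answer.

```python
# Python 3 sketch: UTF-8 uses more bytes as the codepoint gets higher.
samples = ["A", "\xf1", "\u018e", "\u03a9", "\u2660"]

# Encode each character on its own and measure the byte length.
byte_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(byte_lengths)  # [1, 2, 2, 2, 3]
```

Only the ASCII letter fits in one byte; the two non-ASCII characters from the question (\xf1 and \u018e) each take two.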

Python's unicode type is a collection of codepoints. The line ustring = u'A unicode \u018e string \xf1' creates a Unicode string with 20 characters.

When the Python interpreter displays the value of ustring, it escapes two of the characters (Ǝ and ñ) because they are not in the standard printable range.

The line s = ustring.encode('utf-8') encodes the Unicode string using UTF-8. This converts each codepoint to the appropriate byte or sequence of bytes. The result is a collection of bytes, which is returned as a str. The size of s is 22 bytes, because two of the characters have high codepoints and are encoded as a sequence of two bytes rather than a single byte.

When the Python interpreter displays the value of s, it escapes four bytes that are not in the printable range (\xc6, \x8e, \xc3, and \xb1). The two pairs of bytes are not treated as single characters like before because s is of type str, not unicode.
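The character and byte counts described above can be checked with a short Python 3 sketch. Note the assumption: in Python 3 the u prefix is optional and encode() returns a bytes object rather than a Python 2 str, so the displayed value carries a b prefix.

```python
# Python 3 sketch of the encoding step from the question.
ustring = "A unicode \u018e string \xf1"
s = ustring.encode("utf-8")

print(len(ustring))  # 20 (codepoints)
print(len(s))        # 22 (bytes: two characters need two bytes each)
print(s)             # b'A unicode \xc6\x8e string \xc3\xb1'
```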

The line t = unicode(s, 'utf-8') does the opposite of encode(). It reconstructs the original codepoints by looking at the bytes of s and parsing byte sequences. The result is a Unicode string.
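A Python 3 sketch of the decoding step follows. Python 3 has no unicode() built-in, so bytes.decode() stands in for it here. The truncated-input check at the end is an added illustration, not something the original answer discusses.

```python
# Python 3 sketch: decode UTF-8 bytes back into a Unicode string.
s = b"A unicode \xc6\x8e string \xc3\xb1"   # UTF-8 bytes from the question

t = s.decode("utf-8")   # Python 3 counterpart of unicode(s, 'utf-8')
print(t == "A unicode \u018e string \xf1")  # True

# Decoding needs complete byte sequences: a lone lead byte is rejected.
try:
    b"\xc6".decode("utf-8")
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False
print(truncated_ok)  # False
```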

The call to codecs.open() specifies utf-8 as the encoding, which tells Python to interpret the contents of the file (a collection of bytes) as a Unicode string that has been encoded using UTF-8.
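For comparison, a hedged Python 3 sketch of the same idea: the built-in open() accepts an encoding argument, so codecs.open() is not needed there. The temporary directory is an assumption made only so the example is self-contained; the question reads an existing foo.txt.

```python
import os
import tempfile

# Python 3 sketch: write a UTF-8 file, then read it back as text.
text = "A unicode \u018e string \xf1\n"
path = os.path.join(tempfile.mkdtemp(), "foo.txt")  # hypothetical location

with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, "r", encoding="utf-8") as f:
    lines = [line for line in f]   # each line is a str (Unicode) object

print(lines == [text])  # True
```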

Answer 2

Answered by Renjith Nair

Python supports the string type and the unicode type. A string is a sequence of chars, while a unicode is a sequence of "pointers". A unicode is an in-memory representation of the sequence in which each symbol is not a char but a number (conventionally written in hex) that selects a character in a map. So a unicode variable has no encoding, because it holds these numbers rather than encoded chars.
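One way to see that a Unicode string carries no encoding of its own is to encode the same text in several ways. The Python 3 sketch below is illustrative; the choice of the three encodings is mine.

```python
# Python 3 sketch: one codepoint, three different byte representations.
text = "\xf1"   # LATIN SMALL LETTER N WITH TILDE, codepoint 0xF1

as_utf8 = text.encode("utf-8")        # b'\xc3\xb1'
as_latin1 = text.encode("latin-1")    # b'\xf1'
as_utf16 = text.encode("utf-16-le")   # b'\xf1\x00'

# Decoding each with its matching encoding gives back the same text.
round_trips = [
    as_utf8.decode("utf-8"),
    as_latin1.decode("latin-1"),
    as_utf16.decode("utf-16-le"),
]
print(all(r == text for r in round_trips))  # True
```

The encoding only comes into play at the boundary where text is turned into bytes or bytes into text.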
