What is the internal representation of a string in Python 3.x?
Notice: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/1838170/
Asked by thebat
In Python 3.x, a string consists of items that are Unicode ordinals. (See the quotation from the language reference below.) What is the internal representation of a Unicode string? Is it UTF-16?
The items of a string object are Unicode code units. A Unicode code unit is represented by a string object of one item and can hold either a 16-bit or 32-bit value representing a Unicode ordinal (the maximum value for the ordinal is given in sys.maxunicode, and depends on how Python is configured at compile time). Surrogate pairs may be present in the Unicode object, and will be reported as two separate items.
Accepted answer by John Machin
There has been NO CHANGE in Unicode internal representation between Python 2.X and 3.X.
It's definitely NOT UTF-16. UTF-anything is a byte-oriented EXTERNAL representation.
Each code unit (character, surrogate, etc) has been assigned a number from range(0, 2 ** 21). This is called its "ordinal".
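As a minimal sketch of what "ordinal" means in practice: ord() maps a one-item string to its number and chr() maps it back. The sample characters below are my own choice for illustration, and the note about sys.maxunicode assumes Python 3.3 or later.

```python
import sys

# Sample characters chosen for illustration only.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# ord() maps a one-item string to its ordinal; chr() maps it back.
ordinals = [ord(ch) for ch in samples]
round_trip = [chr(n) for n in ordinals]

for n in ordinals:
    print(f"U+{n:04X} is in range(0, 2 ** 21): {n in range(0, 2 ** 21)}")

# Largest ordinal this interpreter supports. Python 3.3+ reports
# 0x10ffff; older narrow builds reported 0xffff.
print(hex(sys.maxunicode))
```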
Really, the documentation you quoted says it all. Most Python binaries use 16-bit ordinals, which restricts you to the Basic Multilingual Plane ("BMP") unless you want to muck about with surrogates (handy if you can't find your hair shirt and your bed of nails is off being de-rusted). For working with the full Unicode repertoire, you'd prefer a "wide build" (32 bits wide).
Briefly, the internal representation in a unicode object is an array of 16-bit unsigned integers, or an array of 32-bit unsigned integers (using only 21 bits).
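To picture the two array layouts, here is a rough sketch that splits the same text into 16-bit and into 32-bit unsigned integers. It uses the utf-16-le and utf-32-le codecs only as a convenient way to produce such arrays from any modern Python; the helper name code_units and the sample text are assumptions of mine, not part of CPython.

```python
def code_units(text, width):
    """Return text as a list of unsigned integers of `width` bits (16 or 32)."""
    codec = {16: "utf-16-le", 32: "utf-32-le"}[width]
    raw = text.encode(codec)
    step = width // 8
    return [int.from_bytes(raw[i:i + step], "little")
            for i in range(0, len(raw), step)]


# Hypothetical sample: one BMP character followed by one outside the BMP.
text = "a\U0001F600"

units16 = code_units(text, 16)  # the non-BMP character needs two items
units32 = code_units(text, 32)  # one item per character

print([hex(u) for u in units16])
print([hex(u) for u in units32])
```

With 16-bit items the character outside the BMP occupies two array slots (a surrogate pair), while with 32-bit items every character takes exactly one slot.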
Answer by Tobu
The internal representation will change in Python 3.3 which implements PEP 393. The new representation will pick one or several of ascii, latin-1, utf-8, utf-16, utf-32, generally trying to get a compact representation.
Implicit conversions into surrogate pairs will only be done when talking to legacy APIs (those only exist on Windows, where wchar_t is two bytes); the Python string will be preserved. Here are the release notes.
Answer by Matthew Brett
In Python 3.3 and above, the internal representation of the string will depend on the string, and can be any of latin-1, UCS-2 or UCS-4, as described in PEP 393.
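One way to observe this from Python itself, as a rough sketch: measure how much sys.getsizeof grows per extra character. This assumes CPython 3.3 or later (other implementations may report something else), and the helper name bytes_per_char and the sample characters are hypothetical choices of mine.

```python
import sys


def bytes_per_char(ch, n=10):
    """Estimate storage per character by comparing two string lengths.

    The fixed object header cancels out in the subtraction, so only the
    per-character cost remains.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


# Sample characters chosen for illustration only.
samples = {
    "ascii": "a",
    "latin-1": "\u00e9",
    "bmp": "\u20ac",
    "astral": "\U0001F600",
}

widths = {name: bytes_per_char(ch) for name, ch in samples.items()}
for name, width in widths.items():
    print(f"{name}: {width} byte(s) per character")
```

On a CPython 3.3+ interpreter I would expect 1 byte for the first two samples, 2 for the BMP sample and 4 for the last one, matching the latin-1, UCS-2 and UCS-4 storage kinds described in PEP 393.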
For previous Pythons, the internal representation depends on the build flags of Python. Python can be built with flag values --enable-unicode=ucs2 or --enable-unicode=ucs4. ucs2 builds do in fact use UTF-16 as their internal representation, and ucs4 builds use UCS-4 / UTF-32.
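For readers who have not met surrogates before, the sketch below shows the standard UTF-16 arithmetic that turns a code point above U+FFFF into two 16-bit items, which is what a ucs2 build would store. The function name to_surrogate_pair is my own, and the result is cross-checked against Python's utf-16-be codec.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000      # 20 significant bits remain
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


pair = to_surrogate_pair(0x1F600)

# Cross-check against the built-in codec.
encoded = "\U0001F600".encode("utf-16-be")
from_codec = (int.from_bytes(encoded[:2], "big"),
              int.from_bytes(encoded[2:], "big"))

print([hex(u) for u in pair], pair == from_codec)
```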
Answer by codeape
Looking at the source code for CPython 3.1.5, in Include/unicodeobject.h:
/* --- Unicode Type ------------------------------------------------------- */
typedef struct {
    PyObject_HEAD
    Py_ssize_t length;          /* Length of raw Unicode data in buffer */
    Py_UNICODE *str;            /* Raw Unicode buffer */
    long hash;                  /* Hash value; -1 if not set */
    int state;                  /* != 0 if interned. In this case the two
                                 * references from the dictionary to this object
                                 * are *not* counted in ob_refcnt. */
    PyObject *defenc;           /* (Default) Encoded version as Python
                                   string, or NULL; this is used for
                                   implementing the buffer protocol */
} PyUnicodeObject;
The characters are stored as an array of Py_UNICODE. On most platforms, I believe Py_UNICODE is #defined as wchar_t.
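If Py_UNICODE is indeed wchar_t on a given platform, the width of each stored item follows from the platform's wchar_t. A small sketch using the standard ctypes module to check that width; this only inspects the C type on the current machine and says nothing about how a particular Python build was configured.

```python
import ctypes
import sys

# Size of the platform's wchar_t in bytes: typically 2 on Windows
# and 4 on Linux and macOS.
wchar_size = ctypes.sizeof(ctypes.c_wchar)

print(f"wchar_t is {wchar_size} bytes ({wchar_size * 8} bits) on {sys.platform}")
```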
Answer by Ned Deily
Answer by YOU
I think it's hard to judge the difference between UTF-16, which is just a sequence of 16-bit words, and Python's string object.
And if Python is compiled with the Unicode=UCS4 option, the comparison would be between UTF-32 and the Python string.
So it's better to consider them as belonging to different categories, although you can convert one into the other.
Answer by user2240578
>>> import array; s = 'Привет мир!'; b = array.array('u', s).tobytes(); print(b); print(len(s) * 4 == len(b))
b'\x1f\x04\x00\x00@\x04\x00\x008\x04\x00\x002\x04\x00\x005\x04\x00\x00B\x04\x00\x00 \x00\x00\x00<\x04\x00\x008\x04\x00\x00@\x04\x00\x00!\x00\x00\x00'
True
>>> import array; s = 'test'; b = array.array('u', s).tobytes(); print(b); print(len(s) * 4 == len(b))
b't\x00\x00\x00e\x00\x00\x00s\x00\x00\x00t\x00\x00\x00'
True
>>>