Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/22149/

Time: 2020-11-03 19:24:11  Source: igfitidea

Unicode vs UTF-8 confusion in Python / Django?

Tags: python, django, unicode

Asked by Hanno Fietz

I stumbled over this passage in the Django tutorial:

Django models have a default __str__() method that calls __unicode__() and converts the result to a UTF-8 bytestring. This means that unicode(p) will return a Unicode string, and str(p) will return a normal string, with characters encoded as UTF-8.
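
The text/bytes split described in that passage can be sketched in Python 3 terms, where str is the Unicode type and bytes plays the role of Python 2's "normal string". The Person class below is a made-up stand-in, not Django's actual implementation:

```python
class Person:
    """Hypothetical stand-in for a model; not Django's real code."""

    def __init__(self, name):
        self.name = name

    def __str__(self):
        # Python 3 text, the counterpart of Python 2's unicode(p).
        return self.name

    def __bytes__(self):
        # The counterpart of Python 2's str(p): the same text as UTF-8 bytes.
        return str(self).encode("utf-8")


p = Person("Zo\u00eb")   # "Zoë"
text = str(p)            # Unicode string: a sequence of code points
raw = bytes(p)           # byte string: those code points encoded as UTF-8

print(len(text), len(raw))
```

The two lengths differ because the single code point U+00EB takes two bytes in UTF-8.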

Now, I'm confused because AFAIK Unicode is not any particular representation, so what is a "Unicode string" in Python? Does that mean UCS-2? Googling turned up this "Python Unicode Tutorial", which boldly states:

Unicode is a two-byte encoding which covers all of the world's common writing systems.

which is plain wrong, or is it? I have been confused many times by character set and encoding issues, but here I'm quite sure that the documentation I'm reading is confused. Does anybody know what's going on in Python when it gives me a "Unicode string"?

Answer by bobince

what is a "Unicode string" in Python? Does that mean UCS-2?

Unicode strings in Python are stored internally either as UCS-2 (fixed-length 16-bit representation, almost the same as UTF-16) or UCS-4/UTF-32 (fixed-length 32-bit representation). It's a compile-time option; on Windows it's always UTF-16 whilst many Linux distributions set UTF-32 ('wide mode') for their versions of Python.
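
For what it's worth, which kind of build you are running can be checked from sys.maxunicode, which is 0xFFFF on a narrow (16-bit) build and 0x10FFFF on a wide one. A minimal sketch follows; the helper name build_kind is made up for illustration. Note that from Python 3.3 on (PEP 393) the compile-time switch is gone and every build reports 0x10FFFF:

```python
import sys


def build_kind(maxunicode=sys.maxunicode):
    """Classify a build from its largest code point (hypothetical helper)."""
    return "wide" if maxunicode > 0xFFFF else "narrow"


print(hex(sys.maxunicode), build_kind())
```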

You are generally not supposed to care: you will see Unicode code-points as single elements in your strings and you won't know whether they're stored as two or four bytes. If you're in a UTF-16 build and you need to handle characters outside the Basic Multilingual Plane you'll be Doing It Wrong, but that's still very rare, and users who really need the extra characters should be compiling wide builds.
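
To make the narrow-build caveat concrete, here is a small sketch that counts UTF-16 code units, which is roughly what len() reports on a narrow build. It assumes Python 3, where str is the Unicode type; utf16_code_units is an illustrative name, not a standard function:

```python
def utf16_code_units(text):
    """Number of 16-bit code units needed to store text as UTF-16."""
    # Encoding without a BOM (utf-16-le) gives exactly 2 bytes per code unit.
    return len(text.encode("utf-16-le")) // 2


euro = "\u20ac"       # EURO SIGN, inside the Basic Multilingual Plane
clef = "\U0001d11e"   # MUSICAL SYMBOL G CLEF, outside the BMP

print(len(euro), utf16_code_units(euro))   # one code point, one code unit
print(len(clef), utf16_code_units(clef))   # one code point, a surrogate pair
```

A character outside the BMP needs two code units (a surrogate pair) in UTF-16, so indexing and slicing on a narrow build can split it in half.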

plain wrong, or is it?

Yes, it's quite wrong. To be fair I think that tutorial is rather old; it probably pre-dates wide Unicode strings, if not Unicode 3.1 (the version that introduced characters outside the Basic Multilingual Plane).

There is an additional source of confusion stemming from Windows's habit of using the term “Unicode” to mean, specifically, the UTF-16LE encoding that NT uses internally. People from Microsoftland may often copy this somewhat misleading habit.

Answer by Hanno Fietz

Meanwhile, I did some more thorough research to verify what the internal representation in Python is, and also what its limits are. "The Truth About Unicode In Python" is a very good article which quotes the Python developers directly. Apparently, the internal representation is either UCS-2 or UCS-4, depending on a compile-time switch. So Jon, it's not UTF-16, but your answer put me on the right track anyway, thanks.

Answer by Jonathan Works

Python stores Unicode as UTF-16. str() will return the UTF-8 representation of the UTF-16 string.

Answer by Andy

From Wikipedia on UTF-8:

UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. It is able to represent any character in the Unicode standard, yet the initial encoding of byte codes and character assignments for UTF-8 is backwards compatible with ASCII. For these reasons, it is steadily becoming the preferred encoding for e-mail, web pages[1], and other places where characters are stored or streamed.

So, it's anywhere between one and four bytes depending on which character you wish to represent within the realm of Unicode.
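
A quick way to see the one-to-four-byte range is to encode one character from each size class and measure the result. This is only a sketch, assuming Python 3; the sample characters and dictionary keys are arbitrary picks:

```python
samples = {
    "A": "\u0041",           # U+0041, plain ASCII
    "e-acute": "\u00e9",     # U+00E9
    "euro": "\u20ac",        # U+20AC
    "g-clef": "\U0001d11e",  # U+1D11E, outside the Basic Multilingual Plane
}

# Number of bytes each character occupies once encoded as UTF-8.
utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

print(utf8_lengths)
```

The ASCII character encodes to the same single byte it has in ASCII, which is the backwards compatibility the Wikipedia passage mentions.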

From Wikipedia on Unicode:

In computing, Unicode is an industry standard allowing computers to consistently represent and manipulate text expressed in most of the world's writing systems.

So it's able to represent most (but not all) of the world's writing systems.

I hope this helps :)

Answer by Ravi Chhabra

so what is a "Unicode string" in Python?

Python 'knows' that your string is Unicode. Hence, if you run a regex on it, it will know what is a character and what is not, which is really helpful. If you take its length with len(), it will also give the correct result. As an example, if you count the characters in Hello, you will get 5 (even if it's Unicode). But if you count the characters of a foreign word and that string is not a Unicode string, then you will get a larger result, because you are counting bytes rather than characters. Python uses the information from the Unicode Character Database to identify each character in the Unicode string. Hope that helps.
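
A small sketch of that difference, assuming Python 3 (where str is the Unicode string and bytes is the encoded form); the sample word is an arbitrary choice:

```python
import re
import unicodedata

word = "Gr\u00fc\u00dfe"          # "Grüße" as a Unicode string
encoded = word.encode("utf-8")    # the same word as a UTF-8 byte string

text_len = len(word)      # counts code points
byte_len = len(encoded)   # counts bytes

# \w understands Unicode letters on text, but only ASCII on bytes.
text_words = re.findall(r"\w+", word)
byte_words = re.findall(rb"\w+", encoded)

# Character properties come from the Unicode Character Database.
sharp_s_name = unicodedata.name("\u00df")

print(text_len, byte_len)
print(text_words, byte_words)
print(sharp_s_name)
```

On the byte string the regex splits the word at the non-ASCII bytes, while on the Unicode string it treats the whole word as one run of letters.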
