Unicode vs UTF-8 confusion in Python / Django?

Date: 2020-03-05 18:41:57  Source: igfitidea

I stumbled over this passage in the Django tutorial:

Django models have a default __str__() method that calls __unicode__() and converts the result to a UTF-8 bytestring. This means that unicode(p) will return a Unicode string, and str(p) will return a normal string, with characters encoded as UTF-8.

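Now I'm confused, because afaik Unicode is not any particular representation, so what is a "Unicode string" in Python? Does that mean UCS-2? Googling turned up this "Python Unicode Tutorial", which boldly states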

Unicode is a two-byte encoding which covers all of the world's common writing systems.

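which is plain wrong, or is it? Character set and encoding issues have confused me many times, but here I'm quite sure that the documentation I'm reading is the confused one. Does anybody know what is going on in Python when it gives me a "Unicode string"?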

Solutions

Answer

Python stores Unicode as UTF-16. str() will return the UTF-8 representation of the UTF-16 string.

Answer

From Wikipedia on UTF-8:

UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. It is able to represent any character in the Unicode standard, yet the initial encoding of byte codes and character assignments for UTF-8 is backwards compatible with ASCII. For these reasons, it is steadily becoming the preferred encoding for e-mail, web pages[1], and other places where characters are stored or streamed.

So a character takes anywhere between one and four bytes, depending on which character in the Unicode range you want to represent.
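To make the one-to-four-byte claim concrete, here is a minimal sketch. It assumes Python 3, where `str` is the text type; on the Python 2 of this thread the same literals would need a `u` prefix. The helper name `utf8_length` is made up for illustration.

```python
def utf8_length(ch):
    """Return how many bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))

# One sample character from each UTF-8 length class.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    print("U+%04X -> %d byte(s)" % (ord(ch), utf8_length(ch)))
# U+0041 -> 1 byte(s)
# U+00E9 -> 2 byte(s)
# U+20AC -> 3 byte(s)
# U+1F600 -> 4 byte(s)
```

ASCII characters keep their single-byte values, which is the backwards compatibility the Wikipedia quote refers to.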

From Wikipedia on Unicode:

In computing, Unicode is an industry standard allowing computers to consistently represent and manipulate text expressed in most of the world's writing systems.

So it is able to represent most (but not all) of the world's writing systems.

I hope this helps :)

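Answer

so what is a "Unicode string" in Python?

Python "knows" that your string is Unicode. Hence, if you run a regex on it, it will know which is a character and which is not, which is really helpful. If you take its length, it will also give the correct result. For example, if you count the characters in Hello you will get 5 (even if it is Unicode). But if you count the characters of a foreign word and the string is not a Unicode string, the result will be larger, because you are counting bytes rather than characters. Python uses the information from the Unicode Character Database to identify each character in a Unicode string. Hope that helps.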
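A small sketch of the difference between counting characters and counting bytes. It is written for Python 3, which is an assumption on my part: there the text type is `str` and the byte type is `bytes`, whereas in the Python 2 discussed in this thread they were `unicode` and `str`. The sample word is arbitrary.

```python
import re
import unicodedata

word = "na\u00efve"             # "naive" with a diaeresis on the i: 5 characters
encoded = word.encode("utf-8")  # the same text as a UTF-8 byte string

char_count = len(word)          # counts code points
byte_count = len(encoded)       # counts bytes; the accented letter takes 2

# \w on text matches the accented letter; \w on bytes is ASCII-only.
text_matches = re.findall(r"\w", word)
byte_matches = re.findall(rb"\w", encoded)

# The Unicode Character Database supplies per-character information.
char_name = unicodedata.name(word[2])

print(char_count, byte_count)                # 5 6
print(len(text_matches), len(byte_matches))  # 5 4
print(char_name)                             # LATIN SMALL LETTER I WITH DIAERESIS
```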

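Answer

Meanwhile, I did some thorough research to verify Python's internal representation and its limits. "The Truth About Unicode In Python" is a very good article that quotes the Python developers directly. Apparently, the internal representation is either UCS-2 or UCS-4, depending on a compile-time switch. So Jon, it is not UTF-16, but your answer put me on the right track anyway, thanks.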

Answer

what is a "Unicode string" in Python? Does that mean UCS-2?

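Python's Unicode strings are stored internally either as UCS-2 (a fixed-length 16-bit representation, almost the same as UTF-16) or as UCS-4/UTF-32 (a fixed-length 32-bit representation). It is a compile-time option; on Windows it is always UTF-16, whilst many Linux distributions set UTF-32 ("wide mode") for their builds of Python.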

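You are generally not supposed to care: you will see Unicode code points as single elements in your strings, and you won't know whether they are stored as two or four bytes. If you are on a UTF-16 build and need to handle characters outside the Basic Multilingual Plane, you will be Doing It Wrong, but that is still very rare, and users who really need the extra characters should be compiling wide builds.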
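To show why characters outside the Basic Multilingual Plane are awkward in a 16-bit representation, here is a sketch that splits a code point into its UTF-16 code units. It assumes Python 3; the helper `utf16_code_units` is a hypothetical name. On the Python 2 builds discussed here, `sys.maxunicode` was 0xFFFF on narrow builds and 0x10FFFF on wide ones; Python 3.3 and later no longer have that build switch.

```python
import sys

def utf16_code_units(ch):
    """Split one character into its 16-bit UTF-16 code units."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

bmp_char = "\u20ac"         # inside the Basic Multilingual Plane
astral_char = "\U0001F600"  # outside it

print([hex(u) for u in utf16_code_units(bmp_char)])     # ['0x20ac']
print([hex(u) for u in utf16_code_units(astral_char)])  # ['0xd83d', '0xde00']

# The highest code point this interpreter can hold in one string element.
print(hex(sys.maxunicode))
```

The second character needs a surrogate pair, so a 16-bit build would expose it as two string elements instead of one.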

plain wrong, or is it?

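Yes, it is quite wrong. To be fair, I think that tutorial is rather old; it probably pre-dates wide Unicode strings, if not Unicode 3.1 (the version that introduced characters outside the Basic Multilingual Plane).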

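There is an additional source of confusion stemming from Windows's habit of using the term "Unicode" to mean, specifically, the UTF-16LE encoding that NT uses internally. People from Microsoftland may often copy this somewhat misleading habit.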
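As a rough illustration that "Unicode" is not itself an encoding, this sketch (Python 3 assumed, sample text arbitrary) encodes the same two-character string three ways. UTF-16LE is only one of several byte representations of the same code points.

```python
text = "h\u00e9"  # two code points: U+0068 and U+00E9

encodings = ["utf-8", "utf-16-le", "utf-32-le"]
encoded = {name: text.encode(name) for name in encodings}

for name in encodings:
    data = encoded[name]
    print(name, data.hex(), len(data))
# utf-8 68c3a9 3
# utf-16-le 6800e900 4
# utf-32-le 68000000e9000000 8

# Every representation decodes back to the same text.
round_trips = all(encoded[name].decode(name) == text for name in encodings)
print(round_trips)  # True
```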
