Why does unicode() use str() on my object only when no encoding is given?
I start by creating a string variable holding some non-ASCII, UTF-8 encoded data:
    >>> text = 'á'
    >>> text
    '\xc3\xa1'
    >>> text.decode('utf-8')
    u'\xe1'
Using unicode() on it raises an error...
    >>> unicode(text)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
...but if I know the encoding, I can pass it as the second argument:
    >>> unicode(text, 'utf-8')
    u'\xe1'
    >>> unicode(text, 'utf-8') == text.decode('utf-8')
    True
Now, suppose I have a class that returns this text from its __str__() method:
    >>> class ReturnsEncoded(object):
    ...     def __str__(self):
    ...         return text
    ...
    >>> r = ReturnsEncoded()
    >>> str(r)
    '\xc3\xa1'
unicode(r) seems to use str() on it, since it raises the same error as unicode(text) above:
    >>> unicode(r)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
So far, everything goes as planned!
But, as nobody would expect, unicode(r, 'utf-8') won't even try:
    >>> unicode(r, 'utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: coercing to Unicode: need string or buffer, ReturnsEncoded found
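Why? Why this inconsistent behavior? Is it a bug? Is it intended? It feels very awkward.
Solution
unicode does not guess the encoding of your text. If your object can print itself as unicode, define a __unicode__() method that returns a Unicode string.
The secret is that unicode(r) is not actually calling __str__() itself. Instead, it looks for a __unicode__() method. When the class defines none, the fallback calls __str__() and then tries to decode the result with the ascii charset. When you pass an encoding, unicode() expects the first argument to be something that can be decoded, that is, an instance of basestring.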
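A minimal sketch of the suggestion above, assuming the class always holds UTF-8 data. The class name ReturnsBoth is hypothetical and not from the question. The code targets Python 2; the version guard only keeps the file loadable on Python 3, where unicode() does not exist.

```python
# -*- coding: utf-8 -*-
import sys

text = b'\xc3\xa1'  # UTF-8 bytes for U+00E1, the same data as in the question


class ReturnsBoth(object):  # hypothetical name, for illustration only
    def __str__(self):
        # On Python 2, str() must return a byte string.
        return text

    def __unicode__(self):
        # Decode explicitly: only the class knows which encoding it uses.
        return text.decode('utf-8')


r = ReturnsBoth()
decoded = r.__unicode__()

if sys.version_info[0] == 2:
    # On Python 2, unicode(r) finds __unicode__() and never decodes as ASCII.
    assert unicode(r) == decoded
```

With __unicode__() defined, unicode(r) no longer goes through the ASCII fallback. unicode(r, 'utf-8') still raises TypeError, because r is still not a string or buffer.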
Behavior is weird because it tries to decode as ascii if I don't pass 'utf-8'. But if I pass 'utf-8' it gives a different error...
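That's because when you specify 'utf-8', it treats the first argument as a string-like object to be decoded. Without it, it treats the argument as an object to be coerced to unicode.
I don't understand the confusion. If you know that the object's text attribute will always be UTF-8 encoded, just define __unicode__() and everything will work fine.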
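If changing the class is not an option, the two steps can also be done by hand: first obtain the byte string, then decode it with the known encoding. This is only a sketch, and it assumes the bytes really are UTF-8. The guarded line is Python 2 only.

```python
import sys


class ReturnsEncoded(object):
    def __str__(self):
        # UTF-8 bytes for U+00E1, the same data as `text` in the question.
        return b'\xc3\xa1'


r = ReturnsEncoded()

# Step 1: get the byte string. Calling __str__() directly gives what
# str(r) returns on Python 2.
raw = r.__str__()

# Step 2: decode it with the encoding we know it uses.
decoded = raw.decode('utf-8')

if sys.version_info[0] == 2:
    # Python 2 only: the same two steps spelled with the built-ins.
    assert unicode(str(r), 'utf-8') == decoded
```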
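The behavior does seem confusing, but it is intentional. I reproduce here the entirety of the unicode documentation from the Python Built-In Functions documentation (for version 2.5.2, as I write this):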
    unicode([object[, encoding [, errors]]])

    Return the Unicode string version of object using one of the following modes:

    If encoding and/or errors are given, unicode() will decode the object which can either be an 8-bit string or a character buffer using the codec for encoding. The encoding parameter is a string giving the name of an encoding; if the encoding is not known, LookupError is raised. Error handling is done according to errors; this specifies the treatment of characters which are invalid in the input encoding. If errors is 'strict' (the default), a ValueError is raised on errors, while a value of 'ignore' causes errors to be silently ignored, and a value of 'replace' causes the official Unicode replacement character, U+FFFD, to be used to replace input characters which cannot be decoded. See also the codecs module.

    If no optional parameters are given, unicode() will mimic the behaviour of str() except that it returns Unicode strings instead of 8-bit strings. More precisely, if object is a Unicode string or subclass it will return that Unicode string without any additional decoding applied.

    For objects which provide a __unicode__() method, it will call this method without arguments to create a Unicode string. For all other objects, the 8-bit string version or representation is requested and then converted to a Unicode string using the codec for the default encoding in 'strict' mode.

    New in version 2.0. Changed in version 2.2: Support for __unicode__() added.
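So, when you call unicode(r, 'utf-8'), it requires an 8-bit string or a character buffer as the first argument. r is neither, and in this mode unicode() does not call __str__() to coerce it, so it raises the TypeError shown in the question. Without 'utf-8', unicode() looks for a __unicode__() method on the object; finding none, it falls back to calling __str__(), as the documentation describes, and tries to convert the result to unicode using the default codec.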
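The two modes from the documentation quoted above can be modelled in a few lines. This is a rough sketch, not the real implementation: unicode_sketch is a hypothetical helper, the default codec is assumed to be 'ascii' (the usual Python 2 default), and the byte string is written as a b'' literal so the model also runs on Python 3.

```python
def unicode_sketch(obj=b'', encoding=None, errors=None):
    """Rough model of the documented Python 2 unicode() modes (hypothetical)."""
    text_type = type(u'')

    if encoding is not None or errors is not None:
        # Mode 1: decode. Only byte strings or buffers are accepted,
        # and __str__() is never consulted.
        if not isinstance(obj, (bytes, bytearray)):
            raise TypeError('coercing to Unicode: need string or buffer, '
                            '%s found' % type(obj).__name__)
        return bytes(obj).decode(encoding or 'ascii', errors or 'strict')

    # Mode 2: mimic str(), but return a Unicode string.
    if isinstance(obj, text_type):
        return obj
    if hasattr(obj, '__unicode__'):
        return obj.__unicode__()
    if isinstance(obj, bytes):
        return obj.decode('ascii', 'strict')
    raw = obj.__str__()
    if isinstance(raw, text_type):
        return raw
    # Assumed default codec: 'ascii' in strict mode.
    return raw.decode('ascii', 'strict')


class ReturnsEncoded(object):
    def __str__(self):
        return b'\xc3\xa1'  # UTF-8 bytes, as in the question


r = ReturnsEncoded()
outcomes = []
for args in [(r,), (r, 'utf-8'), (b'\xc3\xa1', 'utf-8')]:
    try:
        outcomes.append(repr(unicode_sketch(*args)))
    except (UnicodeDecodeError, TypeError) as exc:
        outcomes.append(type(exc).__name__)
```

The first two entries of outcomes name the same two exceptions as in the question: UnicodeDecodeError when no encoding is given, TypeError when one is.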