Python: decode if it's not unicode

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/3857763/

Time: 2020-08-18 13:05:27  Source: igfitidea

Decoding if it's not unicode

Tags: python, unicode, encoding, utf-8

Asked by Manuel Ceron

I want my function to take an argument that could be a unicode object or a UTF-8 encoded string. Inside my function, I want to convert the argument to unicode. I have something like this:

def myfunction(text):
    if not isinstance(text, unicode):
        text = unicode(text, 'utf-8')

    ...

Is it possible to avoid the use of isinstance? I was looking for something more duck-typing friendly.

During my experiments with decoding, I have run into several weird behaviours of Python. For instance:

>>> u'hello'.decode('utf-8')
u'hello'
>>> u'cer\xf3n'.decode('utf-8')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 3: ordinal not in range(128)

Or

>>> u'hello'.decode('utf-8')
u'hello'
>>> unicode(u'hello', 'utf-8')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
TypeError: decoding Unicode is not supported

By the way, I'm using Python 2.6.

Accepted answer by unutbu

You could just try decoding it with the 'utf-8' codec, and if that does not work, then return the object.

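def myfunction(text):
    try:
        # Decode UTF-8 byte strings into unicode objects.
        text = unicode(text, 'utf-8')
    except TypeError:
        # text is already a unicode object; leave it unchanged.
        pass
    return text

print(myfunction(u'cer\xf3n'))
# cerón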
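
The same try/except idea can be written so that it does not depend on the Python 2 name `unicode`. The following is only a sketch: `to_text` and `TEXT_TYPE` are hypothetical names that do not appear in the original answer, and the code assumes the text type's two-argument constructor raises `TypeError` for input that is already text, as it does in the session shown in the question.

```python
TEXT_TYPE = type(u'')  # unicode on Python 2, str on Python 3


def to_text(value, encoding='utf-8'):
    """Return value as text, decoding it only if it is a byte string."""
    try:
        # Raises TypeError when value is already text (or is not bytes-like).
        return TEXT_TYPE(value, encoding)
    except TypeError:
        return value


print(to_text(b'cer\xc3\xb3n') == u'cer\xf3n')  # True
print(to_text(u'cer\xf3n') == u'cer\xf3n')      # True
```

Note that only `TypeError` is trapped here: a byte string that is not valid UTF-8 still raises `UnicodeDecodeError`, which is probably what a caller wants to see.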

When you take a unicode object and call its decode method with the 'utf-8' codec, Python first tries to convert the unicode object to a string object, and then it calls the string object's decode('utf-8') method.

The conversion from unicode object to string object fails whenever the unicode object contains non-ASCII characters, because Python 2 uses the ascii codec for this implicit conversion by default.
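
The two-step behaviour described above can be imitated explicitly. This is a rough sketch rather than what the interpreter literally executes: `decode_text_like_python2` is a made-up helper name, and it hard-codes ascii where Python 2 would consult its default encoding.

```python
def decode_text_like_python2(text, encoding='utf-8'):
    """Mimic Python 2's u'...'.decode(encoding) for text input."""
    # Step 1: the implicit conversion to a byte string uses the ascii codec.
    as_bytes = text.encode('ascii')
    # Step 2: only then is the requested codec applied.
    return as_bytes.decode(encoding)


print(decode_text_like_python2(u'hello') == u'hello')  # True

try:
    decode_text_like_python2(u'cer\xf3n')
except UnicodeEncodeError as exc:
    # The failure comes from step 1, which is why a *decode* call
    # reports an *encode* error in the question's traceback.
    print(type(exc).__name__)  # UnicodeEncodeError
```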

So, in general, never try to decode unicode objects. Or, if you must try, wrap the call in a try..except block. There may be a few codecs for which decoding unicode objects works in Python 2 (see below), but they have been removed in Python 3.

See this Python bug ticket for an interesting discussion of the issue, and also Guido van Rossum's blog:

"We are adopting a slightly different approach to codecs: while in Python 2, codecs can accept either Unicode or 8-bits as input and produce either as output, in Py3k, encoding is always a translation from a Unicode (text) string to an array of bytes, and decoding always goes the opposite direction. This means that we had to drop a few codecs that don't fit in this model, for example rot13, base64 and bz2 (those conversions are still supported, just not through the encode/decode API)."
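
As an illustration of the last remark in the quote, the conversions it names can still be reached in Python 3 through other entry points. The sketch below assumes a reasonably recent Python 3 and uses only standard-library modules; the variable names are arbitrary.

```python
import base64
import bz2
import codecs

# rot_13 is a text-to-text transform, so it goes through codecs.encode
# rather than through str.encode.
rot13 = codecs.encode(u'hello', 'rot_13')

# base64 and bz2 are bytes-to-bytes and have dedicated modules.
b64 = base64.b64encode(b'hello')
round_trip = bz2.decompress(bz2.compress(b'hello'))

print(rot13)       # uryyb
print(b64)         # b'aGVsbG8='
print(round_trip)  # b'hello'
```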

Answer by Will McCutchen

I'm not aware of any good way to avoid the isinstance check in your function, but maybe someone else will be. I can point out that the two weirdnesses you cite are because you're doing something that doesn't make sense: trying to decode into Unicode something that's already decoded into Unicode.

The first should instead look like this, which decodes the UTF-8 encoding of that string into the Unicode version:

>>> 'cer\xc3\xb3n'.decode('utf-8')
u'cer\xf3n'

And your second should look like this (not using a u'' Unicode string literal):

>>> unicode('hello', 'utf-8')
u'hello'
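
For comparison, the isinstance check this answer refers to can be phrased against the byte type rather than the text type, which lets the same source run on Python 2.6+ and Python 3. This is a sketch under that assumption; `ensure_text` is a hypothetical name, not something from the question or the answers.

```python
def ensure_text(value, encoding='utf-8'):
    """Decode byte strings; return text (and anything else) unchanged."""
    if isinstance(value, bytes):  # bytes is an alias of str on Python 2
        return value.decode(encoding)
    return value


print(ensure_text(b'cer\xc3\xb3n') == u'cer\xf3n')  # True
print(ensure_text(u'hello') == u'hello')            # True
```

Checking for bytes means text is never passed to decode, so the implicit ascii conversion discussed earlier never gets a chance to run.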