Python: How to fix "UnicodeDecodeError: 'ascii' codec can't decode byte"

Question

Asked by fisherman

as3:~/ngokevin-site# nano content/blog/20140114_test-chinese.mkd
as3:~/ngokevin-site# wok
Traceback (most recent call last):
  File "/usr/local/bin/wok", line 4, in <module>
    Engine()
  File "/usr/local/lib/python2.7/site-packages/wok/engine.py", line 104, in __init__
    self.load_pages()
  File "/usr/local/lib/python2.7/site-packages/wok/engine.py", line 238, in load_pages
    p = Page.from_file(os.path.join(root, f), self.options, self, renderer)
  File "/usr/local/lib/python2.7/site-packages/wok/page.py", line 111, in from_file
    page.meta['content'] = page.renderer.render(page.original)
  File "/usr/local/lib/python2.7/site-packages/wok/renderers.py", line 46, in render
    return markdown(plain, Markdown.plugins)
  File "/usr/local/lib/python2.7/site-packages/markdown/__init__.py", line 419, in markdown
    return md.convert(text)
  File "/usr/local/lib/python2.7/site-packages/markdown/__init__.py", line 281, in convert
    source = unicode(source)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 1: ordinal not in range(128). -- Note: Markdown only accepts unicode input!

How can I fix it?

In some other Python-based static blog apps, Chinese posts can be published successfully, for example with this app: http://github.com/vrypan/bucket3. On my site http://bc3.brite.biz/, Chinese posts are published without problems.

Answer 1

Answered by GreenAsJade

This is the classic "unicode issue". I believe that completely explaining what is happening is beyond the scope of a StackOverflow answer.

It is well explained here.

In very brief summary, you have passed something that is being interpreted as a string of bytes to something that needs to decode it into Unicode characters, but the default codec (ascii) is failing.

The presentation I pointed you to provides advice for avoiding this. Make your code a "unicode sandwich". In Python 2, the use of from __future__ import unicode_literals helps.

Update: how can the code be fixed:

OK - in your variable "source" you have some bytes. It is not clear from your question how they got in there - maybe you read them from a web form? In any case, they are not encoded with ascii, but Python is trying to convert them to unicode assuming that they are. You need to explicitly tell it what the encoding is. This means that you need to know what the encoding is! That is not always easy, and it depends entirely on where this string came from. You could experiment with some common encodings - for example UTF-8. You tell unicode() the encoding as a second parameter:

source = unicode(source, 'utf-8')
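
To make the difference concrete, here is a minimal sketch, assuming the bytes really are UTF-8. The sample text and the names raw and decode_source are made up for illustration, and the snippet is written to run on Python 3, where bytes plays the role of Python 2's str:

```python
# -*- coding: utf-8 -*-
# Hypothetical sample: the UTF-8 bytes of a short Chinese string.
raw = u"\u6d4b\u8bd5".encode("utf-8")


def decode_source(data, encoding="utf-8"):
    """Decode bytes using an explicitly named encoding.

    The default of utf-8 is an assumption; the caller has to know
    what the data was really encoded with.
    """
    return data.decode(encoding)


# Decoding with the ascii codec fails on the first byte above 0x7F.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Decoding with the right codec gives back the two characters.
text = decode_source(raw)
```

If the guess is wrong, decoding either raises UnicodeDecodeError again or silently produces the wrong characters, which is why knowing the real encoding matters more than picking a common one.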

Answer 2

Answered by fisherman

Finally I got it:

as3:/usr/local/lib/python2.7/site-packages# cat sitecustomize.py
# encoding=utf8  
import sys  

reload(sys)  
sys.setdefaultencoding('utf8')

Let me check:

as3:~/ngokevin-site# python
Python 2.7.6 (default, Dec  6 2013, 14:49:02)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.getdefaultencoding()
'utf8'
>>>

The above shows that Python's default encoding is now utf8, and the error no longer occurs.

Answer 3

Answered by Davy

In some cases, when you check your default encoding (print sys.getdefaultencoding()), it reports that you are using ASCII. Changing it to UTF-8 may still not work, depending on the content of your variable. I found another way:

import sys
reload(sys)  
sys.setdefaultencoding('Cp1252')

Answer 4

Answered by miraculixx

I find the best approach is to always convert to unicode - but this is difficult to achieve because in practice you'd have to check and convert every argument to every function and method you ever write that includes some form of string processing.

So I came up with the following approach to guarantee either unicodes or byte strings, from either input. In short, include and use the following lambdas:

# guarantee unicode string
_u = lambda t: t.decode('UTF-8', 'replace') if isinstance(t, str) else t
_uu = lambda *tt: tuple(_u(t) for t in tt) 
# guarantee byte string in UTF8 encoding
_u8 = lambda t: t.encode('UTF-8', 'replace') if isinstance(t, unicode) else t
_uu8 = lambda *tt: tuple(_u8(t) for t in tt)

Examples:

text='Some string with codes > 127, like Zürich'
utext=u'Some string with codes > 127, like Zürich'
print "==> with _u, _uu"
print _u(text), type(_u(text))
print _u(utext), type(_u(utext))
print _uu(text, utext), type(_uu(text, utext))
print "==> with u8, uu8"
print _u8(text), type(_u8(text))
print _u8(utext), type(_u8(utext))
print _uu8(text, utext), type(_uu8(text, utext))
# with % formatting, always use _u() and _uu()
print "Some unknown input %s" % _u(text)
print "Multiple inputs %s, %s" % _uu(text, text)
# but with string.format be sure to always work with unicode strings
print u"Also works with formats: {}".format(_u(text))
print u"Also works with formats: {},{}".format(*_uu(text, text))
# ... or use _u8 and _uu8, because string.format expects byte strings
print "Also works with formats: {}".format(_u8(text))
print "Also works with formats: {},{}".format(*_uu8(text, text))

Here's some more reasoning about this.

Answer 5

Answered by Alastair McCormack

tl;dr / quick fix

Don't decode/encode willy nilly
Don't assume your strings are UTF-8 encoded
Try to convert strings to Unicode strings as soon as possible in your code
Fix your locale: How to solve UnicodeDecodeError in Python 3.6?
Don't be tempted to use quick reload hacks

Unicode Zen in Python 2.x - The Long Version

Without seeing the source it's difficult to know the root cause, so I'll have to speak generally.

UnicodeDecodeError: 'ascii' codec can't decode byte generally happens when you try to convert a Python 2.x str that contains non-ASCII to a Unicode string without specifying the encoding of the original string.

In brief, Unicode strings are an entirely separate type of Python string that does not contain any encoding. They only hold Unicode code points and therefore can hold any Unicode code point from across the entire spectrum. Strings contain encoded text, be it UTF-8, UTF-16, ISO-8859-1, GBK, Big5 etc. Strings are decoded to Unicode and Unicodes are encoded to strings. Files and text data are always transferred in encoded strings.

The Markdown module authors probably use unicode() (where the exception is thrown) as a quality gate to the rest of the code - it will convert ASCII or re-wrap existing Unicode strings to a new Unicode string. The Markdown authors can't know the encoding of the incoming string, so they rely on you to decode strings to Unicode strings before passing them to Markdown.

Unicode strings can be declared in your code using the u prefix to strings. E.g.

>>> my_u = u'my ünic?dé str?ng'
>>> type(my_u)
<type 'unicode'>

Unicode strings may also come from files, databases and network modules. When this happens, you don't need to worry about the encoding.

Gotchas

Conversion from str to Unicode can happen even when you don't explicitly call unicode().

The following scenarios cause UnicodeDecodeError exceptions:

# Explicit conversion without encoding
unicode('€')

# New style format string into Unicode string
# Python will try to convert value string to Unicode first
u"The currency is: {}".format('€')

# Old style format string into Unicode string
# Python will try to convert value string to Unicode first
u'The currency is: %s' % '€'

# Append string to Unicode
# Python will try to convert string to Unicode first
u'The currency is: ' + '€'

Examples

In the following diagram, you can see how the word café has been encoded in either "UTF-8" or "Cp1252" encoding depending on the terminal type. In both examples, caf is just regular ascii. In UTF-8, é is encoded using two bytes. In "Cp1252", é is 0xE9 (which also happens to be the Unicode code point value, and that is no coincidence). The correct decode() is invoked and conversion to a Python Unicode is successful:

In this diagram, decode() is called with ascii (which is the same as calling unicode() without an encoding given). As ASCII can't contain bytes greater than 0x7F, this will throw a UnicodeDecodeError exception:
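
The two cases described above can also be reproduced in a few lines. This is only a sketch: the helper name decodes_as_ascii is made up for illustration, and the code is written so that it runs on Python 3 (where the encoded values are bytes) as well as Python 2.7:

```python
# -*- coding: utf-8 -*-
# "café", with é written as an escape to avoid any source-encoding issues
word = u"caf\u00e9"

utf8_bytes = word.encode("utf-8")      # é becomes the two bytes 0xC3 0xA9
cp1252_bytes = word.encode("cp1252")   # é becomes the single byte 0xE9

# Decoding with the matching codec round-trips to the same Unicode string.
assert utf8_bytes.decode("utf-8") == word
assert cp1252_bytes.decode("cp1252") == word


def decodes_as_ascii(data):
    """Return True if every byte in data is valid ASCII (0x00-0x7F)."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False
```

Calling decodes_as_ascii on either encoded form of café returns False, because both contain a byte above 0x7F, while the plain prefix caf passes.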

The Unicode Sandwich

It's good practice to form a Unicode sandwich in your code, where you decode all incoming data to Unicode strings, work with Unicodes, then encode to strs on the way out. This saves you from worrying about the encoding of strings in the middle of your code.
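
A rough sketch of what such a sandwich might look like for a file-to-file job is shown here. The function name shout_file, the temporary file names and the choice of cp1252 input and UTF-8 output are all assumptions made for the example:

```python
import io
import os
import tempfile


def shout_file(src_path, dst_path, in_encoding, out_encoding):
    """Upper-case a text file, decoding on the way in and encoding on the way out."""
    # Bottom slice: decode bytes to Unicode at the input boundary.
    with io.open(src_path, "r", encoding=in_encoding) as src:
        text = src.read()
    # Filling: work only with Unicode strings, no encodings in sight.
    text = text.upper()
    # Top slice: encode Unicode back to bytes at the output boundary.
    with io.open(dst_path, "w", encoding=out_encoding) as dst:
        dst.write(text)


# Usage with throwaway files: cp1252 input, UTF-8 output.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "in.txt")
dst_path = os.path.join(tmp_dir, "out.txt")
with io.open(src_path, "wb") as f:
    f.write(u"z\u00fcrich".encode("cp1252"))

shout_file(src_path, dst_path, "cp1252", "utf-8")

with io.open(dst_path, "rb") as f:
    result_bytes = f.read()
```

The encodings appear only in the two io.open calls; the processing step in the middle never needs to know them.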

Input / Decode

Source code

If you need to bake non-ASCII into your source code, just create Unicode strings by prefixing the string with a u. E.g.

u'Zürich'

To allow Python to decode your source code, you will need to add an encoding header to match the actual encoding of your file. For example, if your file was encoded as 'UTF-8', you would use:

# encoding: utf-8

This is only necessary when you have non-ASCII in your source code.

Files

Usually non-ASCII data is received from a file. The io module provides a TextIOWrapper that decodes your file on the fly, using a given encoding. You must use the correct encoding for the file - it can't be easily guessed. For example, for a UTF-8 file:

import io
with io.open("my_utf8_file.txt", "r", encoding="utf-8") as my_file:
     my_unicode_string = my_file.read()

my_unicode_string would then be suitable for passing to Markdown. If you get a UnicodeDecodeError from the read() line, then you've probably used the wrong encoding value.

CSV Files

The Python 2.7 CSV module does not support non-ASCII characters. Help is at hand, however, with https://pypi.python.org/pypi/backports.csv.

Use it like above but pass the opened file to it:

from backports import csv
import io

def read_rows():
    # yield is only valid inside a function, so the loop is wrapped in a generator
    with io.open("my_utf8_file.txt", "r", encoding="utf-8") as my_file:
        for row in csv.reader(my_file):
            yield row

Databases

Most Python database drivers can return data in Unicode, but usually require a little configuration. Always use Unicode strings for SQL queries.

MySQL

In the connection string add:

charset='utf8',
use_unicode=True

E.g.

>>> db = MySQLdb.connect(host="localhost", user='root', passwd='passwd', db='sandbox', use_unicode=True, charset="utf8")

PostgreSQL

Add:

psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)

HTTP

Web pages can be encoded in just about any encoding. The Content-type header should contain a charset field to hint at the encoding. The content can then be decoded manually against this value. Alternatively, Python-Requests returns Unicodes in response.text.
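
Decoding manually against the charset value could look roughly like this. It is a sketch rather than a complete HTTP client: the header value and body are passed in as plain arguments, the function names are made up, and the utf-8 fallback for a missing charset is an assumption:

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lower-cased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


def decode_body(body, header_value, fallback="utf-8"):
    """Decode the raw body bytes using the charset declared in the header.

    fallback is an assumption, used only when the header declares no charset.
    """
    charset = charset_from_content_type(header_value) or fallback
    return body.decode(charset)
```

For example, decode_body(b"caf\xe9", "text/html; charset=ISO-8859-1") decodes the body as Latin-1. A server can still declare the wrong charset, so a UnicodeDecodeError here usually means the header and the content disagree.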

Manually

If you must decode strings manually, you can simply do my_string.decode(encoding), where encoding is the appropriate encoding. Python 2.x supported codecs are given here: Standard Encodings. Again, if you get UnicodeDecodeError then you've probably got the wrong encoding.

The meat of the sandwich

Work with Unicodes as you would normal strs.

Output

stdout / printing

print writes through the stdout stream. Python tries to configure an encoder on stdout so that Unicodes are encoded to the console's encoding. For example, if a Linux shell's locale is en_GB.UTF-8, the output will be encoded to UTF-8. On Windows, you will be limited to an 8-bit code page.

An incorrectly configured console, such as a corrupt locale, can lead to unexpected print errors. The PYTHONIOENCODING environment variable can force the encoding for stdout.
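
One way to see the effect of PYTHONIOENCODING is to start a child interpreter with the variable set and ask it what encoding it picked for stdout. The helper name child_stdout_encoding is hypothetical, and the sketch assumes sys.executable points at a usable interpreter:

```python
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding):
    """Run a child interpreter with PYTHONIOENCODING set and report its stdout encoding."""
    # Copy the current environment and override only the one variable.
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdout.encoding)"],
        env=env,
    )
    # Encoding names are plain ASCII, so decoding the child's output is safe.
    return out.decode("ascii").strip()
```

Calling child_stdout_encoding("utf-8") should report a UTF-8 stdout even when the surrounding locale is something else, which is the behaviour the variable is meant to force.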

Files

Just like input, io.open can be used to transparently convert Unicodes to encoded byte strings.

Database

The same configuration for reading will allow Unicodes to be written directly.

Python 3

Python 3 is no more Unicode capable than Python 2.x is, however it is slightly less confused on the topic. E.g. the regular str is now a Unicode string and the old str is now bytes.

The default encoding is UTF-8, so if you .decode() a byte string without giving an encoding, Python 3 uses UTF-8 encoding. This probably fixes 50% of people's Unicode problems.

Further, open() operates in text mode by default, so it returns decoded str (Unicode ones). The encoding is derived from your locale, which tends to be UTF-8 on Un*x systems or an 8-bit code page, such as windows-1251, on Windows boxes.
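
A short Python 3 sketch of these points follows; the variable names are illustrative only. It also shows that mixing the two types raises a TypeError in Python 3 instead of triggering a hidden ascii decode:

```python
data = b"caf\xc3\xa9"           # UTF-8 bytes for "café"

text = data.decode()            # no encoding given: Python 3 assumes UTF-8
assert isinstance(text, str)    # str holds Unicode code points
assert isinstance(data, bytes)  # bytes holds encoded data

# Mixing str and bytes is an immediate TypeError rather than an implicit decode.
try:
    "price: " + data
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

Because the failure happens at the point where the types are mixed, the bug surfaces for all input, not only when a non-ASCII byte happens to arrive.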

Why you shouldn't use `sys.setdefaultencoding('utf8')`

It's a nasty hack (there's a reason you have to use reload) that will only mask problems and hinder your migration to Python 3.x. Understand the problem, fix the root cause and enjoy Unicode zen. See Why should we NOT use sys.setdefaultencoding("utf-8") in a py script? for further details.

Answer 6

Answered by Paul Bormans

In a Django (1.9.10)/Python 2.7.5 project I have frequent UnicodeDecodeError exceptions, mainly when I try to feed unicode strings to logging. I made a helper function that formats arbitrary objects to 8-bit ascii strings, replacing any characters not in the table with '?'. I think it's not the best solution, but since the default encoding is ascii (and I don't want to change it) it will do:

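from collections import Iterable  # Python 2; needed for the isinstance check below

def encode_for_logging(c, encoding='ascii'):
    if isinstance(c, basestring):
        return c.encode(encoding, 'replace')
    elif isinstance(c, Iterable):
        # Encode each element and collect the results in a list
        c_ = []
        for v in c:
            c_.append(encode_for_logging(v, encoding))
        return c_
    else:
        # Pass the encoding along instead of silently falling back to ascii
        return encode_for_logging(unicode(c), encoding)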

Answer 7

Answered by Alle Pavesi

I got the same problem with the string "Pasteler?-a Mallorca" and I solved it with:

unicode("Pasteler?-a Mallorca", 'latin-1')

Answer 8

Answered by Reihan_amn

I had the same problem, but the fixes above didn't work for me in Python 3. I followed this and it solved my problem:

import sys

enc = sys.getdefaultencoding()
file = open(menu, "r", encoding=enc)  # menu is the path of the file to read

You have to set the encoding when you are reading/writing the file.

Answer 9

Answered by Aishwarya Subramanian

"UnicodeDecodeError: 'ascii' codec can't decode byte"

Cause of this error: input_string must be unicode but str was given

"TypeError: Decoding Unicode is not supported"

Cause of this error: trying to convert unicode input_string into unicode

So first check that your input_string is str and convert to unicode if necessary:

if isinstance(input_string, str):
   input_string = unicode(input_string, 'utf-8')

Secondly, the above just changes the type but does not remove non-ASCII characters. If you want to remove non-ASCII characters:

if isinstance(input_string, str):
   input_string = input_string.decode('ascii', 'ignore').encode('ascii') #note: this removes the character and encodes back to string.

elif isinstance(input_string, unicode):
   input_string = input_string.encode('ascii', 'ignore')

Answer 10

Answered by RAFI AFRIDI

Encode converts a unicode object into a string object. I think you are trying to encode a string object. First convert your result into a unicode object and then encode that unicode object into 'utf-8'. For example:

    result = yourFunction()
    result.decode().encode('utf-8')
