Python UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3
Original source: http://stackoverflow.com/questions/18403898/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
UnicodeDecodeError: 'utf8' codec can't decode byte "0xc3"
Asked by Baz
In Python 2.7 I have this:
# -*- coding: utf-8 -*-
from nltk.corpus import abc

with open("abc.txt","w") as f:
    f.write(" ".join(i.words()))
I then try to read in this document in Python 3:
with open("abc.txt", 'r', encoding='utf-8') as f:
    f.read()
only to get:
  File "C:\Python32\lib\codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 633096: invalid continuation byte
What have I done wrong? Notepad++ seems to indicate that the document is encoded as UTF-8. Even if I convert the document to UTF-8 with Notepad++, I still get this error in Python 3, which is strange since I read many other UTF-8 encoded documents without any problems.
Answered by Baz
My guess is that your input is encoded as ISO-8859-2, which contains Ă as 0xC3. Check the encoding of your input file.
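One way to check the guess above is to try a few candidate encodings on a sample of the raw bytes. The following is a minimal Python 3 sketch, not part of the original answer; the helper name try_decodings and the sample bytes are hypothetical, chosen only to reproduce the same "invalid continuation byte" error.

```python
def try_decodings(data, encodings=("utf-8", "iso-8859-2", "latin-1")):
    """Decode a bytes sample with each candidate encoding.

    Returns a dict mapping encoding name to the decoded text, or to an
    error message if the bytes are not valid in that encoding.
    """
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError as exc:
            results[name] = "error: " + exc.reason
    return results


# Hypothetical sample: 0xC3 followed by a space. In UTF-8, 0xC3 starts a
# two-byte sequence, so the next byte must be a continuation byte (0x80-0xBF).
sample = b"caf\xc3 au lait"
outcome = try_decodings(sample)

for name, value in outcome.items():
    print(name, "->", value)
```

A successful decode does not prove that an encoding is the right one: single-byte encodings such as Latin-1 accept any byte sequence, so the decoded text still has to be checked by eye.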
Answered by Weeble
Based on the fact that your piece of Python 2.7 doesn't throw an exception, I would infer that i.words() returns a sequence of bytestrings. These are unlikely to be encoded in UTF-8; I'd guess maybe Latin-1 or something like that. You then write them to the file. No encoding happens at this point.
You probably need to convert these to unicode strings, for which you'll need to know their existing encoding, and then encode them as UTF-8 when writing the file.
For example:
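# -*- coding: utf-8 -*-
from nltk.corpus import abc
import codecs

# codecs.open encodes unicode strings as UTF-8 when they are written.
with codecs.open("abc.txt","w","utf-8") as f:
    # Decode each bytestring (assumed here to be Latin-1) to unicode before joining.
    f.write(u" ".join(codecs.decode(word,"latin-1") for word in i.words()))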
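If the file has already been written and only Python 3 is at hand, the same decode-then-encode step can be applied to the file itself. This is a rough sketch under the same unverified assumption that the existing bytes are Latin-1; the function name transcode, the file names, and the sample bytes are hypothetical.

```python
import os
import tempfile


def transcode(src_path, dst_path, src_encoding="latin-1", dst_encoding="utf-8"):
    """Read src_path as src_encoding and rewrite it to dst_path as dst_encoding."""
    # newline="" keeps line endings exactly as they are in the source file.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)
    return text


# Demonstration on a throwaway file containing Latin-1 bytes.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "abc_latin1.txt")
    dst = os.path.join(tmp, "abc_utf8.txt")
    with open(src, "wb") as f:
        f.write(b"na\xefve caf\xe9")  # "naive cafe" with accents, in Latin-1

    transcode(src, dst)

    # The rewritten file now reads cleanly as UTF-8.
    with open(dst, "r", encoding="utf-8") as f:
        round_trip = f.read()
    with open(dst, "rb") as f:
        rewritten_bytes = f.read()
```

If the source encoding guess is wrong, the output will still be valid UTF-8 but will contain the wrong characters, so it is worth inspecting a few non-ASCII words afterwards.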
Some further notes, in case there's any confusion:
- The -*- coding: utf-8 -*- line refers to the encoding used to write the Python script itself. It has no effect on the input or output of that script.
- In Python 2.7, there are two kinds of strings: bytestrings, which are sequences of bytes with an unspecified encoding, and unicode strings, which are sequences of unicode code points. Bytestrings are most common and are what you get if you use the regular "abc" string literal syntax. Unicode strings are what you get when you use the u"abc" syntax.
- In Python 2.7, if you just use the open function to open a file and write bytestrings to it, no encoding will happen. The bytes of the bytestring are written straight into the file. If you try to write unicode strings to it, you'll get an exception if they contain characters that can't be encoded by the default (ASCII) codec.
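The distinction in the notes above can be illustrated in Python 3, where the two types are named bytes and str rather than bytestring and unicode. This is only a sketch of the idea, not Python 2.7 code; the sample text and variable names are made up for illustration.

```python
import io

text = "caf\u00e9"  # a text string: a sequence of code points

# The same text produces different bytes under different encodings.
utf8_bytes = text.encode("utf-8")      # the accented letter becomes 0xC3 0xA9
latin1_bytes = text.encode("latin-1")  # the accented letter becomes 0xE9

# Writing bytes to a binary stream performs no encoding: the bytes land
# in the output exactly as given, whatever encoding produced them.
buffer = io.BytesIO()
buffer.write(latin1_bytes)
raw = buffer.getvalue()

# Encoding non-ASCII text with the ASCII codec fails, which mirrors the
# exception described above for unicode strings in Python 2.7.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

Note that 0xC3 appears here as the first byte of a valid UTF-8 pair; the error in the question arises when 0xC3 is followed by a byte that is not a continuation byte.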