Python UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3

Question

Asked by Baz

In Python 2.7 I have this:

# -*- coding: utf-8 -*-
from nltk.corpus import abc
with open("abc.txt","w") as f:
    f.write(" ".join(i.words()))

I then try to read in this document in Python 3:

with open("abc.txt", 'r', encoding='utf-8') as f:
    f.read()

only to get:

  File "C:\Python32\lib\codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 633096: invalid continuation byte

What have I done wrong? Notepad++ seems to indicate that the document is UTF-8. Even if I convert the document to UTF-8 with Notepad++, I still get this error in Python 3, which is strange since I read many other UTF-8 encoded documents without any problems.

Answer 1

Answered by Baz

My guess is that your input is encoded as ISO-8859-2, which contains Ă as 0xC3. Check the encoding of your input file.
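
One way to follow this advice is to look at the raw bytes around the reported position before guessing at an encoding. The sketch below is illustrative only: `first_decode_error` is a hypothetical helper, and the sample bytes are made up to reproduce the same error message, not taken from the asker's file.

```python
def first_decode_error(data, encoding="utf-8"):
    """Return (position, reason, context) for the first undecodable byte, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the offending byte; show a few bytes around it.
        context = data[max(0, exc.start - 10):exc.start + 10]
        return exc.start, exc.reason, context
    return None


# Hypothetical sample: in ISO-8859-2 the character "Ă" is the single byte 0xC3.
sample = "abc \u0102d".encode("iso-8859-2")

# In UTF-8, 0xC3 starts a two-byte sequence, so the following "d" is rejected.
result = first_decode_error(sample)
print(result)

# The same bytes decode cleanly once the right encoding is named.
print(sample.decode("iso-8859-2"))
```

For a real file, the bytes could be read with `open("abc.txt", "rb").read()` and passed to the same helper, using the position from the traceback as a cross-check.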

Answer 2

Answered by Weeble

Based on the fact that your piece of Python 2.7 doesn't throw an exception, I would infer that i.words() returns a sequence of bytestrings. These are unlikely to be encoded in UTF-8; I'd guess maybe Latin-1 or something like that. You then write them to the file. No encoding happens at this point.

You probably need to decode these to unicode strings, for which you'll need to know their existing encoding, and then encode them as UTF-8 when writing the file.

For example:

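# -*- coding: utf-8 -*-
from nltk.corpus import abc
import codecs

# Assumption: the undefined `i` in the question refers to the abc corpus reader.
# codecs.open encodes unicode strings as UTF-8 on write.
with codecs.open("abc.txt", "w", "utf-8") as f:
    # Decode each bytestring (assumed Latin-1) to unicode before joining.
    f.write(u" ".join(codecs.decode(word, "latin-1") for word in abc.words()))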
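
If the file has already been written and regenerating it is inconvenient, the same decode-then-encode step can be done from Python 3. This is a minimal sketch, assuming the existing bytes really are Latin-1, which is only a guess in this answer; `reencode_to_utf8` and the file names are hypothetical.

```python
import os
import tempfile


def reencode_to_utf8(src_path, dst_path, source_encoding="latin-1"):
    """Read src_path as source_encoding and write the same text to dst_path as UTF-8."""
    # newline="" keeps line endings exactly as they are in the source file.
    with open(src_path, "r", encoding=source_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


# Demonstration on a throwaway file that contains Latin-1 bytes, including 0xC3.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "latin1.txt")
    dst = os.path.join(tmp, "utf8.txt")
    with open(src, "wb") as f:
        f.write(b"caf\xe9 \xc3 la carte")
    reencode_to_utf8(src, dst)
    # The converted file can now be read with encoding="utf-8".
    with open(dst, "r", encoding="utf-8") as f:
        roundtrip = f.read()
```

Decoding as Latin-1 never raises, because every byte value maps to some character, so a clean run does not prove the guess was right. It is worth checking a few non-ASCII words in the output by eye.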

Some further notes, in case there's any confusion:

The -*- coding: utf-8 -*- line refers to the encoding used to write the Python script itself. It has no effect on the input or output of that script.
In Python 2.7, there are two kinds of strings: bytestrings, which are sequences of bytes with an unspecified encoding, and unicode strings, which are sequences of unicode code points. Bytestrings are most common and are what you get if you use the regular "abc" string literal syntax. Unicode strings are what you get when you use the u"abc" syntax.
In Python 2.7, if you just use the open function to open a file and write bytestrings to it, no encoding will happen. The bytes of the bytestring are written straight into the file. If you try to write unicode strings to it, you'll get an exception if they contain characters that can't be encoded by the default (ASCII) codec.
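
The distinction in these notes can be sketched in Python 3 terms, where `bytes` plays roughly the role of the Python 2 bytestring and `str` the role of the Python 2 unicode string. The sample word is only an illustration, not data from the corpus.

```python
text = "caf\u00e9"                      # a text string: 4 code points
latin1_bytes = text.encode("latin-1")   # b'caf\xe9'      (4 bytes)
utf8_bytes = text.encode("utf-8")       # b'caf\xc3\xa9'  (5 bytes)

# The same text produces different bytes depending on the encoding chosen,
# so bytes written to a file carry no record of which encoding produced them.
print(latin1_bytes, utf8_bytes)

# Encoding with the ASCII codec fails for any non-ASCII character, which is
# the kind of exception the last note describes.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
print(ascii_failed)
```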