Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/35028683/

Time: 2020-08-19 15:53:52  Source: igfitidea

Python3 UnicodeDecodeError with readlines() method

python python-3.x unicode tweepy sys

Asked by r_e_cur

I'm trying to create a Twitter bot that reads lines from a file and posts them. I'm using Python 3 and tweepy, via a virtualenv on my shared server space. This is the part of the code that seems to have trouble:

#!/foo/env/bin/python3

import re
import tweepy, time, sys

argfile = str(sys.argv[1])

filename=open(argfile, 'r')
f=filename.readlines()
filename.close()

This is the error I get:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xfe in position 0: ordinal not in range(128)

The error specifically points to f=filename.readlines() as the source of the error. Any idea what might be wrong? Thanks.

Answered by r_e_cur

Ended up finding a working answer for myself:

filename=open(argfile, 'rb')

This post helped me out a lot.
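
Opening the file in binary mode avoids the error because no decoding happens at all, so readlines() returns bytes objects rather than str. Here is a minimal sketch of what that means in practice. It assumes the file is actually UTF-8 (the question does not say), and a throwaway temp file with made-up contents stands in for argfile:

```python
import os
import tempfile

# Hypothetical sample data: one line containing a non-ASCII character.
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write('caf\u00e9\nline2\n'.encode('utf-8'))
    path = tmp.name

try:
    # Binary mode: no codec is involved, so nothing can fail to decode.
    with open(path, 'rb') as fh:
        raw_lines = fh.readlines()

    # Each element is bytes; decode explicitly before treating it as text.
    # 'utf-8' is an assumption about the file, not something open() detects.
    text_lines = [line.decode('utf-8') for line in raw_lines]
finally:
    os.remove(path)

print(type(raw_lines[0]).__name__, type(text_lines[0]).__name__)
```

Anything that expects text, such as posting a line as a tweet, would need the decoded str values rather than the raw bytes.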

Answered by ShadowRanger

Your default encoding appears to be ASCII, whereas the input is more than likely UTF-8. When Python hits non-ASCII bytes in the input, it throws the exception. It's not so much that readlines itself is responsible for the problem; rather, it triggers the read and decode, and the decode is what fails.
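
To check which encoding open() falls back to when none is given, one option is to ask the locale module. This is only a small sketch, and the value printed depends entirely on the machine's locale settings. The second half reproduces the message from the question by decoding a single non-ASCII byte as ASCII:

```python
import locale

# In text mode without encoding=, open() normally uses the locale's preferred
# encoding. Under a C/POSIX locale this may be reported as ASCII.
print(locale.getpreferredencoding(False))

# Decoding a byte >= 0x80 as ASCII fails in the same way the question's
# readlines() call did.
try:
    b'\xfe'.decode('ascii')
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```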

It's an easy fix though; the built-in open in Python 3 lets you provide the known encoding of an input, replacing the default (ASCII in your case) with any other recognized encoding. Providing it lets you keep reading as str (rather than the significantly different raw binary data bytes objects), while letting Python do the work of converting from raw disk bytes to true text data:

# Using with statement closes the file for us without needing to remember to close
# explicitly, and closes even when exceptions occur
with open(argfile, encoding='utf-8') as inf:
    f = inf.readlines()
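
A self-contained sketch of the difference this makes, assuming the input really is UTF-8 as guessed above. The temp file and its contents are made up for illustration, and encoding='ascii' is passed explicitly only to mimic the asker's default:

```python
import os
import tempfile

# Hypothetical input file: UTF-8 text with a non-ASCII character.
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write('na\u00efve tweet\nplain tweet\n'.encode('utf-8'))
    path = tmp.name

try:
    # Forcing ASCII mimics the asker's environment and fails on the
    # first non-ASCII byte.
    try:
        with open(path, encoding='ascii') as inf:
            inf.readlines()
        ascii_failed = False
    except UnicodeDecodeError:
        ascii_failed = True

    # Naming the real encoding lets readlines() return ordinary str lines.
    with open(path, encoding='utf-8') as inf:
        lines = inf.readlines()
finally:
    os.remove(path)

print(ascii_failed, len(lines))
```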

Answered by caleb

I think the best answer (in Python 3) is to use the errors= parameter:

with open('evil_unicode.txt', 'r', errors='replace') as f:
    lines = f.readlines()

Proof:

>>> s = b'\xe5abc\nline2\nline3'
>>> with open('evil_unicode.txt','wb') as f:
...     f.write(s)
...
16
>>> with open('evil_unicode.txt', 'r') as f:
...     lines = f.readlines()
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 0: invalid continuation byte
>>> with open('evil_unicode.txt', 'r', errors='replace') as f:
...     lines = f.readlines()
...
>>> lines
['?abc\n', 'line2\n', 'line3']
>>>

Note that errors= can be, for example, replace or ignore. Here's what ignore looks like:

>>> with open('evil_unicode.txt', 'r', errors='ignore') as f:
...     lines = f.readlines()
...
>>> lines
['abc\n', 'line2\n', 'line3']
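
The same error handlers can be tried directly on bytes.decode(), which makes them easy to compare side by side. This sketch reuses the bytes from the proof above. When decoding, 'replace' substitutes U+FFFD (the Unicode replacement character), which is presumably what shows up as ? in the transcript. 'backslashreplace' is one more handler beyond the two named above; it is available for decoding on Python 3.5 and later.

```python
data = b'\xe5abc\nline2\nline3'

# 'strict' is the default and raises UnicodeDecodeError on the bad byte.
try:
    data.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# The bad byte becomes U+FFFD.
replaced = data.decode('utf-8', errors='replace')
# The bad byte is silently dropped.
ignored = data.decode('utf-8', errors='ignore')
# The bad byte is kept as a literal backslash escape.
escaped = data.decode('utf-8', errors='backslashreplace')

print(strict_ok)
print(ascii(replaced.splitlines()))
print(ignored.splitlines())
print(escaped.splitlines())
```

Both replace and ignore lose the original byte, so when the file's real encoding is known, passing it via encoding= is likely the safer choice.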