Python3 UnicodeDecodeError with readlines() method
Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): Stack Overflow
Original: http://stackoverflow.com/questions/35028683/
Asked by r_e_cur
I'm trying to create a Twitter bot that reads lines from a file and posts them. I'm using Python3 and tweepy, via a virtualenv on my shared server space. This is the part of the code that seems to have trouble:
#!/foo/env/bin/python3
import re
import tweepy, time, sys
argfile = str(sys.argv[1])
filename=open(argfile, 'r')
f=filename.readlines()
filename.close()
This is the error I get:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfe in position 0: ordinal not in range(128)
The error specifically points to f=filename.readlines() as the source of the error. Any idea what might be wrong? Thanks.
Answered by ShadowRanger
Your default encoding appears to be ASCII, whereas the input is more than likely UTF-8. When Python hits non-ASCII bytes in the input, it throws the exception. It's not so much that readlines itself is responsible for the problem; rather, it triggers the read+decode, and the decode is failing.
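To see which default is in play, a small check along these lines may help. It is only a sketch: the helper names default_text_encoding and fails_as_ascii are made up for illustration, and it assumes that locale.getpreferredencoding(False) reflects what open() uses in text mode when no encoding is given, which holds in most configurations.

```python
import locale


def default_text_encoding():
    # Encoding that open() typically falls back to in text mode
    # when no encoding= argument is passed.
    return locale.getpreferredencoding(False)


def fails_as_ascii(data):
    # True if the bytes cannot be decoded with the ASCII codec,
    # i.e. the same failure that readlines() surfaces.
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return True
    return False


print(default_text_encoding())     # varies by system and locale settings
print(fails_as_ascii(b"\xfeabc"))  # byte 0xfe is outside the ASCII range
```

If the first line prints an ASCII alias such as ANSI_X3.4-1968, the environment (often an unset or "C" locale on a shared server) is the likely cause.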
It's an easy fix though; the built-in open in Python 3 lets you provide the known encoding of an input, replacing the default (ASCII in your case) with any other recognized encoding. Providing it lets you keep reading as str (rather than the significantly different raw binary data bytes objects), while letting Python do the work of converting from raw disk bytes to true text data:
# Using with statement closes the file for us without needing to remember to close
# explicitly, and closes even when exceptions occur
with open(argfile, encoding='utf-8') as inf:
    f = inf.readlines()
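The right value for encoding= depends on the file, so 'utf-8' here is an assumption worth checking. A leading 0xfe byte, as in the question's traceback, is never valid in UTF-8 and may instead be the start of a UTF-16 byte order mark. The sketch below guesses the encoding from a byte order mark and falls back to UTF-8; sniff_bom_encoding and read_lines are hypothetical helper names, not part of any library, and a file without a byte order mark can still be in some other encoding.

```python
import codecs


def sniff_bom_encoding(path, default="utf-8"):
    """Guess an encoding from a leading byte order mark, else return default."""
    with open(path, "rb") as raw:
        head = raw.read(4)
    # Check UTF-32 first: the UTF-32 LE mark begins with the UTF-16 LE mark.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # decodes UTF-8 and drops the mark
    return default


def read_lines(path):
    # Open in text mode with the guessed encoding so readlines() returns str.
    with open(path, encoding=sniff_bom_encoding(path)) as inf:
        return inf.readlines()
```

Usage would be f = read_lines(argfile) in place of the open/readlines pair.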
Answered by caleb
I think the best answer (in Python 3) is to use the errors= parameter:
with open('evil_unicode.txt', 'r', errors='replace') as f:
    lines = f.readlines()
Proof:
>>> s = b'\xe5abc\nline2\nline3'
>>> with open('evil_unicode.txt','wb') as f:
...     f.write(s)
...
16
>>> with open('evil_unicode.txt', 'r') as f:
...     lines = f.readlines()
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 0: invalid continuation byte
>>> with open('evil_unicode.txt', 'r', errors='replace') as f:
...     lines = f.readlines()
...
>>> lines
['�abc\n', 'line2\n', 'line3']
>>>
Note that errors= can be replace or ignore. Here's what ignore looks like:
>>> with open('evil_unicode.txt', 'r', errors='ignore') as f:
...     lines = f.readlines()
...
>>> lines
['abc\n', 'line2\n', 'line3']
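To compare the handlers side by side without touching the disk, the same bytes can be decoded directly. This is a minimal sketch: decode_lines is a made-up helper, and backslashreplace is an additional standard handler (usable for decoding in Python 3.5 and later) that the answer above does not mention.

```python
RAW = b'\xe5abc\nline2\nline3'


def decode_lines(data, handler):
    # Decode as UTF-8 with the given error handler, then split into lines
    # while keeping line endings, mirroring what readlines() returns.
    return data.decode('utf-8', errors=handler).splitlines(keepends=True)


for handler in ('replace', 'ignore', 'backslashreplace'):
    print(handler, decode_lines(RAW, handler))
```

With replace the bad byte becomes U+FFFD (the replacement character), with ignore it is dropped silently, and with backslashreplace it is kept as the literal text \xe5, which can be handy when the lost byte needs to stay visible.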