Python 'utf-8' codec can't decode byte 0x80

Question

Asked by Ehab AlBadawy

I'm trying to download a BVLC-trained model and I'm stuck with this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 110: invalid start byte

I think it's caused by the following function (complete code):

  # Closure-d function for checking SHA1.
  def model_checks_out(filename=model_filename, sha1=frontmatter['sha1']):
      with open(filename, 'r') as f:
          return hashlib.sha1(f.read()).hexdigest() == sha1

Any idea how to fix this?

Answer 1

Answered by Martijn Pieters

You are opening a file that is not UTF-8 encoded, while the default encoding for your system is set to UTF-8.
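As a minimal sketch of why the decode fails: 0x80 is a UTF-8 continuation byte, so it can never begin a character. The byte string below is made up for illustration and is not taken from the model file.

```python
# Hypothetical sample data: binary content that happens to start with 0x80.
data = b"\x80\x03binary payload"

try:
    # This is what text-mode reading does behind the scenes on a UTF-8 system.
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)
```

The printed message has the same shape as the one in the question, only with a different position.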

Since you are calculating a SHA1 hash, you should read the data as binary instead. The hashlib functions require you to pass in bytes:

with open(filename, 'rb') as f:
    return hashlib.sha1(f.read()).hexdigest() == sha1

Note the addition of 'b' in the file mode.

See the open() documentation:

mode is an optional string that specifies the mode in which the file is opened. It defaults to 'r' which means open for reading in text mode. [...] In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding. (For reading and writing raw bytes use binary mode and leave encoding unspecified.)
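To make the quoted behaviour concrete, here is a small sketch that compares the two modes on a temporary file. The file contents and variable names are assumptions chosen for illustration.

```python
import locale
import os
import tempfile

# The encoding that open() falls back to in text mode when none is given.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

# Write a small UTF-8 encoded file to experiment with (illustrative content).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write("héllo".encode("utf-8"))
    path = tmp.name

# Binary mode: no decoding happens, read() returns bytes.
with open(path, "rb") as f:
    raw = f.read()

# Text mode with an explicit encoding: read() returns str.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

os.remove(path)
print(type(raw).__name__, type(text).__name__)
```

Passing encoding explicitly in text mode avoids depending on the platform default, but for hashing the binary mode is the right choice because no decoding is wanted at all.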

and from the hashlib module documentation:

You can now feed this object with bytes-like objects (normally bytes) using the update() method.
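Since update() accepts bytes incrementally, a large model file does not have to be read into memory in one call. The following is a sketch under that assumption; the helper name sha1_of_file and the chunk_size parameter are hypothetical and not part of the original script.

```python
import hashlib

def sha1_of_file(path, chunk_size=1 << 20):
    """Return the SHA1 hex digest of a file, reading it in binary chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:  # binary mode: read() yields bytes
        # iter() with a sentinel keeps calling f.read() until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The check in the question could then be written as sha1_of_file(filename) == sha1, which gives the same digest as hashing f.read() in one go.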

Answer 2

Answered by DSM

You didn't open the file in binary mode, so f.read() is trying to read the file as a UTF-8-encoded text file, which doesn't seem to be working. But since we take the hash of bytes, not of strings, it doesn't matter what the encoding is, or even whether the file is text at all: just open it, and then read it, as a binary file.

>>> with open("test.h5.bz2","r") as f: print(hashlib.sha1(f.read()).hexdigest())
Traceback (most recent call last):
  File "<ipython-input-3-fdba09d5390b>", line 1, in <module>
    with open("test.h5.bz2","r") as f: print(hashlib.sha1(f.read()).hexdigest())
  File "/home/dsm/sys/pys/Python-3.5.1-bin/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb8 in position 10: invalid start byte

but

>>> with open("test.h5.bz2","rb") as f: print(hashlib.sha1(f.read()).hexdigest())
21bd89480061c80f347e34594e71c6943ca11325
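The point that hashes are taken over bytes rather than strings can be checked without any file. This is a small sketch, assuming Python 3; the variable names are illustrative.

```python
import hashlib

# Passing a str is rejected: hashlib only accepts bytes-like objects.
try:
    hashlib.sha1("abc")
    rejected = False
except TypeError:
    rejected = True

# Encoding the string first (or reading the file in 'rb' mode) works.
digest = hashlib.sha1("abc".encode("utf-8")).hexdigest()
print(rejected, digest)
```

So even if the text-mode read had succeeded, the resulting str would still have had to be turned back into bytes before hashing.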

Answer 3

Answered by 4F2E4A2E

I couldn't find a single hint in the documentation or the source code, so I have no clue why, but adding the b character to the mode (I guess it stands for binary) works (TensorFlow version 1.1.0):

image_data = tf.gfile.FastGFile(filename, 'rb').read()

For more information, check out: gfile
