
Note: this page reproduces a popular Stack Overflow question and its answers, licensed under CC BY-SA 4.0. You are free to use and share it, but you must follow the same license and attribute it to the original authors on Stack Overflow. Original question: http://stackoverflow.com/questions/16813267/

Date: 2020-08-18 23:43:44  Source: igfitidea

Python gzip refuses to read uncompressed file

Tags: python, gzip

Asked by mok0

I seem to remember that the Python gzip module previously allowed you to read non-gzipped files transparently. This was really useful, as it let you read an input file whether or not it was gzipped. You simply didn't have to worry about it.

Now, I get an IOError exception (in Python 2.7.5):

Traceback (most recent call last):
  File "tst.py", line 14, in <module>
    rec = fd.readline()
  File "/sw/lib/python2.7/gzip.py", line 455, in readline
    c = self.read(readsize)
  File "/sw/lib/python2.7/gzip.py", line 261, in read
    self._read(readsize)
  File "/sw/lib/python2.7/gzip.py", line 296, in _read
    self._read_gzip_header()
  File "/sw/lib/python2.7/gzip.py", line 190, in _read_gzip_header
    raise IOError, 'Not a gzipped file'
IOError: Not a gzipped file

If anyone has a neat trick, I'd like to hear about it. Yes, I know how to catch the exception, but I find it rather clunky to first read a line, then close the file and open it again.


Answered by synthesizerpatel

The best solution for this would be to use something like https://github.com/ahupp/python-magic with libmagic. You simply cannot avoid reading at least a header to identify a file (unless you implicitly trust file extensions).

If you're feeling spartan, the magic number that identifies gzip(1) files is the first two bytes, 0x1f 0x8b.

In [1]: f = open('foo.html.gz', 'rb')  # binary mode, so the bytes come back unmodified
In [2]: f.read(2)
Out[2]: b'\x1f\x8b'

gzip.open is just a wrapper around GzipFile. You could write a function like this, which returns the correct type of object depending on what the source is, without having to open the file twice:

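#!/usr/bin/python

import gzip

def opener(filename):
    # Open in binary mode so the magic bytes are compared unmodified.
    f = open(filename, 'rb')
    # gzip streams start with the two magic bytes 0x1f 0x8b.
    # A bytes literal is needed here; a str literal never matches in Python 3.
    is_gzip = (f.read(2) == b'\x1f\x8b')
    f.seek(0)  # rewind so the reader sees the file from the start
    if is_gzip:
        return gzip.GzipFile(fileobj=f)
    return f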

Answered by Mark Adler

Read the first four bytes. If the first three are 0x1f, 0x8b, 0x08, and the high three bits of the fourth byte are zero, then fire up gzip decompression starting with those four bytes. Otherwise, write out the four bytes and continue to read transparently.

You should still keep the clunky solution as a fallback: if the gzip read fails anyway, back up and read the file transparently. It should be quite unlikely, though, for the first four bytes to mimic a gzip header so well without the file actually being gzip.
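
The four-byte check and the fallback described above could be sketched roughly as follows. This is only an illustration for Python 3, not code from the answer: the names looks_like_gzip and read_maybe_gzip are hypothetical, and the sketch assumes the file is small enough to read into memory at once.

```python
import gzip
import zlib

def looks_like_gzip(header):
    """Return True if the first four bytes resemble a gzip header (hypothetical helper)."""
    return (
        len(header) >= 4
        and header[:3] == b'\x1f\x8b\x08'   # magic bytes plus the deflate method byte
        and (header[3] & 0xe0) == 0         # high three flag bits must be zero
    )

def read_maybe_gzip(path):
    """Return the file contents, decompressing only if the data looks gzipped."""
    # Assumption: the whole file fits in memory.
    with open(path, 'rb') as f:
        raw = f.read()
    if looks_like_gzip(raw[:4]):
        try:
            return gzip.decompress(raw)
        except (OSError, EOFError, zlib.error):
            # The header matched by coincidence; fall back to the raw bytes.
            pass
    return raw
```

Because the raw bytes are already in memory, the fallback does not need to reopen the file; a streaming version would have to seek back to the start instead.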

Answered by Rob Flickenger

Maybe you're thinking of zless or zgrep, which will open compressed or uncompressed files without complaining.


Can you trust that the file name ends in .gz?


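import gzip

# Trust the extension: use gzip.open for .gz names, plain open otherwise.
if file_name.endswith('.gz'):
    opener = gzip.open
else:
    opener = open

# In Python 3, 'rt' makes both openers return text (gzip.open defaults to binary).
with opener(file_name, 'rt') as f:
    ...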

Answered by bulletmark

You can iterate over files transparently using fileinput.input(files, openhook=fileinput.hook_compressed). Note that hook_compressed picks the decompressor from the file extension (.gz or .bz2), not from the file contents.
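
A minimal sketch of that approach for Python 3 might look like this. The function name iter_lines and the UTF-8 default are assumptions made for illustration. Opening in 'rb' mode and decoding by hand is a design choice here, so that compressed and plain files yield the same type regardless of Python version.

```python
import fileinput

def iter_lines(paths, encoding='utf-8'):
    """Yield text lines from plain, .gz, or .bz2 files (hypothetical helper)."""
    # mode='rb' makes every file yield bytes, compressed or not.
    with fileinput.input(files=paths, mode='rb',
                         openhook=fileinput.hook_compressed) as stream:
        for raw in stream:
            yield raw.decode(encoding)
```

For example, `for line in iter_lines(['access.log', 'access.log.1.gz']): ...` would walk both files in order, where the file names are placeholders.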