How to tell if a file is gzip compressed in Python?

Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Stack Overflow original: http://stackoverflow.com/questions/3703276/

Date: 2020-08-18 12:20:45  Source: igfitidea

How to tell if a file is gzip compressed?

Tags: python, compression, gzip

Asked by Ryan Gabbard

I have a Python program which is going to take text files as input. However, some of these files may be gzip compressed.

Is there a cross-platform, usable from Python way to determine if a file is gzip compressed or not?

Is the following reliable or could an ordinary text file 'accidentally' look gzip-like enough for me to get false positives?

import gzip

try:
    gzip.GzipFile(filename, 'r')
    # compressed
    # ...
except:
    # not compressed
    # ...
    pass

Accepted answer by Ryan Gabbard

The magic number for gzip compressed files is 1f 8b. Although testing for this is not 100% reliable, it is highly unlikely that "ordinary text files" start with those two bytes; in UTF-8 that byte sequence is not even legal.

Usually, though, gzip compressed files carry the suffix .gz. Even gzip(1) itself won't unpack files without it unless you --force it to. You could conceivably use that, but you'd still have to deal with a possible IOError (which you have to do in any case).

One problem with your approach is that gzip.GzipFile() will not throw an exception if you feed it an uncompressed file. Only a later read() will. This means that you would probably have to implement some of your program logic twice. Ugly.
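
To make the point about the late failure concrete, here is a minimal sketch. The temporary file and the variable names (plain_path, constructed_ok, read_failed) are made up for illustration; only gzip.GzipFile() and read() come from the discussion above.

```python
import gzip
import os
import tempfile

# Write a small plain-text file to test against (hypothetical sample data).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("just ordinary text\n")
    plain_path = tmp.name

constructed_ok = False
read_failed = False
try:
    f = gzip.GzipFile(plain_path, "r")  # no header check happens here
    constructed_ok = True
    try:
        f.read(1)                       # the gzip header is parsed on first read
    except OSError:                     # "Not a gzipped file"
        read_failed = True
    finally:
        f.close()
finally:
    os.remove(plain_path)

print(constructed_ok, read_failed)
```

So a try/except wrapped only around the constructor would classify every file as compressed; the check has to include at least one read.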

Answer by David Ries

Import the mimetypes module. It can automatically guess what kind of file you have, and whether it is compressed. Note that it guesses from the file name, not from the file contents.

For example:

mimetypes.guess_type('blabla.txt.gz')

returns:

('text/plain', 'gzip')

Answer by ewr2san

This doesn't seem to work well in Python 3:

import mimetypes

filename = "./datasets/test"

def file_type(filename):
    # guess_type() looks only at the file name, not at the file contents
    return mimetypes.guess_type(filename)

print(file_type(filename))

This returns (None, None), but the Unix "file" command reports:

:~> file datasets/test
datasets/test: gzip compressed data, was "iostat_collection", from Unix, last modified: Thu Jan 29 07:09:34 2015
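
The likely explanation is that mimetypes.guess_type() only inspects the file name, and "test" has no .gz suffix. A small sketch that contrasts the two checks is below; the temporary directory, the file contents, and the names by_name and by_content are assumptions made up for this illustration.

```python
import gzip
import mimetypes
import os
import tempfile

# Create a gzip file whose name has no ".gz" suffix (hypothetical test data).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "test")
with gzip.open(path, "wb") as f:
    f.write(b"iostat output would go here\n")

# guess_type() only looks at the name, so it has nothing to go on.
by_name = mimetypes.guess_type(path)

# Reading the first two bytes inspects the actual contents.
with open(path, "rb") as f:
    by_content = f.read(2) == b"\x1f\x8b"

os.remove(path)
os.rmdir(tmp_dir)

print(by_name, by_content)
```

In other words, mimetypes is only as good as the naming convention of the input files, while the magic-number check works on the bytes themselves.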

Answer by themaninthewoods

"Is there a cross-platform, usable from Python way to determine if a file is gzip compressed or not?"

The accepted answer got me 90% of the way to a pretty reliable solution (test whether the first two bytes are 1f 8b), but did not show how to actually do this in Python. Here is one possible way:

import binascii

def is_gz_file(filepath):
    with open(filepath, 'rb') as test_f:
        return binascii.hexlify(test_f.read(2)) == b'1f8b'
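
Building on the same magic-number test, one way to serve the original use case (text files that may or may not be compressed) could look like the sketch below. The helper name open_text, the GZIP_MAGIC constant, and the UTF-8 default are assumptions for illustration, not something from the answers here.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

def open_text(filepath, encoding="utf-8"):
    """Open filepath for reading text, decompressing it if it looks like gzip."""
    with open(filepath, "rb") as probe:
        looks_gzipped = probe.read(2) == GZIP_MAGIC
    if looks_gzipped:
        # "rt" makes gzip.open() return decoded text instead of bytes.
        return gzip.open(filepath, "rt", encoding=encoding)
    return open(filepath, "r", encoding=encoding)
```

The caller can then write "with open_text(path) as f:" and iterate over lines the same way in both cases. A file that merely starts with 1f 8b but is not valid gzip would still fail on the first read, so the read should stay inside a try block.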

Answer by Dennis

gzip itself will raise an OSError on read if it's not a gzipped file.

>>> with gzip.open('README.md', 'rb') as f:
...     f.read()
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Users/dennis/.asdf/installs/python/3.6.6/lib/python3.6/gzip.py", line 276, in read
    return self._buffer.read(size)
  File "/Users/dennis/.asdf/installs/python/3.6.6/lib/python3.6/gzip.py", line 463, in read
    if not self._read_gzip_header():
  File "/Users/dennis/.asdf/installs/python/3.6.6/lib/python3.6/gzip.py", line 411, in _read_gzip_header
    raise OSError('Not a gzipped file (%r)' % magic)
OSError: Not a gzipped file (b'# ')

You can combine this approach with some others to increase confidence, such as checking the mimetype, looking for a magic number in the file header (see other answers for an example), and checking the extension.

import pathlib

if '.gz' in pathlib.Path(filepath).suffixes:
    # some more inexpensive checks until confident we can attempt to decompress
    # ...
    try:
        ...  # attempt to decompress
    except OSError as e:
        ...  # not a gzipped file after all

Answer by winni2k

As of Python 3.7, this works:

import gzip
with gzip.open(input_file, 'r') as fh:
    try:
        fh.read(1)
    except OSError:
        print('input_file is not a valid gzip file by OSError')

As of Python 3.8, this also works:

import gzip
with gzip.open(input_file, 'r') as fh:
    try:
        fh.read(1)
    except gzip.BadGzipFile:
        print('input_file is not a valid gzip file by BadGzipFile')
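
Since gzip.BadGzipFile is a subclass of OSError, catching OSError should cover both variants. A minimal sketch wrapping that in a function follows; the name is_valid_gzip is made up here. Note the assumptions: it only tests that gzip accepts the start of the file, it also returns False for files that cannot be opened at all (a missing file raises an OSError subclass too), and an empty file passes because gzip reads it as empty data rather than raising.

```python
import gzip

def is_valid_gzip(filepath):
    """Return False if gzip refuses to read the start of filepath."""
    try:
        with gzip.open(filepath, "rb") as fh:
            fh.read(1)  # forces the gzip header to be parsed
        return True
    except OSError:
        # Also catches gzip.BadGzipFile on Python 3.8+, which subclasses OSError.
        return False
```

Other failure modes, such as data truncated further into the file, are not covered by reading a single byte.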