Python: getting the uncompressed size of a .gz file

Question

Asked by Paul Oyster

Using gzip, tell() returns the offset in the uncompressed file.
In order to show a progress bar, I want to know the original (uncompressed) size of the file.
Is there an easy way to find out?

Answer 1

Accepted answer by Jorge Israel Peña

The gzip format specifies a field called ISIZE that:

This contains the size of the original (uncompressed) input data modulo 2^32.

In gzip.py, which I assume is what you're using for gzip support, there is a method called _read_eof defined as such:

def _read_eof(self):
    # We've read to the end of the file, so we have to rewind in order
    # to reread the 8 bytes containing the CRC and the file size.
    # We check the that the computed CRC and size of the
    # uncompressed data matches the stored values.  Note that the size
    # stored is the true file size mod 2**32.
    self.fileobj.seek(-8, 1)
    crc32 = read32(self.fileobj)
    isize = U32(read32(self.fileobj))   # may exceed 2GB
    if U32(crc32) != U32(self.crc):
        raise IOError, "CRC check failed"
    elif isize != LOWU32(self.size):
        raise IOError, "Incorrect length of data produced"

There you can see that the ISIZE field is being read, but only to compare it to self.size for error detection. This should mean that GzipFile.size stores the actual uncompressed size. However, I think it's not exposed publicly, so you might have to hack it in to expose it. Not so sure, sorry.

I just looked all of this up right now, and I haven't tried it, so I could be wrong. I hope this is of some use to you. Sorry if I misunderstood your question.

Answer 2

Answered by Brice M. Dempsey

The uncompressed size is stored in the last 4 bytes of the gzip file. We can read the binary data and convert it to an int. (This only works for files whose uncompressed size is under 4 GB.)

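import struct

def getuncompressedsize(filename):
    with open(filename, 'rb') as f:
        # ISIZE is the last 4 bytes of the file: seek 4 bytes back from the end.
        f.seek(-4, 2)
        # '<I' reads a little-endian unsigned 32-bit int, as the gzip format stores it.
        return struct.unpack('<I', f.read(4))[0]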

Answer 3

Answered by yk4ever

Unix way: run "gunzip -l file.gz" via subprocess (for example subprocess.check_output) or os.popen, then capture and parse its output.

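A minimal sketch of that approach, assuming a gunzip binary on PATH whose "-l" output uses the usual column header (compressed, uncompressed, ratio, uncompressed_name). The function names here are made up for illustration and are not part of any library.

```python
import subprocess

def parse_gzip_list_output(text):
    """Return the 'uncompressed' column from `gunzip -l` output for one file."""
    lines = [line for line in text.splitlines() if line.strip()]
    # lines[0] is the header row; lines[1] holds the values for the file.
    header = lines[0].split()
    values = lines[1].split()
    return int(values[header.index("uncompressed")])

def uncompressed_size_via_gunzip(path):
    """Run `gunzip -l path` and parse the size it reports (assumes gunzip is installed)."""
    result = subprocess.run(
        ["gunzip", "-l", path], capture_output=True, text=True, check=True
    )
    return parse_gzip_list_output(result.stdout)
```

Keeping the parsing separate from the subprocess call makes it easy to check against a saved sample of the tool's output.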

Answer 4

Answered by John La Rooy

The last 4 bytes of the .gz file hold the original size of the file.

Answer 5

Answered by norok2

I am not sure about performance, but this could be achieved without knowing gzip magic by using:

import gzip
import io

with gzip.open(filepath, 'rb') as file_obj:
    file_size = file_obj.seek(0, io.SEEK_END)

This should also work for other (compressed) stream readers like bz2 or the plain open.

EDIT: as suggested in the comments, the 2 in the seek call was replaced by io.SEEK_END, which is definitely more readable and probably more future-proof.

EDIT: Works only in Python 3.

Answer 6

Answered by Mark Adler

Despite what the other answers say, the last four bytes are not a reliable way to get the uncompressed length of a gzip file. First, there may be multiple members in the gzip file, so that would only be the length of the last member. Second, the length may be more than 4 GB, in which case the last four bytes represent the length modulo 2³². Not the length.

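The multi-member case can be reproduced with a small sketch. It builds two gzip members in memory with gzip.compress and concatenates them; trailer_size is a hypothetical helper that reads the last four bytes as a little-endian unsigned integer.

```python
import gzip
import struct

def trailer_size(data):
    """Read ISIZE: the last 4 bytes of the gzip data, little-endian."""
    return struct.unpack("<I", data[-4:])[0]

first = gzip.compress(b"a" * 1000)
second = gzip.compress(b"b" * 10)
combined = first + second  # concatenated members still form a valid gzip stream

print(trailer_size(first))             # 1000
print(trailer_size(combined))          # 10: only the last member's length
print(len(gzip.decompress(combined)))  # 1010: the real uncompressed length
```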

However, for what you want, there is no need to get the uncompressed length. You can instead base your progress bar on the amount of input consumed, as compared to the length of the gzip file, which is readily obtained. For typical homogeneous data, that progress bar would show exactly the same thing as a progress bar based instead on the uncompressed data.

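One way this might look in Python is sketched below. The function name read_with_progress and the chunk size are assumptions made for illustration. The raw file is opened separately so its position can be compared with the compressed file's size; because the gzip module reads ahead in buffers, the reported fraction can run slightly ahead of what has actually been decompressed.

```python
import gzip
import os

def read_with_progress(path, chunk_size=64 * 1024):
    """Yield (chunk, fraction) pairs while decompressing path.

    fraction is the share of the compressed file consumed so far.
    """
    total = os.path.getsize(path)
    with open(path, "rb") as raw, gzip.GzipFile(fileobj=raw) as gz:
        while True:
            chunk = gz.read(chunk_size)
            if not chunk:
                break
            # raw.tell() is the position in the compressed file.
            fraction = raw.tell() / total if total else 1.0
            yield chunk, fraction
```

A caller would loop over the pairs, process each chunk, and update the bar with the fraction.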

Answer 7

Answered by Noel Burton-Krahn

import gzip

f = gzip.open(filename)
# kludge - report the compressed file position so progress bars
# don't go to 400%
f.tell = f.fileobj.tell

Answer 8

Answered by Matt Anderson

Looking at the source for the gzip module, I see that the underlying file object for GzipFile seems to be fileobj. So:

mygzipfile = gzip.GzipFile()
...
mygzipfile.fileobj.tell()

?

Maybe it would be good to do some sanity checking before doing that, like checking that the attribute exists with hasattr.

Not exactly a public API, but...

Answer 9

Answered by Guilherme Salgado

GzipFile.size stores the uncompressed size, but it's only incremented when you read the file, so you should prefer len(fd.read()) instead of the non-public GzipFile.size.

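Since len(fd.read()) holds the whole decompressed file in memory, a chunked variant may be preferable for large files. The sketch below uses the hypothetical name uncompressed_size and an arbitrary 1 MiB chunk size. It still has to decompress everything, so it is slow on big inputs, but it does not depend on the trailer.

```python
import gzip

def uncompressed_size(path, chunk_size=1024 * 1024):
    """Decompress the whole file in chunks and return the total byte count."""
    total = 0
    with gzip.open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total
```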

Answer 10

Answered by user2165857

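import struct

# gzip.read32 was an undocumented helper that recent Python versions no longer
# provide, and calling it on a GzipFile would read decompressed data.
# Read the ISIZE trailer from the raw file instead.
with open("input.gz", "rb") as raw:
    raw.seek(-4, 2)  # last 4 bytes: uncompressed size modulo 2**32
    size = struct.unpack("<I", raw.read(4))[0]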
