Original question: http://stackoverflow.com/questions/1704458/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverFlow
Get uncompressed size of a .gz file in python
Asked by Paul Oyster
Using gzip, tell() returns the offset in the uncompressed file.
In order to show a progress bar, I want to know the original (uncompressed) size of the file.
Is there an easy way to find out?
Accepted answer by Jorge Israel Peña
The gzip format specifies a field called ISIZE, described as follows:
This contains the size of the original (uncompressed) input data modulo 2^32.
In gzip.py, which I assume is what you're using for gzip support, there is a method called _read_eof, defined as follows:
def _read_eof(self):
    # We've read to the end of the file, so we have to rewind in order
    # to reread the 8 bytes containing the CRC and the file size.
    # We check the that the computed CRC and size of the
    # uncompressed data matches the stored values. Note that the size
    # stored is the true file size mod 2**32.
    self.fileobj.seek(-8, 1)
    crc32 = read32(self.fileobj)
    isize = U32(read32(self.fileobj)) # may exceed 2GB
    if U32(crc32) != U32(self.crc):
        raise IOError, "CRC check failed"
    elif isize != LOWU32(self.size):
        raise IOError, "Incorrect length of data produced"
There you can see that the ISIZE field is being read, but only to compare it to self.size for error detection. This should mean that GzipFile.size stores the actual uncompressed size. However, I think it's not exposed publicly, so you might have to hack it in to expose it. Not so sure, sorry.
I just looked all of this up right now, and I haven't tried it so I could be wrong. I hope this is of some use to you. Sorry if I misunderstood your question.
Answer by Brice M. Dempsey
Uncompressed size is stored in the last 4 bytes of the gzip file. We can read the binary data and convert it to an int. (This will only work for files under 4GB)
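import struct

def getuncompressedsize(filename):
    with open(filename, 'rb') as f:
        f.seek(-4, 2)  # ISIZE occupies the last 4 bytes of the file
        # '<I' = little-endian unsigned 32-bit, the byte order gzip uses
        return struct.unpack('<I', f.read(4))[0]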
Answer by yk4ever
Unix way: use "gunzip -l file.gz" via subprocess.call / os.popen, capture and parse its output.
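A rough sketch of that approach, assuming a gzip binary that supports -l (GNU gzip does) is on the PATH. It calls "gzip -l", which prints the same listing as "gunzip -l"; the function name uncompressed_size_via_gzip is made up for illustration. It also assumes the listing is a header line followed by one data line whose second column is the uncompressed size.

```python
import subprocess

def uncompressed_size_via_gzip(path):
    """Return the uncompressed size that `gzip -l` reports for `path`."""
    # Assumption: `gzip -l` is available; minimal gzip builds may not support -l.
    result = subprocess.run(
        ["gzip", "-l", path],
        capture_output=True, text=True, check=True,
    )
    # Expected layout: a header line, then one data line with the columns
    #   compressed  uncompressed  ratio  uncompressed_name
    data_line = result.stdout.strip().splitlines()[-1]
    return int(data_line.split()[1])
```

This spawns a process per file, so it is probably only worth it when you cannot or do not want to read the file yourself.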
Answer by John La Rooy
The last 4 bytes of the .gz hold the original size of the file
Answer by norok2
I am not sure about performance, but this could be achieved without knowing gzip magic by using:
import gzip
import io
with gzip.open(filepath, 'rb') as file_obj:
    file_size = file_obj.seek(0, io.SEEK_END)
This should also work for other (compressed) stream readers like bz2 or the plain open.
EDIT: as suggested in the comments, 2 in the second line was replaced by io.SEEK_END, which is definitely more readable and probably more future-proof.
EDIT: Works only in Python 3.
Answer by Mark Adler
Despite what the other answers say, the last four bytes are not a reliable way to get the uncompressed length of a gzip file. First, there may be multiple members in the gzip file, so that would only be the length of the last member. Second, the length may be more than 4 GB, in which case the last four bytes represent the length modulo 2^32. Not the length.
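Both caveats are easy to see in a small sketch. This one builds a two-member gzip stream in memory with the standard gzip module (the sizes 100 and 50 are arbitrary) and compares the trailing four bytes against the real decompressed length:

```python
import gzip
import struct

# Two gzip members concatenated back to back still form a valid gzip stream.
member_a = gzip.compress(b"a" * 100)
member_b = gzip.compress(b"b" * 50)
stream = member_a + member_b

# ISIZE: the last 4 bytes, read as a little-endian unsigned 32-bit integer.
isize = struct.unpack("<I", stream[-4:])[0]
actual = len(gzip.decompress(stream))

print(isize)   # 50  (length of the last member only)
print(actual)  # 150 (length of the whole stream)

# Modulo wrap-around: a 5 GiB input would leave only 1 GiB in ISIZE.
print((5 * 2**30) % 2**32 == 2**30)  # True
```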
However, for what you want, there is no need to get the uncompressed length. You can instead base your progress bar on the amount of input consumed, as compared to the length of the gzip file, which is readily obtained. For typical homogeneous data, that progress bar would show exactly the same thing as a progress bar based instead on the uncompressed data.
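One way this could look in Python is to open the raw file yourself, hand it to gzip.GzipFile, and ask the raw file how far it has been read. This is only a sketch: read_with_progress and chunk_size are names invented here, and the fraction runs a little ahead of the data actually delivered because the gzip module reads the compressed input in buffered blocks.

```python
import gzip
import os

def read_with_progress(path, chunk_size=64 * 1024):
    """Yield (chunk, fraction): decompressed data plus compressed bytes consumed / file size."""
    total = os.path.getsize(path)  # size of the compressed file, cheap to obtain
    with open(path, "rb") as raw, gzip.GzipFile(fileobj=raw) as gz:
        while True:
            chunk = gz.read(chunk_size)
            if not chunk:
                break
            # raw.tell() is the position in the compressed file, not in the output.
            fraction = raw.tell() / total if total else 1.0
            yield chunk, fraction
```

Usage would be something like `for chunk, fraction in read_with_progress("input.gz"): ...`, updating the bar with `fraction` on each iteration.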
Answer by Noel Burton-Krahn
import gzip
f = gzip.open(filename)
# kludge - report the compressed file position so progress bars
# don't go to 400%
f.tell = f.fileobj.tell
Answer by Matt Anderson
Looking at the source for the gzip module, I see that the underlying file object for GzipFile seems to be fileobj. So:
mygzipfile = gzip.GzipFile()
...
mygzipfile.fileobj.tell()
Maybe it would be good to do some sanity checking before doing that, like checking that the attribute exists with hasattr.
Not exactly a public API, but...
Answer by Guilherme Salgado
GzipFile.size stores the uncompressed size, but it's only incremented when you read the file, so you should prefer len(fd.read()) instead of the non-public GzipFile.size.
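Since len(fd.read()) holds the whole decompressed file in memory, a chunked variant may be gentler for large files. A minimal sketch, where count_uncompressed_bytes and chunk_size are made-up names; it still has to decompress everything, so it costs about as much as reading the file once:

```python
import gzip

def count_uncompressed_bytes(path, chunk_size=1024 * 1024):
    """Decompress `path` chunk by chunk and return the total number of bytes produced."""
    total = 0
    with gzip.open(path, "rb") as gz:
        while True:
            chunk = gz.read(chunk_size)
            if not chunk:  # empty bytes means end of the decompressed stream
                break
            total += len(chunk)
    return total
```

Unlike the last-four-bytes trick, this counts every member of a multi-member file and does not wrap at 4 GB.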
Answer by user2165857
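import struct
# gzip.read32 is an undocumented helper and may not exist in your Python
# version; on a GzipFile it would also read decompressed data, not the trailer.
with open("input.gz", "rb") as f:
    f.seek(-4, 2)  # ISIZE is the last 4 bytes of the raw .gz file
    size = struct.unpack("<I", f.read(4))[0]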