Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverflow, original URL: http://stackoverflow.com/questions/1250688/

Date: 2020-11-03 21:47:02  Source: igfitidea

Decompressing a .bz2 file in Python

python, compression

Asked by user153186

So, this is a seemingly simple question, but I'm apparently very very dull. I have a little script that downloads all the .bz2 files from a webpage, but for some reason the decompressing of that file is giving me a MAJOR headache.

I'm quite a Python newbie, so the answer is probably quite obvious, please help me.

In this bit of the script, I already have the file, and I just want to read it out to a variable, then decompress that? Is that right? I've tried all sorts of ways to do this, and I usually get a "ValueError: couldn't find end of stream" error on the last line in this snippet. I've tried to open up the zipfile and write it out to a string in a zillion different ways. This is the latest.

openZip = open(zipFile, "r")
s = ''
while True:
    newLine = openZip.readline()
    if(len(newLine)==0):
       break
    s+=newLine
    print s                   
    uncompressedData = bz2.decompress(s)

Hi Alex, I should've listed all the other methods I've tried, as I've tried the read() way.

METHOD A:

print 'decompressing ' + filename

fileHandle = open(zipFile)
uncompressedData = ''

while True:            
    s = fileHandle.read(1024)
    if not s:
        break
        print('RAW "%s"', s)
        uncompressedData += bz2.decompress(s)

        uncompressedData += bz2.flush()

        newFile = open(steamTF2mapdir + filename.split(".bz2")[0],"w")
        newFile.write(uncompressedData)
        newFile.close()   

I get the error:

uncompressedData += bz2.decompress(s)
ValueError: couldn't find end of stream

METHOD B

zipFile = steamTF2mapdir + filename
print 'decompressing ' + filename
fileHandle = open(zipFile)

s = fileHandle.read()
uncompressedData = bz2.decompress(s)

Same error:

uncompressedData = bz2.decompress(s)
ValueError: couldn't find end of stream

Thanks so much for your prompt reply. I'm really banging my head against the wall, feeling inordinately thick for not being able to decompress a simple .bz2 file.

By the by, I used 7zip to decompress it manually, to make sure the file isn't wonky or anything, and it decompresses fine.

Answered by Alex Martelli

You're opening and reading the compressed file as if it was a textfile made up of lines. DON'T! It's NOT.

uncompressedData = bz2.BZ2File(zipFile).read()

seems to be closer to what you're angling for.
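
As a minimal sketch of that one-liner in context (Python 3 syntax, not the OP's Python 2), the following reads a whole .bz2 file through bz2.BZ2File and writes the result next to it with the .bz2 suffix stripped. The helper name decompress_bz2_file, the file name example_map.bsp.bz2 and the demo payload are assumptions made up for illustration.

```python
import bz2
import os
import tempfile


def decompress_bz2_file(src_path, dest_path):
    """Read a whole .bz2 file and write the decompressed bytes to dest_path."""
    # BZ2File reads the underlying file as binary on its own,
    # so no "rb" flag is needed here.
    with bz2.BZ2File(src_path) as compressed:
        data = compressed.read()
    # The payload is bytes, so the output file must be opened in binary mode.
    with open(dest_path, "wb") as out:
        out.write(data)
    return len(data)


# Demo on a throwaway file (hypothetical name and contents).
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "example_map.bsp.bz2")
with open(src, "wb") as f:
    f.write(bz2.compress(b"hello, bz2\n" * 1000))

dest = src[:-len(".bz2")]  # strip the .bz2 suffix for the output name
written = decompress_bz2_file(src, dest)
print("wrote", written, "bytes to", dest)
```

Reading everything with read() keeps the whole decompressed file in memory, which is probably fine for small downloads but may not suit very large archives.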

Edit: the OP has shown a few more things he's tried (though I don't see any notes about having tried the best method -- the one-liner I recommend above!) but they seem to all have one error in common, and I repeat the key bits from above:

opening ... the compressed file as if it was a textfile ... It's NOT.

open(filename) and even the more explicit open(filename, 'r') open, for reading, a text file -- a compressed file is a binary file, so in order to read it correctly you must open it with open(filename, 'rb'). (My recommended bz2.BZ2File KNOWS it's dealing with a compressed file, of course, so there's no need to tell it anything more.)
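
If you do want to stay with open plus bz2.decompress, a sketch along these lines should work in Python 3, assuming the file holds one complete bz2 stream. The helper name read_and_decompress, the file name demo.bin.bz2 and the payload are made up for the demo.

```python
import bz2
import os
import tempfile


def read_and_decompress(path):
    """Read a .bz2 file as raw bytes and decompress it in one shot."""
    with open(path, "rb") as fh:  # "rb": no newline or end-of-file translation
        raw = fh.read()
    # One-shot decompress needs the complete compressed stream.
    return bz2.decompress(raw)


# Demo on a throwaway file; name and payload are hypothetical.
payload = bytes(range(256)) * 64  # includes CR, LF and 0x1A bytes
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin.bz2")
with open(demo_path, "wb") as fh:
    fh.write(bz2.compress(payload))

restored = read_and_decompress(demo_path)
print(restored == payload)
```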

In Python 2.*, on Unix-y systems (i.e. every system except Windows), you could get away with a sloppy use of open (but in Python 3.* you can't, as text is Unicode, while binary is bytes -- different types).

In Windows (and before then in DOS) it's always been indispensable to distinguish, as Windows' text files, for historical reasons, are peculiar (they use two bytes rather than one to end lines and, at least in some cases, take a byte worth '\x1A' as meaning a logical end of file), and so the low-level reading and writing code must compensate.
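
To make the damage concrete, here is a rough simulation of what such a text-mode read can do to binary data. It is only an approximation of the legacy behavior described above, not the actual runtime code, and the function name simulate_dos_text_mode_read and the sample bytes are invented for illustration (the sample is not a valid bz2 stream).

```python
def simulate_dos_text_mode_read(raw):
    """Approximate a legacy Windows text-mode read applied to raw bytes."""
    # A 0x1A byte (Ctrl-Z) is treated as a logical end of file,
    # so everything from that byte onward is dropped.
    eof = raw.find(b"\x1a")
    if eof != -1:
        raw = raw[:eof]
    # Each two-byte CR/LF line ending is collapsed into a single LF.
    return raw.replace(b"\r\n", b"\n")


# Made-up bytes standing in for compressed data.
sample = b"BZh9\r\nmore\x1atail"
damaged = simulate_dos_text_mode_read(sample)
print(len(sample), "bytes in,", len(damaged), "bytes out")
```

Either change alone is enough to make a compressed stream look truncated or corrupt, which fits an error such as "couldn't find end of stream".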

So I suspect the OP is using Windows and is paying the price for not carefully using the 'rb' option ("read binary") to the open built-in (though bz2.BZ2File is still simpler, whatever platform you're using!-).

Answered by Martin

openZip = open(zipFile, "r")

If you're running on Windows, you may want to say openZip = open(zipFile, "rb") here, since the file is likely to contain CR/LF combinations, and you don't want them to be translated.

newLine = openZip.readline()

As Alex pointed out, this is very wrong, as the concept of "lines" is foreign to a compressed stream.

s = fileHandle.read(1024) [...] uncompressedData += bz2.decompress(s)

This is wrong for the same reason. 1024-byte chunks aren't likely to mean much to the decompressor, since it's going to want to work with its own block size.
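
If reading in chunks is still what you want, one possible approach (a swapped-in technique, not something used in the question) is the incremental bz2.BZ2Decompressor object, which accepts arbitrary chunk sizes, unlike the one-shot bz2.decompress, which expects a complete stream on every call. A minimal sketch in Python 3, assuming the input holds a single bz2 stream; the helper name decompress_in_chunks and the demo payload are made up.

```python
import bz2
import io


def decompress_in_chunks(fileobj, chunk_size=1024):
    """Decompress a single-stream bz2 file object read in fixed-size chunks."""
    decompressor = bz2.BZ2Decompressor()
    pieces = []
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        # The decompressor buffers partial blocks internally, so the
        # chunk boundaries do not need to line up with bz2 blocks.
        pieces.append(decompressor.decompress(chunk))
    return b"".join(pieces)


# Demo with in-memory data (hypothetical payload).
payload = b"".join(b"line %d\n" % i for i in range(20000))
compressed = bz2.compress(payload)
restored = decompress_in_chunks(io.BytesIO(compressed))
print(restored == payload)
```

With a real file, the file object passed in would still need to be opened with "rb".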

s = fileHandle.read()
uncompressedData = bz2.decompress(s)

If that doesn't work, I'd say it's the new-line translation problem I mentioned above.

Answered by Jon L ehto

This was very helpful. 44 of 2300 files gave an end-of-file-missing error when opened with open on Windows. Adding the b(inary) flag to open fixed the problem.

for line in bz2.BZ2File(filename, 'rb', 10000000) :

works well. (The 10M is the buffering size, which works well with the large files involved.)
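
For readers on Python 3, where as far as I know the buffering argument of bz2.BZ2File was ignored and then removed in 3.9, a comparable line-by-line loop can be sketched with bz2.open in text mode. The helper name count_lines, the file name demo.log.bz2 and the UTF-8 encoding are assumptions for the demo.

```python
import bz2
import os
import tempfile


def count_lines(path, encoding="utf-8"):
    """Iterate over a bz2-compressed text file line by line and count lines."""
    # "rt" decompresses and decodes to str; lines are produced lazily.
    with bz2.open(path, "rt", encoding=encoding) as lines:
        return sum(1 for _ in lines)


# Demo on a throwaway file (hypothetical name and contents).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.log.bz2")
with bz2.open(demo_path, "wt", encoding="utf-8") as out:
    for i in range(5000):
        out.write("record %d\n" % i)

total = count_lines(demo_path)
print(total, "lines")
```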

Thanks!
