Cleaner way to read/gunzip a huge file in python

Disclaimer: this page is a copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/14655982/

Date: 2020-08-18 12:04:03  Source: igfitidea

python, gzip, subprocess, gunzip

Asked by LittleBobbyTables

So I have some fairly gigantic .gz files - we're talking 10 to 20 GB each when decompressed.

I need to loop through each line of them, so I'm using the standard:

import gzip
f = gzip.open(path+myFile, 'r')
for line in f.readlines():
    #(yadda yadda)
f.close()

However, both the open() and close() commands take AGES, using up 98% of the memory+CPU. So much so that the program exits and prints Killed to the terminal. Maybe it is loading the entire extracted file into memory?

I'm now using something like:

from subprocess import call
f = open(path+'myfile.txt', 'w')
call(['gunzip', '-c', path+myfile], stdout=f)
#do some looping through the file
f.close()
#then delete extracted file

This works. But is there a cleaner way?

Accepted answer by abarnert

I'm 99% sure that your problem is not in the gzip.open(), but in the readlines().

As the documentation explains:

f.readlines() returns a list containing all the lines of data in the file.

Obviously, that requires reading and decompressing the entire file, and building up an absolutely gigantic list.

Most likely, it's actually the malloc calls to allocate all that memory that are taking forever. And then, at the end of this scope (assuming you're using CPython), it has to GC that whole gigantic list, which will also take forever.

You almost never want to use readlines. Unless you're using a very old Python, just do this:

for line in f:

A file is an iterable full of lines, just like the list returned by readlines, except that it's not actually a list: it generates more lines on the fly by reading out of a buffer. So, at any given time, you'll only have one line and a couple of buffers on the order of 10MB each, instead of a 25GB list. And the reading and decompressing will be spread out over the lifetime of the loop, instead of done all at once.
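
Put together, a minimal sketch of that pattern might look like the following (it reuses the path and myFile placeholders from the question; the 'rt' text mode assumes Python 3):

import gzip

# Stream the file line by line so only a small buffer is held in memory
# at any time ('rt' assumes Python 3; on Python 2, plain 'r' yields
# byte strings instead).
with gzip.open(path + myFile, 'rt') as f:
    for line in f:
        # process one line at a time here
        pass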

From a quick test, with a 3.5GB gzip file, gzip.open() is effectively instant, for line in f: pass takes a few seconds, and gzip.close() is effectively instant. But if I do for line in f.readlines(): pass, it takes… well, I'm not sure how long, because after about a minute my system went into swap thrashing hell and I had to force-kill the interpreter to get it to respond to anything…
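
For reference, a rough sketch of how such a quick measurement might be run (the file name is a placeholder, and time.perf_counter assumes Python 3.3+):

import gzip
import time

t0 = time.perf_counter()
f = gzip.open('huge_file.gz', 'rt')   # effectively instant: no data is decompressed yet
t1 = time.perf_counter()
for line in f:                        # decompression happens incrementally here
    pass
t2 = time.perf_counter()
f.close()                             # effectively instant
t3 = time.perf_counter()

print('open:  %.3fs' % (t1 - t0))
print('loop:  %.3fs' % (t2 - t1))
print('close: %.3fs' % (t3 - t2))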

Since this has come up a dozen more times since this answer, I wrote this blog post, which explains a bit more.

Answer by Francesco Montesano

Have a look at pandas, in particular its IO tools. They support gzip compression when reading files, and you can read files in chunks. Besides, pandas is very fast and memory efficient.

As I've never tried it, I don't know how well the compression and reading in chunks work together, but it might be worth giving it a try.

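As a rough illustration (untested; the file name, separator, and chunk size are placeholder assumptions), pandas can combine gzip decompression with chunked reading so that only one chunk of rows is in memory at a time:

import pandas as pd

# Read the gzipped file in chunks of 100,000 rows; each chunk is a DataFrame.
reader = pd.read_csv('huge_file.txt.gz', sep='\t', compression='gzip', chunksize=100000)
for chunk in reader:
    # process one chunk at a time (filter, aggregate, write out, ...)
    pass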