Python: Creating a streaming gzip'd file-like?
Note: this is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA terms, link to the original, and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/2192529/
Asked by David Wolever
I'm trying to figure out the best way to compress a stream with Python's zlib.
I've got a file-like input stream (input, below) and an output function which accepts a file-like (output_function, below):
with open("file") as input:
    output_function(input)
And I'd like to gzip-compress input chunks before sending them to output_function:
with open("file") as input:
    output_function(gzip_stream(input))
It looks like the gzip module assumes that either the input or the output will be a gzip'd file-on-disk… So I assume that the zlib module is what I want.
However, it doesn't natively offer a simple way to create a streaming file-like… And the stream-compression it does support comes by way of manually adding data to a compression buffer, then flushing that buffer.
Of course, I could write a wrapper around zlib.Compress.compress and zlib.Compress.flush (Compress is returned by zlib.compressobj()), but I'd be worried about getting buffer sizes wrong, or something similar.
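For example, such a wrapper might look roughly like this (just a sketch; the chunk size and the gzip-style wbits value are guesses), and it naturally comes out as a generator of compressed chunks rather than a true file-like object, which is exactly the gap:

import zlib

def gzip_stream(input, chunk_size=16384):
    # Rough sketch: yield gzip-framed compressed chunks as the input file-like is consumed.
    # wbits = 16 + MAX_WBITS asks zlib to emit a gzip header and trailer.
    compressor = zlib.compressobj(9, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
    while True:
        chunk = input.read(chunk_size)
        if not chunk:
            break
        data = compressor.compress(chunk)
        if data:
            yield data
    yield compressor.flush()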
So, what's the simplest way to create a streaming, gzip-compressing file-like with Python?
Edit: To clarify, the input stream and the compressed output stream are both too large to fit in memory, so something like output_function(StringIO(zlib.compress(input.read()))) doesn't really solve the problem.
Accepted answer by Ricardo Cárdenes
It's quite kludgy (self-referencing, etc.; I just put a few minutes into writing it, nothing really elegant), but it does what you want if you're still interested in using gzip instead of zlib directly.
Basically, GzipWrap is a (very limited) file-like object that produces a gzipped file out of a given iterable (e.g., a file-like object, a list of strings, any generator…).
Of course, it produces binary, so there was no sense in implementing "readline".
You should be able to expand it to cover other cases or to be used as an iterable object itself.
from gzip import GzipFile

class GzipWrap(object):
    # input is a file-like object that feeds the input
    def __init__(self, input, filename=None):
        self.input = input
        self.buffer = ''
        # The GzipFile writes its compressed output back into this object
        # through our write() method (hence the self-reference).
        self.zipper = GzipFile(filename, mode='wb', fileobj=self)

    def read(self, size=-1):
        if (size < 0) or len(self.buffer) < size:
            for s in self.input:
                self.zipper.write(s)
                if size > 0 and len(self.buffer) >= size:
                    self.zipper.flush()
                    break
            else:
                # Input exhausted: close the GzipFile so it writes the gzip trailer.
                self.zipper.close()
        if size < 0:
            ret = self.buffer
            self.buffer = ''
        else:
            ret, self.buffer = self.buffer[:size], self.buffer[size:]
        return ret

    def flush(self):
        pass

    def write(self, data):
        # Called by the internal GzipFile with compressed bytes.
        self.buffer += data

    def close(self):
        self.input.close()
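Hooked up to the question's output_function, usage would look something like this (a sketch, reusing the file name from the question):

with open("file") as input:
    output_function(GzipWrap(input, "file"))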
Answer by Collin
Here is a cleaner, non-self-referencing version based on Ricardo Cárdenes' very helpful answer.
from gzip import GzipFile
from collections import deque

CHUNK = 16 * 1024

class Buffer(object):
    def __init__(self):
        self.__buf = deque()
        self.__size = 0

    def __len__(self):
        return self.__size

    def write(self, data):
        self.__buf.append(data)
        self.__size += len(data)

    def read(self, size=-1):
        if size < 0:
            size = self.__size
        ret_list = []
        while size > 0 and len(self.__buf):
            s = self.__buf.popleft()
            size -= len(s)
            ret_list.append(s)
        if size < 0:
            # The last chunk was larger than requested: split it and push
            # the remainder back onto the front of the deque.
            ret_list[-1], remainder = ret_list[-1][:size], ret_list[-1][size:]
            self.__buf.appendleft(remainder)
        ret = ''.join(ret_list)
        self.__size -= len(ret)
        return ret

    def flush(self):
        pass

    def close(self):
        pass

class GzipCompressReadStream(object):
    def __init__(self, fileobj):
        self.__input = fileobj
        self.__buf = Buffer()
        self.__gzip = GzipFile(None, mode='wb', fileobj=self.__buf)

    def read(self, size=-1):
        while size < 0 or len(self.__buf) < size:
            s = self.__input.read(CHUNK)
            if not s:
                # Input exhausted: closing the GzipFile flushes the trailer into the buffer.
                self.__gzip.close()
                break
            self.__gzip.write(s)
        return self.__buf.read(size)
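It can then be drained in fixed-size blocks, for example like this (a sketch; the output file name is made up for illustration):

with open("file", "rb") as input, open("file.gz", "wb") as output:
    stream = GzipCompressReadStream(input)
    while True:
        block = stream.read(CHUNK)
        if not block:
            break
        output.write(block)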
Advantages:

- Avoids repeated string concatenation, which would cause the entire string to be copied repeatedly.
- Reads a fixed CHUNK size from the input stream, instead of reading whole lines at a time (which can be arbitrarily long).
- Avoids circular references.
- Avoids the misleading public "write" method of GzipCompressReadStream(), which is really only used internally.
- Takes advantage of name mangling for internal member variables.
Answer by user249290
The gzip module supports compressing to a file-like object: pass a fileobj parameter to GzipFile, as well as a filename. The filename you pass in doesn't need to exist, but the gzip header has a filename field which needs to be filled out.
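A minimal sketch of that idea, using an in-memory StringIO as the file-like target (the buffer and sample data here are purely illustrative; see the update below for a case where this approach runs into trouble):

from gzip import GzipFile
from StringIO import StringIO

buf = StringIO()
# "file" only fills in the filename field of the gzip header; nothing is written to disk.
gz = GzipFile("file", mode="wb", fileobj=buf)
gz.write("some data to compress")
gz.close()
compressed = buf.getvalue()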
Update

This answer does not work. Example:
# tmp/try-gzip.py
import sys
import gzip
fd=gzip.GzipFile(fileobj=sys.stdin)
sys.stdout.write(fd.read())
Output:
===> cat .bash_history | python tmp/try-gzip.py > tmp/history.gzip
Traceback (most recent call last):
  File "tmp/try-gzip.py", line 7, in <module>
    sys.stdout.write(fd.read())
  File "/usr/lib/python2.7/gzip.py", line 254, in read
    self._read(readsize)
  File "/usr/lib/python2.7/gzip.py", line 288, in _read
    pos = self.fileobj.tell() # Save current position
IOError: [Errno 29] Illegal seek
Answer by jcdyer
Use the cStringIO (or StringIO) module in conjunction with zlib:
>>> import zlib
>>> from cStringIO import StringIO
>>> s = StringIO()
>>> s.write(zlib.compress("I'm a lumberjack"))
>>> s.seek(0)
>>> zlib.decompress(s.read())
"I'm a lumberjack"
Answer by user582175
This works (at least in Python 3):
# "s3" is assumed to be an s3fs filesystem object; "path" and "filename" are defined elsewhere.
import gzip

with s3.open(path, 'wb') as f:
    gz = gzip.GzipFile(filename, 'wb', 9, f)
    gz.write(b'hello')
    gz.flush()
    gz.close()
Here it writes to s3fs's file object with gzip compression on it. The magic is the f parameter, which is GzipFile's fileobj. You have to provide a file name for gzip's header.