Python: Creating a streaming gzip'd file-like?
Note: this is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA terms, link to the original, and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/2192529/
Asked by David Wolever
I'm trying to figure out the best way to compress a stream with Python's zlib.
I've got a file-like input stream (input, below) and an output function which accepts a file-like (output_function, below):
with open("file") as input:
    output_function(input)
And I'd like to gzip-compress input chunks before sending them to output_function:
with open("file") as input:
    output_function(gzip_stream(input))
It looks like the gzip module assumes that either the input or the output will be a gzip'd file-on-disk… So I assume that the zlib module is what I want.
However, it doesn't natively offer a simple way to create a streaming file-like… And the stream-compression it does support comes by way of manually adding data to a compression buffer, then flushing that buffer.
Of course, I could write a wrapper around zlib.Compress.compress and zlib.Compress.flush (Compress is returned by zlib.compressobj()), but I'd be worried about getting buffer sizes wrong, or something similar.
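For example, such a wrapper might look roughly like this (just a sketch; the chunk size and the gzip-style wbits value are guesses), and it naturally comes out as a generator of compressed chunks rather than a true file-like object, which is exactly the gap:

import zlib

def gzip_stream(input, chunk_size=16384):
    # Rough sketch: yield gzip-framed compressed chunks as the input file-like is consumed.
    # wbits = 16 + MAX_WBITS asks zlib to emit a gzip header and trailer.
    compressor = zlib.compressobj(9, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
    while True:
        chunk = input.read(chunk_size)
        if not chunk:
            break
        data = compressor.compress(chunk)
        if data:
            yield data
    yield compressor.flush()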
So, what's the simplest way to create a streaming, gzip-compressing file-like with Python?
Edit: To clarify, the input stream and the compressed output stream are both too large to fit in memory, so something like output_function(StringIO(zlib.compress(input.read()))) doesn't really solve the problem.
Accepted answer by Ricardo Cárdenes
It's quite kludgy (self-referencing, etc.; I just put a few minutes into writing it, nothing really elegant), but it does what you want if you're still interested in using gzip instead of zlib directly.
Basically, GzipWrap is a (very limited) file-like object that produces a gzipped file out of a given iterable (e.g., a file-like object, a list of strings, any generator…).
Of course, it produces binary, so there was no sense in implementing "readline".
You should be able to expand it to cover other cases or to be used as an iterable object itself.
from gzip import GzipFile

class GzipWrap(object):
    # input is a file-like object that feeds the input
    def __init__(self, input, filename=None):
        self.input = input
        self.buffer = ''
        # The GzipFile writes its compressed output back into this object
        # through our write() method (hence the self-reference).
        self.zipper = GzipFile(filename, mode='wb', fileobj=self)

    def read(self, size=-1):
        if (size < 0) or len(self.buffer) < size:
            for s in self.input:
                self.zipper.write(s)
                if size > 0 and len(self.buffer) >= size:
                    self.zipper.flush()
                    break
            else:
                # Input exhausted: close the GzipFile so it writes the gzip trailer.
                self.zipper.close()
        if size < 0:
            ret = self.buffer
            self.buffer = ''
        else:
            ret, self.buffer = self.buffer[:size], self.buffer[size:]
        return ret

    def flush(self):
        pass

    def write(self, data):
        # Called by the internal GzipFile with compressed bytes.
        self.buffer += data

    def close(self):
        self.input.close()
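Hooked up to the question's output_function, usage would look something like this (a sketch, reusing the file name from the question):

with open("file") as input:
    output_function(GzipWrap(input, "file"))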
Answer by Collin
Here is a cleaner, non-self-referencing version based on Ricardo Cárdenes' very helpful answer.
from gzip import GzipFile
from collections import deque

CHUNK = 16 * 1024

class Buffer(object):
    def __init__(self):
        self.__buf = deque()
        self.__size = 0

    def __len__(self):
        return self.__size

    def write(self, data):
        self.__buf.append(data)
        self.__size += len(data)

    def read(self, size=-1):
        if size < 0:
            size = self.__size
        ret_list = []
        while size > 0 and len(self.__buf):
            s = self.__buf.popleft()
            size -= len(s)
            ret_list.append(s)
        if size < 0:
            # The last chunk was larger than requested: split it and push
            # the remainder back onto the front of the deque.
            ret_list[-1], remainder = ret_list[-1][:size], ret_list[-1][size:]
            self.__buf.appendleft(remainder)
        ret = ''.join(ret_list)
        self.__size -= len(ret)
        return ret

    def flush(self):
        pass

    def close(self):
        pass

class GzipCompressReadStream(object):
    def __init__(self, fileobj):
        self.__input = fileobj
        self.__buf = Buffer()
        self.__gzip = GzipFile(None, mode='wb', fileobj=self.__buf)

    def read(self, size=-1):
        while size < 0 or len(self.__buf) < size:
            s = self.__input.read(CHUNK)
            if not s:
                # Input exhausted: closing the GzipFile flushes the trailer into the buffer.
                self.__gzip.close()
                break
            self.__gzip.write(s)
        return self.__buf.read(size)
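It can then be drained in fixed-size blocks, for example like this (a sketch; the output file name is made up for illustration):

with open("file", "rb") as input, open("file.gz", "wb") as output:
    stream = GzipCompressReadStream(input)
    while True:
        block = stream.read(CHUNK)
        if not block:
            break
        output.write(block)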
Advantages:

- Avoids repeated string concatenation, which would cause the entire string to be copied repeatedly.
- Reads a fixed CHUNK size from the input stream, instead of reading whole lines at a time (which can be arbitrarily long).
- Avoids circular references.
- Avoids the misleading public "write" method of GzipCompressReadStream(), which is really only used internally.
- Takes advantage of name mangling for internal member variables.
Answer by user249290
The gzip module supports compressing to a file-like object: pass a fileobj parameter to GzipFile, as well as a filename. The filename you pass in doesn't need to exist, but the gzip header has a filename field which needs to be filled out.
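A minimal sketch of that idea, using an in-memory StringIO as the file-like target (the buffer and sample data here are purely illustrative; see the update below for a case where this approach runs into trouble):

from gzip import GzipFile
from StringIO import StringIO

buf = StringIO()
# "file" only fills in the filename field of the gzip header; nothing is written to disk.
gz = GzipFile("file", mode="wb", fileobj=buf)
gz.write("some data to compress")
gz.close()
compressed = buf.getvalue()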
Update

This answer does not work. Example:
# tmp/try-gzip.py
import sys
import gzip
fd=gzip.GzipFile(fileobj=sys.stdin)
sys.stdout.write(fd.read())
Output:
===> cat .bash_history | python tmp/try-gzip.py > tmp/history.gzip
Traceback (most recent call last):
  File "tmp/try-gzip.py", line 7, in <module>
    sys.stdout.write(fd.read())
  File "/usr/lib/python2.7/gzip.py", line 254, in read
    self._read(readsize)
  File "/usr/lib/python2.7/gzip.py", line 288, in _read
    pos = self.fileobj.tell() # Save current position
IOError: [Errno 29] Illegal seek
Answer by jcdyer
Use the cStringIO (or StringIO) module in conjunction with zlib:
>>> import zlib
>>> from cStringIO import StringIO
>>> s = StringIO()
>>> s.write(zlib.compress("I'm a lumberjack"))
>>> s.seek(0)
>>> zlib.decompress(s.read())
"I'm a lumberjack"
Answer by user582175
This works (at least in Python 3):
# "s3" is assumed to be an s3fs filesystem object; "path" and "filename" are defined elsewhere.
import gzip

with s3.open(path, 'wb') as f:
    gz = gzip.GzipFile(filename, 'wb', 9, f)
    gz.write(b'hello')
    gz.flush()
    gz.close()
Here it writes to s3fs's file object with gzip compression on it. The magic is the f parameter, which is GzipFile's fileobj. You have to provide a file name for gzip's header.