Python: Decreasing the size of cPickle objects

Notice: this page is a side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/18474791/

Date: 2020-08-19 10:50:56  Source: igfitidea

Decreasing the size of cPickle objects

Tags: python, serialization, pickle

Asked by ddn

I am running code that creates large objects, containing multiple user-defined classes, which I must then serialize for later use. From what I can tell, only pickling is versatile enough for my requirements. I've been using cPickle to store them, but the objects it generates are approximately 40 GB in size, from code that runs in 500 MB of memory. Speed of serialization isn't an issue, but the size of the object is. Are there any tips or alternate processes I can use to make the pickles smaller?

Accepted answer by Viktor Kerkez

If you must use pickle and no other method of serialization works for you, you can always pipe the pickle through bzip2. The only problem is that bzip2 is a little bit slow... gzip should be faster, but the file size is almost 2x bigger:

In [1]: class Test(object):
            def __init__(self):
                self.x = 3841984789317471348934788731984731749374
                self.y = 'kdjsaflkjda;sjfkdjsf;klsdjakfjdafjdskfl;adsjfl;dasjf;ljfdlf'
        l = [Test() for i in range(1000000)]

In [2]: import cPickle as pickle          
        with open('test.pickle', 'wb') as f:
            pickle.dump(l, f)
        !ls -lh test.pickle
-rw-r--r--  1 viktor  staff    88M Aug 27 22:45 test.pickle

In [3]: import bz2
        import cPickle as pickle
        with bz2.BZ2File('test.pbz2', 'w') as f:
            pickle.dump(l, f)
        !ls -lh test.pbz2
-rw-r--r--  1 viktor  staff   2.3M Aug 27 22:47 test.pbz2

In [4]: import gzip
        import cPickle as pickle
        with gzip.GzipFile('test.pgz', 'w') as f:
            pickle.dump(l, f)
        !ls -lh test.pgz
-rw-r--r--  1 viktor  staff   4.8M Aug 27 22:51 test.pgz

So we see that the bzip2 file is almost 40x smaller than the raw pickle, and the gzip file is about 20x smaller. And gzip is pretty close in performance to the raw cPickle, as you can see:

cPickle : best of 3: 18.9 s per loop
bzip2   : best of 3: 54.6 s per loop
gzip    : best of 3: 24.4 s per loop
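
For readers on Python 3, where cPickle no longer exists as a separate module and pickle uses the C accelerator automatically, the size comparison above might be sketched roughly as follows. This is a minimal illustration, not the answerer's benchmark: it compresses in memory instead of writing files, it uses plain dicts as a stand-in for the Test instances, and the helper name compare_sizes is hypothetical.

```python
import bz2
import gzip
import pickle  # Python 3: the C implementation is used automatically


def compare_sizes(obj):
    """Return the pickled size of obj in bytes: raw, bz2 and gzip."""
    raw = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    return {
        'raw': len(raw),
        'bz2': len(bz2.compress(raw)),
        'gzip': len(gzip.compress(raw)),
    }


def make_data(n):
    """Build n dicts that mimic the attributes of the answer's Test class."""
    return [
        {
            'x': 3841984789317471348934788731984731749374,
            'y': 'kdjsaflkjda;sjfkdjsf;klsdjakfjdafjdskfl;adsjfl;dasjf;ljfdlf',
        }
        for _ in range(n)
    ]


if __name__ == '__main__':
    # 10,000 items instead of 1,000,000 so the sketch runs quickly
    for name, size in compare_sizes(make_data(10000)).items():
        print(name, size)
```

The exact ratios depend heavily on how repetitive the data is, so the figures from the answer should not be expected to carry over to other objects.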

Answer by John Lyon

You can combine your cPickle dump call with a gzip-compressed file:

import cPickle
import gzip

def save_zipped_pickle(obj, filename, protocol=-1):
    with gzip.open(filename, 'wb') as f:
        cPickle.dump(obj, f, protocol)

And to re-load a zipped pickled object:

def load_zipped_pickle(filename):
    with gzip.open(filename, 'rb') as f:
        loaded_object = cPickle.load(f)
        return loaded_object
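
A Python 3 variant of these two helpers, with a small round trip to show them working together, might look like the sketch below. It assumes Python 3 (pickle instead of cPickle); the round_trip helper and the file name example.pgz are hypothetical and exist only for illustration.

```python
import gzip
import os
import pickle
import tempfile


def save_zipped_pickle(obj, filename, protocol=pickle.HIGHEST_PROTOCOL):
    # gzip.open in binary mode gives pickle a file-like object to write to
    with gzip.open(filename, 'wb') as f:
        pickle.dump(obj, f, protocol)


def load_zipped_pickle(filename):
    with gzip.open(filename, 'rb') as f:
        return pickle.load(f)


def round_trip(obj):
    """Save obj to a temporary gzip file, load it back, then clean up."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'example.pgz')  # hypothetical file name
        save_zipped_pickle(obj, path)
        return load_zipped_pickle(path)


if __name__ == '__main__':
    original = {'values': list(range(10)), 'label': 'demo'}
    print(round_trip(original) == original)
```

As with any pickle, the load helper should only be pointed at files from a trusted source, since unpickling can execute arbitrary code.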

Answer by Moot

You might want to use a more efficient pickling protocol.

In Python 2, which this question targets, there are three pickle protocols:

  • Protocol version 0 is the original ASCII protocol and is backwards compatible with earlier versions of Python.
  • Protocol version 1 is the old binary format which is also compatible with earlier versions of Python.
  • Protocol version 2 was introduced in Python 2.3. It provides much more efficient pickling of new-style classes.

Furthermore, the default in Python 2 is protocol 0, the least efficient one:

If a protocol is not specified, protocol 0 is used. If protocol is specified as a negative value or HIGHEST_PROTOCOL, the highest protocol version available will be used.
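
In Python 3 the default is no longer protocol 0, and the pickle module exposes both the default and the highest protocol as constants. The sketch below, which assumes Python 3, shows one way to check which protocol a given pickle was written with; the helper name pickled_protocol is hypothetical. It relies on the fact that pickles written with protocol 2 or later start with a PROTO opcode (byte 0x80) followed by the protocol number, while protocols 0 and 1 carry no such header.

```python
import pickle


def pickled_protocol(data):
    """Return the protocol number recorded in a pickle, or None.

    Protocols 0 and 1 do not record a version, so None is returned for them.
    """
    if len(data) >= 2 and data[0] == 0x80:
        return data[1]
    return None


if __name__ == '__main__':
    obj = {'numbers': list(range(5))}
    print('default protocol:', pickle.DEFAULT_PROTOCOL)
    print('highest protocol:', pickle.HIGHEST_PROTOCOL)
    # A negative protocol selects the highest available version
    print('protocol=-1 wrote:', pickled_protocol(pickle.dumps(obj, protocol=-1)))
    print('protocol=0 wrote:', pickled_protocol(pickle.dumps(obj, protocol=0)))
```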

Let's compare the size of a pickle made with the latest protocol, which in Python 2 is protocol 2 (the most efficient one), against one made with protocol 0 (the default), for an arbitrary example. Note that I use protocol=-1 here to make sure we always use the latest protocol, and that I import cPickle to make sure we use the faster C implementation:

import numpy
from sys import getsizeof
import cPickle as pickle

# Create list of 10 arrays of size 100x100
a = [numpy.random.random((100, 100)) for _ in xrange(10)]

# Pickle to a string in two ways
str_old = pickle.dumps(a, protocol=0)
str_new = pickle.dumps(a, protocol=-1)

# Measure size of strings
size_old = getsizeof(str_old)
size_new = getsizeof(str_new)

# Print size (in kilobytes) using old, using new, and the ratio
print size_old / 1024.0, size_new / 1024.0, size_old / float(size_new)

The output I get is:

2172.27246094 781.703125 2.77889698975

This indicates that pickling with the old protocol used 2172 KB, pickling with the new protocol used 782 KB, and the difference is a factor of about 2.8. Note that this factor is specific to this example; your results may vary depending on the object you are pickling.
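
The same kind of measurement can be sketched in Python 3 with only the standard library. This is an illustration under stated assumptions, not a reproduction of the numbers above: nested lists of random floats stand in for the numpy arrays, the helper name protocol_sizes is hypothetical, and len() is used instead of getsizeof so that only the serialized bytes are counted, without the Python object header.

```python
import pickle
import random


def protocol_sizes(obj):
    """Return (size with protocol 0, size with the highest protocol) in bytes."""
    old = pickle.dumps(obj, protocol=0)
    new = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    return len(old), len(new)


if __name__ == '__main__':
    random.seed(0)
    # 10 tables of 100x100 random floats, standing in for the numpy arrays
    data = [
        [[random.random() for _ in range(100)] for _ in range(100)]
        for _ in range(10)
    ]
    size_old, size_new = protocol_sizes(data)
    # Sizes in kilobytes, followed by the ratio between them
    print(size_old / 1024.0, size_new / 1024.0, size_old / size_new)
```

Protocol 0 writes each float as text, while the binary protocols store it in 8 bytes plus an opcode, which is why the gap shows up clearly on numeric data.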
