How to reduce the time taken to load a pickle file in Python

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/26860051/

How to Reduce the time taken to load a pickle file in python

Tags: python, performance, pickle

Asked by iNikkz

I have created a dictionary in Python and dumped it to a pickle file. Its size is about 300MB. Now, I want to load the same pickle file back.

output = open('myfile.pkl', 'rb')
mydict = pickle.load(output)

Loading this pickle file takes around 15 seconds. How can I reduce this time?

Hardware Specification: Ubuntu 14.04, 4GB RAM

The code below shows how long it takes to dump or load a file using json, pickle, and cPickle.

After dumping, the file size will be around 300MB.

import json, pickle, cPickle
import os, timeit

mydict= {all values to be added}

def dump_json():    
    output = open('myfile1.json', 'wb')
    json.dump(mydict, output)
    output.close()    

def dump_pickle():    
    output = open('myfile2.pkl', 'wb')
    pickle.dump(mydict, output,protocol=cPickle.HIGHEST_PROTOCOL)
    output.close()

def dump_cpickle():    
    output = open('myfile3.pkl', 'wb')
    cPickle.dump(mydict, output,protocol=cPickle.HIGHEST_PROTOCOL)
    output.close()

def load_json():
    output = open('myfile1.json', 'rb')
    mydict = json.load(output)
    output.close()

def load_pickle():
    output = open('myfile2.pkl', 'rb')
    mydict = pickle.load(output)
    output.close()

def load_cpickle():
    output = open('myfile3.pkl', 'rb')
    mydict = cPickle.load(output)
    output.close()


if __name__ == '__main__':
    print "Json dump: "
    t = timeit.Timer(stmt="pickle_wr.dump_json()", setup="import pickle_wr")  
    print t.timeit(1),'\n'

    print "Pickle dump: "
    t = timeit.Timer(stmt="pickle_wr.dump_pickle()", setup="import pickle_wr")  
    print t.timeit(1),'\n'

    print "cPickle dump: "
    t = timeit.Timer(stmt="pickle_wr.dump_cpickle()", setup="import pickle_wr")  
    print t.timeit(1),'\n'

    print "Json load: "
    t = timeit.Timer(stmt="pickle_wr.load_json()", setup="import pickle_wr")  
    print t.timeit(1),'\n'

    print "pickle load: "
    t = timeit.Timer(stmt="pickle_wr.load_pickle()", setup="import pickle_wr")  
    print t.timeit(1),'\n'

    print "cPickle load: "
    t = timeit.Timer(stmt="pickle_wr.load_cpickle()", setup="import pickle_wr")  
    print t.timeit(1),'\n'

Output:

Json dump: 
42.5809804916 

Pickle dump: 
52.87407804489 

cPickle dump: 
1.1903790187836 

Json load: 
12.240660209656 

pickle load: 
24.48748306274 

cPickle load: 
24.4888298893

I have seen that cPickle takes less time to dump and load, but loading a file still takes a long time.

Answer by twasbrillig

Try using the json library instead of pickle. This should be an option in your case because you're dealing with a dictionary, which is a relatively simple object.
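
One caveat if you switch to json: JSON object keys are always strings, so a dictionary with non-string keys will not round-trip unchanged. A quick sketch (Python 3 syntax shown here):

```python
import json

d = {1: 'one', 'two': 2}

# json serializes dict keys as strings, so the int key 1 becomes "1".
restored = json.loads(json.dumps(d))
print(restored)  # {'1': 'one', 'two': 2}
```

If all your keys are already strings (and your values are JSON-serializable), this is not an issue.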

According to this website,

JSON is 25 times faster in reading (loads) and 15 times faster in writing (dumps).

Also see this question: What is faster - Loading a pickled dictionary object or Loading a JSON file - to a dictionary?

Upgrading Python, or using the marshal module with a fixed Python version (marshal's format is version-specific), also helps boost speed (code adapted from here):

try: import cPickle
except ImportError: import pickle as cPickle
import pickle
import json, marshal, random
from time import time
from hashlib import md5

test_runs = 1000

if __name__ == "__main__":
    payload = {
        "float": [(random.randrange(0, 99) + random.random()) for i in range(1000)],
        "int": [random.randrange(0, 9999) for i in range(1000)],
        "str": [md5(str(random.random()).encode('utf8')).hexdigest() for i in range(1000)]
    }
    modules = [json, pickle, cPickle, marshal]

    for payload_type in payload:
        data = payload[payload_type]
        for module in modules:
            start = time()
            if module.__name__ in ['pickle', 'cPickle']:
                for i in range(test_runs): serialized = module.dumps(data, protocol=-1)
            else:
                for i in range(test_runs): serialized = module.dumps(data)
            w = time() - start
            start = time()
            for i in range(test_runs):
                unserialized = module.loads(serialized)
            r = time() - start
            print("%s %s W %.3f R %.3f" % (module.__name__, payload_type, w, r))

Results:

C:\Python27\python.exe -u "serialization_benchmark.py"
json int W 0.125 R 0.156
pickle int W 2.808 R 1.139
cPickle int W 0.047 R 0.046
marshal int W 0.016 R 0.031
json float W 1.981 R 0.624
pickle float W 2.607 R 1.092
cPickle float W 0.063 R 0.062
marshal float W 0.047 R 0.031
json str W 0.172 R 0.437
pickle str W 5.149 R 2.309
cPickle str W 0.281 R 0.156
marshal str W 0.109 R 0.047

C:\pypy-1.6\pypy-c -u "serialization_benchmark.py"
json int W 0.515 R 0.452
pickle int W 0.546 R 0.219
cPickle int W 0.577 R 0.171
marshal int W 0.032 R 0.031
json float W 2.390 R 1.341
pickle float W 0.656 R 0.436
cPickle float W 0.593 R 0.406
marshal float W 0.327 R 0.203
json str W 1.141 R 1.186
pickle str W 0.702 R 0.546
cPickle str W 0.828 R 0.562
marshal str W 0.265 R 0.078

c:\Python34\python -u "serialization_benchmark.py"
json int W 0.203 R 0.140
pickle int W 0.047 R 0.062
pickle int W 0.031 R 0.062
marshal int W 0.031 R 0.047
json float W 1.935 R 0.749
pickle float W 0.047 R 0.062
pickle float W 0.047 R 0.062
marshal float W 0.047 R 0.047
json str W 0.281 R 0.187
pickle str W 0.125 R 0.140
pickle str W 0.125 R 0.140
marshal str W 0.094 R 0.078

Python 3.4 uses pickle protocol 3 by default, which gave no difference compared to protocol 4. Python 2 has protocol 2 as its highest pickle protocol (selected when a negative value is passed to dump), which is twice as slow as protocol 3.

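
As a quick illustration of the protocol difference (Python 3 shown here; passing protocol=-1, as the benchmark above does, selects the highest protocol available):

```python
import pickle

data = {i: str(i) * 10 for i in range(1000)}

# Protocol 0 is the original ASCII protocol; HIGHEST_PROTOCOL is the
# newest binary protocol, which is more compact and faster to load.
old = pickle.dumps(data, protocol=0)
new = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

print(len(old) > len(new))         # True: the newer protocol is smaller
assert pickle.loads(new) == data   # both round-trip identically
```

Exact timings are machine-dependent, so only the size difference is asserted here.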
Answer by Mike McKerns

If you are trying to store the dictionary in a single file, it's the load time for the large file that is slowing you down. One of the easiest things you can do is to write the dictionary to a directory on disk, with each dictionary entry being an individual file. Then you can have the files pickled and unpickled in multiple threads (or using multiprocessing). For a very large dictionary, this should be much faster than reading to and from a single file, regardless of the serializer you choose. There are some packages like klepto and joblib that already do much (if not all) of the above for you. I'd check those packages out. (Note: I am the klepto author. See https://github.com/uqfoundation/klepto.)
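
A minimal hand-rolled version of the directory idea (without klepto, and without the threading) might look like the sketch below. The function names and the one-file-per-key layout are illustrative, and keys are assumed to be strings that are valid filenames; klepto's archives generalize this pattern to arbitrary keys.

```python
import os
import pickle


def dump_dict_to_dir(d, dirname):
    """Write each dictionary entry to its own pickle file in dirname."""
    os.makedirs(dirname, exist_ok=True)
    for key, value in d.items():
        with open(os.path.join(dirname, '%s.pkl' % key), 'wb') as f:
            pickle.dump(value, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_dict_from_dir(dirname):
    """Rebuild the dictionary by unpickling every .pkl file in dirname."""
    d = {}
    for name in os.listdir(dirname):
        key = name[:-len('.pkl')]  # strip the extension to recover the key
        with open(os.path.join(dirname, name), 'rb') as f:
            d[key] = pickle.load(f)
    return d
```

Each file can then be loaded independently, which is what makes the multi-threaded (or multiprocessing) loading possible.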

Answer by Tejas Shah

I've had nice results reading huge files (e.g. a ~750 MB igraph object, a binary pickle file) using cPickle itself. This was achieved by simply wrapping the pickle load call as mentioned here.

An example snippet for your case would be something like:

import timeit
import cPickle as pickle
import gc


def load_cpickle_gc():
    output = open('myfile3.pkl', 'rb')

    # disable garbage collector
    gc.disable()

    mydict = pickle.load(output)

    # enable garbage collector again
    gc.enable()
    output.close()


if __name__ == '__main__':
    print "cPickle load (with gc workaround): "
    t = timeit.Timer(stmt="pickle_wr.load_cpickle_gc()", setup="import pickle_wr")
    print t.timeit(1),'\n'

Surely, there might be more apt ways to get the task done; however, this workaround does reduce the time required drastically. (For me, it reduced from 843.04s to 41.28s, around 20x.)
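
The same workaround can be packaged as a small context manager (my own addition, not part of the original answer) so the collector is re-enabled even if loading raises:

```python
import gc
import pickle
from contextlib import contextmanager


@contextmanager
def gc_disabled():
    """Temporarily disable the garbage collector, restoring its prior state."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()


def load_pickle_fast(path):
    """Load a pickle file with the GC paused, as in the answer above."""
    with gc_disabled():
        with open(path, 'rb') as f:
            return pickle.load(f)
```

The speedup comes from the GC not tracking the millions of new objects created during unpickling; the try/finally guarantees the collector comes back on afterwards.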