Pickle dump huge file without memory error

Disclaimer: this content is taken from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow

Original question: http://stackoverflow.com/questions/17513036/

Asked by user2543682
I have a program where I basically adjust the probability of certain things happening based on what is already known. My file of data is already saved as a pickled dictionary object at Dictionary.txt.
The problem is that every time I run the program it pulls in Dictionary.txt, turns it into a dictionary object, makes its edits, and overwrites Dictionary.txt. This is pretty memory intensive, as Dictionary.txt is 123 MB. When I dump I get the MemoryError; everything seems fine when I pull it in.
1. Is there a better (more efficient) way of doing the edits? (Perhaps without having to overwrite the entire file every time.)
2. Is there a way that I can invoke garbage collection (through the gc module)? (I already have it auto-enabled via gc.enable().)
3. I know that besides readlines() you can read line-by-line. Is there a way to edit the dictionary incrementally, line-by-line, when I already have a fully completed dictionary object file in the program?
4. Any other solutions?
Thank you for your time.
Answered by Imran
If your keys and values are strings, you can use one of the embedded persistent key-value storage engines available in the Python standard library. Example from the anydbm module docs:
import anydbm
# Open database, creating it if necessary.
db = anydbm.open('cache', 'c')
# Record some values
db['www.python.org'] = 'Python Website'
db['www.cnn.com'] = 'Cable News Network'
# Loop through contents. Other dictionary methods
# such as .keys(), .values() also work.
for k, v in db.iteritems():
    print k, '\t', v
# Storing a non-string key or value will raise an exception (most
# likely a TypeError).
db['www.yahoo.com'] = 4
# Close when done.
db.close()
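Applied to the question, a sketch of the same idea (still Python 2 syntax to match the example above; in Python 3 the module is simply called dbm). Only the entries that changed get written, so the whole 123 MB structure never has to be re-dumped. The key and value below are made up for illustration, and values must be stored as strings:

import anydbm

db = anydbm.open('Dictionary.db', 'c')
# Update just the entries that changed; everything else stays on disk untouched.
db['some_event'] = str(0.75)
db.close()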
Answered by Chris Wheadon
Have you tried using streaming pickle: https://code.google.com/p/streaming-pickle/
I have just solved a similar memory error by switching to streaming pickle.
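For illustration only (this is not the streaming-pickle package's own API), here is a minimal sketch of the same streaming idea using just the standard pickle module: each key/value pair is pickled separately, so the dump never has to serialize the whole dictionary into one buffer. The function names and path are hypothetical:

import pickle

def stream_dump(d, path):
    # Write one small pickle per (key, value) pair instead of one huge pickle.
    with open(path, 'wb') as f:
        for item in d.items():
            pickle.dump(item, f, protocol=2)

def stream_load(path):
    # Read the pairs back one at a time and rebuild the dictionary.
    d = {}
    with open(path, 'rb') as f:
        while True:
            try:
                key, value = pickle.load(f)
            except EOFError:
                break
            d[key] = value
    return d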
Answered by richie
How about this?
import cPickle as pickle

p = pickle.Pickler(open("temp.p", "wb"))
# Fast mode skips the pickler's memo (its table of already-seen objects),
# which reduces memory use but cannot handle self-referential structures.
p.fast = True
p.dump(d)  # d is your dictionary (or any other picklable object)
Answered by denfromufa
I had a memory error and resolved it by using protocol=2:
cPickle.dump(obj, file, protocol=2)
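For completeness, a sketch of how that call typically looks in context, assuming d is the dictionary from the question; the file must be opened in binary mode, and pickle.HIGHEST_PROTOCOL can be used instead of a hard-coded number:

import cPickle

with open('Dictionary.txt', 'wb') as f:
    cPickle.dump(d, f, protocol=2)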
Answered by Andrew Scott Evans
I recently had this problem. After trying cPickle with ASCII and the binary protocol 2, I found that my SVM from scikit-learn, trained on 20+ GB of data, was not pickling due to a memory error. However, the dill package seemed to resolve the issue. Dill will not create many improvements for a dictionary but may help with streaming. It is meant to stream pickled bytes across a network.
import dill

# Dump the object (e.g. your dictionary or model) to disk.
with open(path, 'wb') as fp:
    dill.dump(obj, fp)
# Load it back later, with the file reopened for reading.
with open(path, 'rb') as fp:
    obj = dill.load(fp)
If efficiency is an issue, try loading/saving to a database. In this instance, your storage solution may be the issue. At 123 MB, Pandas should be fine. However, if the machine has limited memory, SQL offers fast, optimized bag operations over data, usually with multithreaded support. My poly-kernel SVM saved.
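As a concrete (hypothetical) illustration of the database suggestion, the standard sqlite3 module can hold the dictionary as a key/value table, so only changed entries need to be written; the table and key names below are made up:

import sqlite3

conn = sqlite3.connect('dictionary.db')
conn.execute('CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value REAL)')
# Upsert only the entries that changed instead of rewriting the whole file.
conn.execute('INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)', ('some_event', 0.75))
conn.commit()
value = conn.execute('SELECT value FROM kv WHERE key = ?', ('some_event',)).fetchone()[0]
conn.close()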
Answered by Mike McKerns
I am the author of a package called klepto (and also the author of dill). klepto is built to store and retrieve objects in a very simple way, and provides a simple dictionary interface to databases, memory cache, and storage on disk. Below, I show storing large objects in a "directory archive", which is a filesystem directory with one file per entry. I choose to serialize the objects (it's slower, but uses dill, so you can store almost any object), and I choose a cache. Using a memory cache enables me to have fast access to the directory archive, without having to have the entire archive in memory. Interacting with a database or file can be slow, but interacting with memory is fast… and you can populate the memory cache as you like from the archive.
>>> import klepto
>>> d = klepto.archives.dir_archive('stuff', cached=True, serialized=True)
>>> d
dir_archive('stuff', {}, cached=True)
>>> import numpy
>>> # add three entries to the memory cache
>>> d['big1'] = numpy.arange(1000)
>>> d['big2'] = numpy.arange(1000)
>>> d['big3'] = numpy.arange(1000)
>>> # dump from memory cache to the on-disk archive
>>> d.dump()
>>> # clear the memory cache
>>> d.clear()
>>> d
dir_archive('stuff', {}, cached=True)
>>> # only load one entry to the cache from the archive
>>> d.load('big1')
>>> d['big1'][-3:]
array([997, 998, 999])
>>>
klepto provides fast and flexible access to large amounts of storage, and if the archive allows parallel access (e.g. some databases) then you can read results in parallel. It's also easy to share results in different parallel processes or on different machines. Here I create a second archive instance, pointed at the same directory archive. It's easy to pass keys between the two objects, and it works no differently from a different process.
>>> f = klepto.archives.dir_archive('stuff', cached=True, serialized=True)
>>> f
dir_archive('stuff', {}, cached=True)
>>> # add some small objects to the first cache
>>> d['small1'] = lambda x:x**2
>>> d['small2'] = (1,2,3)
>>> # dump the objects to the archive
>>> d.dump()
>>> # load one of the small objects to the second cache
>>> f.load('small2')
>>> f
dir_archive('stuff', {'small2': (1, 2, 3)}, cached=True)
You can also pick from various levels of file compression, and whether you want the files to be memory-mapped. There are a lot of different options, both for file backends and database backends. The interface is identical, however.
With regard to your other questions about garbage collection and editing of portions of the dictionary, both are possible with klepto, as you can individually load and remove objects from the memory cache, dump, load, and synchronize with the archive backend, or use any of the other dictionary methods.
See a longer tutorial here: https://github.com/mmckerns/tlkklp
Get klepto here: https://github.com/uqfoundation
Answered by gidim
None of the above answers worked for me. I ended up using Hickle, which is a drop-in replacement for pickle based on HDF5. Instead of saving the data to a pickle, it saves it to an HDF5 file. The API is identical for most use cases and it has some really cool features such as compression.
pip install hickle
Example:
import hickle as hkl
import numpy as np

# Create a numpy array of data
array_obj = np.ones(32768, dtype='float32')

# Dump to file
hkl.dump(array_obj, 'test.hkl', mode='w')

# Load data
array_hkl = hkl.load('test.hkl')
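As a usage note on the compression feature mentioned above, hickle's documentation shows passing HDF5 compression options through dump; treat the exact keyword as an assumption that may vary between versions:

import hickle as hkl
import numpy as np

array_obj = np.ones(32768, dtype='float32')
# Ask h5py to gzip-compress the stored dataset (keyword per hickle's docs).
hkl.dump(array_obj, 'test_gzip.hkl', mode='w', compression='gzip')
array_hkl = hkl.load('test_gzip.hkl')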
Answered by Ch HaXam
I was having the same issue. I used joblib and the work was done, in case someone wants to know about other possibilities.
save the model to disk
from sklearn.externals import joblib
filename = 'finalized_model.sav'
joblib.dump(model, filename)
some time later... load the model from disk
loaded_model = joblib.load(filename)
result = loaded_model.score(X_test, Y_test)
print(result)
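Note that sklearn.externals.joblib has since been removed from newer scikit-learn releases; the standalone joblib package exposes the same dump/load API. Assuming model, X_test, and Y_test come from the snippet above, an updated version would look like this:

import joblib

filename = 'finalized_model.sav'
joblib.dump(model, filename)          # save the model to disk
loaded_model = joblib.load(filename)  # load it back later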
Answered by lyron
This may seem trivial, but if you are not already running 64-bit Python, try switching to it.
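A quick way to check which interpreter is in use (a 32-bit build caps the process at a few GB of address space, which makes MemoryError far more likely with large pickles):

import struct
import sys

print(sys.version)
print(struct.calcsize('P') * 8, 'bit')  # prints 32 or 64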