Python 中 Pickle 的 MemoryError
声明:本页面是 StackOverFlow 热门问题的中英对照翻译,遵循 CC BY-SA 4.0 协议;如果您需要使用它,必须同样遵循 CC BY-SA 许可,注明原文地址和作者信息,并将其归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/28068872/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): StackOverFlow
MemoryError with Pickle in Python
提问by flotr
I am processing some data and I have stored the results in three dictionaries, and I have saved them to the disk with Pickle. Each dictionary has 500-1000MB.
我正在处理一些数据,把结果存储在三个字典中,并使用 Pickle 将它们保存到磁盘。每个字典大约有 500-1000MB。
Now I am loading them with:
现在我正在加载它们:
import pickle

with open('dict1.txt', "rb") as myFile:
    dict1 = pickle.load(myFile)
However, already at loading the first dictionary I get:
但是,刚加载第一个字典时,我就得到了以下错误:
python(3716,0xa08ed1d4) malloc: *** mach_vm_map(size=1048576) failed (error code=3)
*** error: can't allocate region securely
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1019, in load_empty_dictionary
    self.stack.append({})
MemoryError
How to solve this? My computer has 16GB of RAM so I find it unusual that loading a 800MB dictionary crashes. What I also find unusual is that there were no problems while saving the dictionaries.
如何解决这个问题?我的电脑有 16GB 内存,所以加载一个 800MB 的字典就崩溃让我觉得很不正常。同样奇怪的是,保存这些字典时并没有出现任何问题。
Further, in future I plan to process more data resulting in larger dictionaries (3-4GB on the disk), so any advice how to improve the efficiency is appreciated.
此外,我以后还打算处理更多数据,生成更大的字典(磁盘上 3-4GB),因此非常感谢任何有关提高效率的建议。
采纳答案by Mike McKerns
If your data in the dictionaries are numpy arrays, there are packages (such as joblib and klepto) that make pickling large arrays efficient, as both klepto and joblib understand how to use a minimal state representation for a numpy.array. If you don't have array data, my suggestion would be to use klepto to store the dictionary entries in several files (instead of a single file) or to a database.
如果字典中的数据是 numpy 数组,那么有一些包(例如 joblib 和 klepto)可以高效地对大数组进行 pickle 序列化,因为 klepto 和 joblib 都知道如何对 numpy.array 使用最小的状态表示。如果您的数据不是数组,我的建议是使用 klepto 把字典条目存储到多个文件(而不是单个文件)或数据库中。
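For the numpy case, a minimal sketch of the joblib approach (the file name and array shapes below are only illustrative):
针对 numpy 数组的情况,下面是使用 joblib 的一个最小示例(文件名和数组形状仅作演示):

import numpy as np
from joblib import dump, load

# an illustrative dictionary of large numpy arrays
data = {
    "features": np.random.rand(10000, 100),
    "labels": np.random.randint(0, 2, size=10000),
}

# joblib stores numpy arrays with a compact representation; compress is optional
dump(data, "data.joblib", compress=3)

# later: read the whole dictionary back
restored = load("data.joblib")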
See my answer to a very closely related question https://stackoverflow.com/a/25244747/2379433, if you are ok with pickling to several files instead of a single file, would like to save/load your data in parallel, or would like to easily experiment with a storage format and backend to see which works best for your case. Also see: https://stackoverflow.com/a/21948720/2379433 for other potential improvements, and here too: https://stackoverflow.com/a/24471659/2379433.
如果您可以接受把数据 pickle 到多个文件而不是单个文件、希望并行保存/加载数据,或者想方便地试验不同的存储格式和后端以确定哪种最适合您的情况,请参阅我对一个密切相关问题的回答 https://stackoverflow.com/a/25244747/2379433。另请参阅 https://stackoverflow.com/a/21948720/2379433 了解其他可能的改进,以及 https://stackoverflow.com/a/24471659/2379433。
As the links above discuss, you could use klepto -- which provides you with the ability to easily store dictionaries to disk or database, using a common API. klepto also enables you to pick a storage format (pickle, json, etc.) -- also HDF5 (or a SQL database) is another good option as it allows parallel access. klepto can utilize both specialized pickle formats (like numpy's) and compression (if you care about size and not speed of accessing the data).
正如上面链接中讨论的,您可以使用 klepto,它提供了一个通用 API,可以方便地把字典存储到磁盘或数据库。klepto 还允许您选择存储格式(pickle、json 等);HDF5(或 SQL 数据库)也是另一个不错的选择,因为它支持并行访问。klepto 既可以使用专门的 pickle 格式(例如 numpy 的格式),也可以使用压缩(如果您更关心大小而不是访问数据的速度)。
klepto gives you the option to store the dictionary with an "all-in-one" file or "one-entry-per" file, and also can leverage multiprocessing or multithreading -- meaning that you can save and load dictionary items to/from the backend in parallel. For examples, see the above links.
klepto 让您可以选择把字典存储为"多合一"的单个文件,还是"每个条目一个"的多个文件,并且可以利用多进程或多线程,也就是说您可以并行地把字典条目保存到后端或从后端加载。示例请参阅上面的链接。
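A rough sketch of the "one-entry-per-file" idea using klepto's dir_archive; the exact keyword arguments may differ between klepto versions, and the directory name and keys here are made up:
下面是用 klepto 的 dir_archive 实现"每个条目一个文件"思路的粗略示例;具体的关键字参数可能因 klepto 版本而异,目录名和键名仅作演示:

from klepto.archives import dir_archive

# cached=True keeps a fast in-memory view; serialized=True pickles each entry
arch = dir_archive("dict1_store", cached=True, serialized=True)
arch["part1"] = {"a": 1, "b": 2}      # illustrative entries
arch["part2"] = list(range(1000))
arch.dump()                            # write cached entries to disk, one file per key

# later, or in another process: load only the keys you actually need
arch = dir_archive("dict1_store", cached=True, serialized=True)
arch.load("part1")                     # pulls just 'part1' into the in-memory cache
print(arch["part1"])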
回答by inixmon
This is an inherent problem of pickle, which is intended for use with rather small amounts of data. The size of the dictionaries, when loaded into memory, is many times larger than on disk.
这是 pickle 的一个固有问题,它本来就是为相当小的数据量设计的。字典加载到内存后,其大小会比在磁盘上时大很多倍。
After loading a pickle file of 100MB, you may well have a dictionary of almost 1GB or so. There are some formulas on the web to calculate the overhead, but I can only recommend using some decent database like MySQL or PostgreSQL for such amounts of data.
加载一个 100MB 的 pickle 文件后,您得到的字典可能接近 1GB。网上有一些估算这种开销的公式,但对于这样规模的数据,我只能建议使用像 MySQL 或 PostgreSQL 这样比较成熟的数据库。
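The answer names MySQL and PostgreSQL; as a self-contained illustration of the same idea, here is a sketch using Python's built-in sqlite3 with per-entry pickling, so entries can be read one at a time instead of unpickling everything at once (the file, table, and key names are made up):
这个回答提到的是 MySQL 和 PostgreSQL;作为同一思路的自包含演示,下面的示例使用 Python 内置的 sqlite3,对每个条目单独 pickle,这样可以按需读取单个条目,而不必一次性反序列化全部数据(文件名、表名和键名均为虚构):

import pickle
import sqlite3

conn = sqlite3.connect("results.db")
conn.execute("CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value BLOB)")

def put(key, obj):
    # pickle each entry separately so only one entry is held in memory at a time
    blob = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    conn.execute("REPLACE INTO kv (key, value) VALUES (?, ?)", (key, sqlite3.Binary(blob)))
    conn.commit()

def get(key):
    row = conn.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
    return pickle.loads(bytes(row[0])) if row else None

put("entry1", {"scores": [0.1, 0.2, 0.3]})
print(get("entry1"))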
回答by Jett
I suppose you are using 32-bit Python, which is limited to 4GB of memory. You should use 64-bit Python instead of 32-bit. I have tried it with a pickled dict of more than 1.7GB, and I didn't run into any problem except that loading took longer.
我猜您用的是 32 位 Python,它有 4GB 的内存限制。您应该改用 64 位 Python,而不是 32 位。我试过了,我的 pickle 字典超过 1.7GB,除了耗时变长之外没有遇到任何问题。
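A quick way to check whether you are running a 32-bit or 64-bit interpreter:
快速检查您运行的是 32 位还是 64 位解释器的方法:

import struct
import sys

print(struct.calcsize("P") * 8)   # prints 32 or 64 (pointer size in bits)
print(sys.maxsize > 2**32)        # True on a 64-bit build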