Python 中 Pickle 的 MemoryError
声明:本页面是 StackOverFlow 热门问题的中英对照翻译,遵循 CC BY-SA 4.0 协议;如果您需要使用它,必须同样遵循 CC BY-SA 许可,注明原文地址和作者信息,并将其归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/28068872/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): StackOverFlow
MemoryError with Pickle in Python
提问by flotr
I am processing some data and I have stored the results in three dictionaries, and I have saved them to the disk with Pickle. Each dictionary has 500-1000MB.
我正在处理一些数据,把结果存储在三个字典中,并使用 Pickle 将它们保存到磁盘。每个字典大约有 500-1000MB。
Now I am loading them with:
现在我正在加载它们:
import pickle

with open('dict1.txt', "rb") as myFile:
    dict1 = pickle.load(myFile)
However, already at loading the first dictionary I get:
但是,刚加载第一个字典时,我就得到了以下错误:
python(3716,0xa08ed1d4) malloc: *** mach_vm_map(size=1048576) failed (error code=3)
*** error: can't allocate region securely
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1019, in load_empty_dictionary
    self.stack.append({})
MemoryError
How to solve this? My computer has 16GB of RAM so I find it unusual that loading a 800MB dictionary crashes. What I also find unusual is that there were no problems while saving the dictionaries.
如何解决这个问题?我的电脑有 16GB 内存,所以加载一个 800MB 的字典就崩溃让我觉得很不正常。同样奇怪的是,保存这些字典时并没有出现任何问题。
Further, in future I plan to process more data resulting in larger dictionaries (3-4GB on the disk), so any advice how to improve the efficiency is appreciated.
此外,我以后还打算处理更多数据,生成更大的字典(磁盘上 3-4GB),因此非常感谢任何有关提高效率的建议。
采纳答案by Mike McKerns
If your data in the dictionaries are numpy arrays, there are packages (such as joblib and klepto) that make pickling large arrays efficient, as both klepto and joblib understand how to use a minimal state representation for a numpy.array. If you don't have array data, my suggestion would be to use klepto to store the dictionary entries in several files (instead of a single file) or to a database.
如果字典中的数据是 numpy 数组,那么有一些包(例如 joblib 和 klepto)可以高效地对大数组进行 pickle 序列化,因为 klepto 和 joblib 都知道如何对 numpy.array 使用最小的状态表示。如果您的数据不是数组,我的建议是使用 klepto 把字典条目存储到多个文件(而不是单个文件)或数据库中。
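For the numpy case, a minimal sketch of the joblib approach (the file name and array shapes below are only illustrative):
针对 numpy 数组的情况,下面是使用 joblib 的一个最小示例(文件名和数组形状仅作演示):

import numpy as np
from joblib import dump, load

# an illustrative dictionary of large numpy arrays
data = {
    "features": np.random.rand(10000, 100),
    "labels": np.random.randint(0, 2, size=10000),
}

# joblib stores numpy arrays with a compact representation; compress is optional
dump(data, "data.joblib", compress=3)

# later: read the whole dictionary back
restored = load("data.joblib")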
See my answer to a very closely related question https://stackoverflow.com/a/25244747/2379433, if you are ok with pickling to several files instead of a single file, would like to save/load your data in parallel, or would like to easily experiment with a storage format and backend to see which works best for your case. Also see: https://stackoverflow.com/a/21948720/2379433 for other potential improvements, and here too: https://stackoverflow.com/a/24471659/2379433.
如果您可以接受把数据 pickle 到多个文件而不是单个文件、希望并行保存/加载数据,或者想方便地试验不同的存储格式和后端以确定哪种最适合您的情况,请参阅我对一个密切相关问题的回答 https://stackoverflow.com/a/25244747/2379433。另请参阅 https://stackoverflow.com/a/21948720/2379433 了解其他可能的改进,以及 https://stackoverflow.com/a/24471659/2379433。
As the links above discuss, you could use klepto -- which provides you with the ability to easily store dictionaries to disk or database, using a common API. klepto also enables you to pick a storage format (pickle, json, etc.) -- also HDF5 (or a SQL database) is another good option as it allows parallel access. klepto can utilize both specialized pickle formats (like numpy's) and compression (if you care about size and not speed of accessing the data).
正如上面链接中讨论的,您可以使用 klepto,它提供了一个通用 API,可以方便地把字典存储到磁盘或数据库。klepto 还允许您选择存储格式(pickle、json 等);HDF5(或 SQL 数据库)也是另一个不错的选择,因为它支持并行访问。klepto 既可以使用专门的 pickle 格式(例如 numpy 的格式),也可以使用压缩(如果您更关心大小而不是访问数据的速度)。
klepto gives you the option to store the dictionary with an "all-in-one" file or "one-entry-per" file, and also can leverage multiprocessing or multithreading -- meaning that you can save and load dictionary items to/from the backend in parallel. For examples, see the above links.
klepto 让您可以选择把字典存储为"多合一"的单个文件,还是"每个条目一个"的多个文件,并且可以利用多进程或多线程,也就是说您可以并行地把字典条目保存到后端或从后端加载。示例请参阅上面的链接。
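A rough sketch of the "one-entry-per-file" idea using klepto's dir_archive; the exact keyword arguments may differ between klepto versions, and the directory name and keys here are made up:
下面是用 klepto 的 dir_archive 实现"每个条目一个文件"思路的粗略示例;具体的关键字参数可能因 klepto 版本而异,目录名和键名仅作演示:

from klepto.archives import dir_archive

# cached=True keeps a fast in-memory view; serialized=True pickles each entry
arch = dir_archive("dict1_store", cached=True, serialized=True)
arch["part1"] = {"a": 1, "b": 2}      # illustrative entries
arch["part2"] = list(range(1000))
arch.dump()                            # write cached entries to disk, one file per key

# later, or in another process: load only the keys you actually need
arch = dir_archive("dict1_store", cached=True, serialized=True)
arch.load("part1")                     # pulls just 'part1' into the in-memory cache
print(arch["part1"])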
回答by inixmon
This is an inherent problem of pickle, which is intended for use with rather small amounts of data. The size of the dictionaries, when loaded into memory, is many times larger than on disk.
这是 pickle 的一个固有问题,它本来就是为相当小的数据量设计的。字典加载到内存后,其大小会比在磁盘上时大很多倍。
After loading a pickle file of 100MB, you may well have a dictionary of almost 1GB or so. There are some formulas on the web to calculate the overhead, but I can only recommend using some decent database like MySQL or PostgreSQL for such amounts of data.
加载一个 100MB 的 pickle 文件后,您得到的字典可能接近 1GB。网上有一些估算这种开销的公式,但对于这样规模的数据,我只能建议使用像 MySQL 或 PostgreSQL 这样比较成熟的数据库。
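The answer names MySQL and PostgreSQL; as a self-contained illustration of the same idea, here is a sketch using Python's built-in sqlite3 with per-entry pickling, so entries can be read one at a time instead of unpickling everything at once (the file, table, and key names are made up):
这个回答提到的是 MySQL 和 PostgreSQL;作为同一思路的自包含演示,下面的示例使用 Python 内置的 sqlite3,对每个条目单独 pickle,这样可以按需读取单个条目,而不必一次性反序列化全部数据(文件名、表名和键名均为虚构):

import pickle
import sqlite3

conn = sqlite3.connect("results.db")
conn.execute("CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value BLOB)")

def put(key, obj):
    # pickle each entry separately so only one entry is held in memory at a time
    blob = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    conn.execute("REPLACE INTO kv (key, value) VALUES (?, ?)", (key, sqlite3.Binary(blob)))
    conn.commit()

def get(key):
    row = conn.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
    return pickle.loads(bytes(row[0])) if row else None

put("entry1", {"scores": [0.1, 0.2, 0.3]})
print(get("entry1"))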
回答by Jett
I suppose you are using 32-bit Python, which is limited to 4GB of memory. You should use 64-bit Python instead of 32-bit. I have tried it with a pickled dict of more than 1.7GB, and I didn't run into any problem except that loading took longer.
我猜您用的是 32 位 Python,它有 4GB 的内存限制。您应该改用 64 位 Python,而不是 32 位。我试过了,我的 pickle 字典超过 1.7GB,除了耗时变长之外没有遇到任何问题。
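A quick way to check whether you are running a 32-bit or 64-bit interpreter:
快速检查您运行的是 32 位还是 64 位解释器的方法:

import struct
import sys

print(struct.calcsize("P") * 8)   # prints 32 or 64 (pointer size in bits)
print(sys.maxsize > 2**32)        # True on a 64-bit build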