How to solve the memory error in Python
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license, link to the original post, and attribute it to the original authors (not me): StackOverflow
Original post: http://stackoverflow.com/questions/37347397/
Asked by flyingmouse
I am dealing with several large txt files, each of which has about 8,000,000 lines. A short example of the lines:
usedfor zipper fasten_coat
usedfor zipper fasten_jacket
usedfor zipper fasten_pant
usedfor your_foot walk
atlocation camera cupboard
atlocation camera drawer
atlocation camera house
relatedto more plenty
The code to store them in a dictionary is:
import collections

dicCSK = collections.defaultdict(list)
for line in finCSK:
    line = line.strip('\n')
    try:
        r, c1, c2 = line.split(" ")
    except ValueError:
        print line  # report the malformed line
        continue    # skip it so stale r/c1/c2 values aren't appended
    dicCSK[c1].append(r + " " + c2)
It runs fine on the first txt file, but when it gets to the second txt file I get a MemoryError.
I am using Windows 7 64-bit with Python 2.7 32-bit, an Intel i5 CPU, and 8 GB of memory. How can I solve the problem?
Further explanation:
I have four large files; each file contains different information about many entities. For example, I want to find all the information for cat, its parent node animal, its child node persian cat, and so on. So my program first reads all the txt files into dictionaries, then scans all the dictionaries to find the information for cat and for its parents and children.
Answered by ShadowRanger
Simplest solution: You're probably running out of virtual address space (any other form of error usually means running really slowly for a long time before you finally get a MemoryError). This is because a 32-bit application on Windows (and most OSes) is limited to 2 GB of user-mode address space (Windows can be tweaked to make it 3 GB, but that's still a low cap). You've got 8 GB of RAM, but your program can't use (at least) 3/4 of it. Python has a fair amount of per-object overhead (object header, allocation alignment, etc.); odds are the strings alone are using close to a GB of RAM, and that's before you deal with the overhead of the dictionary, the rest of your program, and the rest of Python. If memory space fragments enough and the dictionary needs to grow, it may not have enough contiguous space to reallocate, and you'll get a MemoryError.
Install a 64-bit version of Python (if you can, I'd recommend upgrading to Python 3 for other reasons); it will use more memory, but then it will have access to a lot more memory space (and more physical RAM as well).
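If you're not sure which build you're running, a quick sanity check (a minimal sketch; struct.calcsize("P") gives the size of a pointer in bytes):

import struct
import sys

print(struct.calcsize("P") * 8)  # 32 on a 32-bit build, 64 on a 64-bit build
print(sys.maxsize)               # 2**31 - 1 on 32-bit builds, 2**63 - 1 on 64-bit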
If that's not enough, consider converting to a sqlite3 database (or some other DB), so it naturally spills to disk when the data gets too large for main memory, while still having fairly efficient lookup.
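For example, a minimal sqlite3 sketch (the file names, table name, and column names here are illustrative assumptions, following the three-column "relation entity1 entity2" format of the sample lines):

import sqlite3

conn = sqlite3.connect("csk.db")
conn.execute("CREATE TABLE IF NOT EXISTS triples (r TEXT, c1 TEXT, c2 TEXT)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_c1 ON triples (c1)")  # fast lookup by entity

with open("input1.txt") as finCSK:
    rows = (line.strip("\n").split(" ") for line in finCSK)
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)",
                     (row for row in rows if len(row) == 3))  # skip malformed lines
conn.commit()

# The dicCSK[c1] lookup then becomes an indexed query:
for r, c2 in conn.execute("SELECT r, c2 FROM triples WHERE c1 = ?", ("cat",)):
    print(r + " " + c2)

Thanks to the index on c1, each lookup reads only the relevant rows from disk instead of keeping every file's contents in RAM.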
Answered by Levi Noecker
Assuming your example text is representative of all the text, one line would consume about 75 bytes on my machine:
In [3]: sys.getsizeof('usedfor zipper fasten_coat')
Out[3]: 75
Doing some rough math:
75 bytes * 8,000,000 lines / 1024 / 1024 = ~572 MB
So roughly 572 MB just to store the strings for one of these files. Once you start adding in additional, similarly structured and sized files, you'll quickly approach your virtual address space limit, as mentioned in @ShadowRanger's answer.
If upgrading your Python isn't feasible for you, or if it would only kick the can down the road (you have finite physical memory, after all), you really have two options: write your intermediate results to temporary files between loading and reading the input files, or write your results to a database. Since you need to further post-process the strings after aggregating them, writing to a database would be the superior approach.
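For the temporary-file option, a minimal sketch (the input and output file names are illustrative; each file's dictionary is dumped to disk and freed before the next file is loaded):

import collections
import json

for i, name in enumerate(["file1.txt", "file2.txt", "file3.txt", "file4.txt"]):
    dicCSK = collections.defaultdict(list)
    with open(name) as finCSK:
        for line in finCSK:
            try:
                r, c1, c2 = line.strip("\n").split(" ")
            except ValueError:
                continue  # skip malformed lines
            dicCSK[c1].append(r + " " + c2)
    with open("partial_%d.json" % i, "w") as fout:
        json.dump(dicCSK, fout)  # only one file's dictionary is in memory at a time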