Alternatives to keeping large lists in memory (python)
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute the content to its original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/1989251/
Alternatives to keeping large lists in memory (python)
Asked by Vincent
If I have a list (or array, dictionary, ...) in Python that could exceed the available memory address space (32-bit Python), what are the options and their relative speeds (other than not making a list that large)? The list could exceed the memory, but I have no way of knowing beforehand. Once it starts exceeding 75%, I would no longer like to keep the list in memory (or the new items, anyway). Is there a way to convert to a file-based approach mid-stream?
What are the best (speed in and out) file storage options?
I just need to store a simple list of numbers. No need for random Nth-element access, just append/pop-type operations.
Answer by Alex Martelli
If your "numbers" are simple-enough ones (signed or unsigned integers of up to 4 bytes each, or floats of 4 or 8 bytes each), I recommend the standard library array module as the best way to keep a few millions of them in memory (the "tip" of your "virtual array") with a binary file (open for binary R/W) backing the rest of the structure on disk. array.array has very fast fromfile and tofile methods to facilitate the moving of data back and forth.
I.e., basically, assuming for example unsigned-long numbers, something like:
import array
import os

# no more than 100 million items in memory at a time
MAXINMEM = int(1e8)

class bigarray(object):
    def __init__(self):
        # binary mode: array.tofile/fromfile need a binary file
        self.f = open('afile.dat', 'w+b')
        self.a = array.array('L')
    def append(self, n):
        self.a.append(n)
        if len(self.a) > MAXINMEM:
            self.a.tofile(self.f)
            del self.a[:]
    def pop(self):
        if not len(self.a):
            try:
                # position at the last MAXINMEM items on disk...
                self.f.seek(-self.a.itemsize * MAXINMEM, os.SEEK_END)
            except (IOError, OSError):
                # ...or at the start, if fewer than that are left
                self.f.seek(0)
            pos = self.f.tell()
            try:
                self.a.fromfile(self.f, MAXINMEM)
            except EOFError:
                pass  # a short read still lands in self.a
            self.f.seek(pos)
            self.f.truncate()
        return self.a.pop()  # empty disk and memory -> normal IndexError &c
Of course you can add other methods as necessary (e.g. keep track of the overall length, add extend, whatever), but if pop and append are indeed all you need, this should serve.
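As a quick sanity check of the tofile/fromfile round-trip the class above relies on (the file name and item count here are arbitrary, chosen just for illustration):

```python
import array
import os
import tempfile

# Round-trip a chunk of unsigned longs through a binary file --
# the same mechanism bigarray uses to spill and reload data.
a = array.array('L', range(100000))
path = os.path.join(tempfile.mkdtemp(), 'demo.dat')
with open(path, 'w+b') as f:
    a.tofile(f)            # dump the whole array to disk
    f.seek(0)
    b = array.array('L')
    b.fromfile(f, 100000)  # read it back
assert a == b
```

Because fromfile reads raw machine values with no per-item parsing, this is about as fast as Python-level disk I/O gets for numbers.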
Answer by Ned Batchelder
There are probably dozens of ways to store your list data in a file instead of in memory. How you choose to do it will depend entirely on what sort of operations you need to perform on the data. Do you need random access to the Nth element? Do you need to iterate over all elements? Will you be searching for elements that match certain criteria? What form do the list elements take? Will you only be inserting at the end of the list, or also in the middle? Is there metadata you can keep in memory with the bulk of the items on disk? And so on and so on.
One possibility is to structure your data relationally, and store it in a SQLite database.
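For the append/pop-only workload in the question, a minimal sketch of the SQLite route might look like this (the table and column names are my own, not part of the answer; use a file path instead of :memory: to actually persist between runs):

```python
import sqlite3

# A disk-backed "stack" of numbers via SQLite.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE nums (id INTEGER PRIMARY KEY AUTOINCREMENT, n INTEGER)')

def push(n):
    conn.execute('INSERT INTO nums (n) VALUES (?)', (n,))

def pop():
    # Highest id = most recently pushed item.
    row = conn.execute('SELECT id, n FROM nums ORDER BY id DESC LIMIT 1').fetchone()
    if row is None:
        raise IndexError('pop from empty list')
    conn.execute('DELETE FROM nums WHERE id = ?', (row[0],))
    return row[1]

push(1); push(2); push(3)
print(pop())  # -> 3
```

SQLite only pays off once you need the relational features (queries, indexes, transactions); for pure append/pop it is slower than a flat binary file.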
Answer by Dave Kirby
The answer is very much "it depends".
What are you storing in the lists? Strings? Integers? Objects?
How often is the list written to compared with being read? Are items only appended on the end, or can entries be modified or inserted in the middle?
If you are only appending to the end then writing to a flat file may be the simplest thing that could possibly work.
If you are storing objects of variable size such as strings then maybe keep an in-memory index of the start of each string, so you can read it quickly.
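A hedged sketch of that idea: append strings to a file while remembering each start offset in memory (the file name and newline-terminated format are illustrative choices of mine):

```python
import os
import tempfile

# Store variable-length strings on disk; keep only their offsets in RAM.
path = os.path.join(tempfile.mkdtemp(), 'strings.dat')
offsets = []
with open(path, 'wb') as f:
    for s in ['alpha', 'beta', 'gamma']:
        offsets.append(f.tell())           # where this string starts
        f.write(s.encode('utf-8') + b'\n')

with open(path, 'rb') as f:
    f.seek(offsets[1])                     # jump straight to the second string
    print(f.readline().rstrip().decode('utf-8'))  # -> beta
```

The index costs one integer per string in memory, while the (potentially huge) string data stays on disk.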
If you want dictionary behaviour then look at the db modules - dbm, gdbm, bsddb, etc.
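A minimal sketch of that route, using the modern dbm spelling of the module (the key, value, and file name are illustrative; note that values come back as bytes):

```python
import dbm
import os
import tempfile

# Dictionary-like, disk-backed storage via the standard dbm module.
path = os.path.join(tempfile.mkdtemp(), 'demo_db')
with dbm.open(path, 'c') as db:    # 'c': create the database if needed
    db['answer'] = '42'            # str keys/values are encoded to bytes
with dbm.open(path, 'r') as db:
    print(db['answer'])            # -> b'42'
```

Both keys and values must be bytes (or str, which dbm encodes for you), so anything richer has to be serialized first; that is exactly the gap shelve fills.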
If you want random access writing then maybe a SQL database may be better.
Whatever you do, going to disk is going to be orders of magnitude slower than in-memory, but without knowing how the data is going to be used it is impossible to be more specific.
Edit: From your updated requirements, I would go with a flat file and keep an in-memory buffer of the last N elements.
Answer by Shane Holloway
Well, if you are looking for speed and your data is numerical in nature, you could consider using numpy and PyTables or h5py. From what I remember, the interface is not as nice as simple lists, but the scalability is fantastic!
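If numpy is available, the simplest taste of this approach is a disk-backed array via numpy.memmap (the file name, dtype, and size below are illustrative, not from the answer):

```python
import os
import tempfile
import numpy as np

# A million uint32 values living in a file, accessed like an in-memory array.
path = os.path.join(tempfile.mkdtemp(), 'demo.mm')
m = np.memmap(path, dtype='uint32', mode='w+', shape=(1000000,))
m[:] = np.arange(1000000, dtype='uint32')
m.flush()          # push pending changes out to the file on disk
print(int(m[-1]))  # -> 999999
```

The OS pages data in and out on demand, so the array can be far larger than physical RAM; PyTables and h5py add structured, compressed storage on top of the same idea.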
Answer by Luka Rahne
Did you check the shelve Python module, which is based on pickle?
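For reference, a tiny sketch of shelve in action (the key and file name are arbitrary examples of mine):

```python
import os
import tempfile
import shelve

# A persistent, pickle-backed dictionary from the standard library.
path = os.path.join(tempfile.mkdtemp(), 'demo_shelf')
with shelve.open(path) as s:
    s['nums'] = [1, 2, 3]  # any picklable value works
with shelve.open(path) as s:
    print(s['nums'])       # -> [1, 2, 3]
```

Unlike the raw dbm modules, shelve pickles values for you, so lists and other Python objects round-trip transparently.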
Answer by RyanWilcox
You might want to consider a different kind of structure: not a list, but figuring out how to do (your task) with a generator or a custom iterator.
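A sketch of the shift in thinking (the squared-number computation is just a stand-in for whatever your task computes): produce and consume values lazily instead of materializing a list.

```python
# One value at a time, O(1) memory, no list ever built.
def squares(limit):
    n = 0
    while n < limit:
        yield n * n   # illustrative computation
        n += 1

total = sum(squares(1000))  # consumes the stream without storing it
print(total)                # -> 332833500
```

This only works when the task can be expressed as a single pass over the data, but when it can, the memory problem disappears entirely.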
Answer by Bjorn
Modern operating systems will handle this for you without you having to worry about it. It's called virtual memory.
Answer by northtree
You can try blist: https://pypi.python.org/pypi/blist/
The blist is a drop-in replacement for the Python list that provides better performance when modifying large lists.
Answer by rob
What about a document oriented database?
There are several alternatives; I think the most known one currently is CouchDB, but you can also go for Tokyo Cabinet or MongoDB. The last one has the advantage of Python bindings maintained directly by the main project, without requiring any additional module.