Alternatives to keeping large lists in memory (python)
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute the content to its original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/1989251/
Alternatives to keeping large lists in memory (python)
Asked by Vincent
If I have a list (or array, dictionary, ...) in Python that could exceed the available memory address space (32-bit Python), what are the options and their relative speeds (other than not making a list that large)? The list could exceed the memory, but I have no way of knowing beforehand. Once it starts exceeding 75%, I would no longer like to keep the list in memory (or the new items, anyway). Is there a way to convert to a file-based approach mid-stream?
What are the best (speed in and out) file storage options?
I just need to store a simple list of numbers. No need for random Nth-element access, just append/pop-type operations.
Answer by Alex Martelli
If your "numbers" are simple-enough ones (signed or unsigned integers of up to 4 bytes each, or floats of 4 or 8 bytes each), I recommend the standard library array module as the best way to keep a few millions of them in memory (the "tip" of your "virtual array") with a binary file (open for binary R/W) backing the rest of the structure on disk. array.array has very fast fromfile and tofile methods to facilitate the moving of data back and forth.
I.e., basically, assuming for example unsigned-long numbers, something like:
import array
import os

# no more than 100 million items in memory at a time
MAXINMEM = int(1e8)

class bigarray(object):
    def __init__(self):
        # binary mode: array.tofile/fromfile need a binary file
        self.f = open('afile.dat', 'w+b')
        self.a = array.array('L')
    def append(self, n):
        self.a.append(n)
        if len(self.a) > MAXINMEM:
            self.a.tofile(self.f)
            del self.a[:]
    def pop(self):
        if not len(self.a):
            try:
                # position at the last MAXINMEM items on disk...
                self.f.seek(-self.a.itemsize * MAXINMEM, os.SEEK_END)
            except (IOError, OSError):
                # ...or at the start, if fewer than that are left
                self.f.seek(0)
            pos = self.f.tell()
            try:
                self.a.fromfile(self.f, MAXINMEM)
            except EOFError:
                pass  # a short read still lands in self.a
            self.f.seek(pos)
            self.f.truncate()
        return self.a.pop()  # empty disk and memory -> normal IndexError &c
Of course you can add other methods as necessary (e.g. keep track of the overall length, add extend, whatever), but if pop and append are indeed all you need, this should serve.
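As a quick sanity check of the tofile/fromfile round-trip the class above relies on (the file name and item count here are arbitrary, chosen just for illustration):

```python
import array
import os
import tempfile

# Round-trip a chunk of unsigned longs through a binary file --
# the same mechanism bigarray uses to spill and reload data.
a = array.array('L', range(100000))
path = os.path.join(tempfile.mkdtemp(), 'demo.dat')
with open(path, 'w+b') as f:
    a.tofile(f)            # dump the whole array to disk
    f.seek(0)
    b = array.array('L')
    b.fromfile(f, 100000)  # read it back
assert a == b
```

Because fromfile reads raw machine values with no per-item parsing, this is about as fast as Python-level disk I/O gets for numbers.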
Answer by Ned Batchelder
There are probably dozens of ways to store your list data in a file instead of in memory. How you choose to do it will depend entirely on what sort of operations you need to perform on the data. Do you need random access to the Nth element? Do you need to iterate over all elements? Will you be searching for elements that match certain criteria? What form do the list elements take? Will you only be inserting at the end of the list, or also in the middle? Is there metadata you can keep in memory with the bulk of the items on disk? And so on and so on.
One possibility is to structure your data relationally, and store it in a SQLite database.
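For the append/pop-only workload in the question, a minimal sketch of the SQLite route might look like this (the table and column names are my own, not part of the answer; use a file path instead of :memory: to actually persist between runs):

```python
import sqlite3

# A disk-backed "stack" of numbers via SQLite.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE nums (id INTEGER PRIMARY KEY AUTOINCREMENT, n INTEGER)')

def push(n):
    conn.execute('INSERT INTO nums (n) VALUES (?)', (n,))

def pop():
    # Highest id = most recently pushed item.
    row = conn.execute('SELECT id, n FROM nums ORDER BY id DESC LIMIT 1').fetchone()
    if row is None:
        raise IndexError('pop from empty list')
    conn.execute('DELETE FROM nums WHERE id = ?', (row[0],))
    return row[1]

push(1); push(2); push(3)
print(pop())  # -> 3
```

SQLite only pays off once you need the relational features (queries, indexes, transactions); for pure append/pop it is slower than a flat binary file.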
Answer by Dave Kirby
The answer is very much "it depends".
What are you storing in the lists? Strings? Integers? Objects?
How often is the list written to compared with being read? Are items only appended on the end, or can entries be modified or inserted in the middle?
If you are only appending to the end then writing to a flat file may be the simplest thing that could possibly work.
If you are storing objects of variable size such as strings then maybe keep an in-memory index of the start of each string, so you can read it quickly.
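A hedged sketch of that idea: append strings to a file while remembering each start offset in memory (the file name and newline-terminated format are illustrative choices of mine):

```python
import os
import tempfile

# Store variable-length strings on disk; keep only their offsets in RAM.
path = os.path.join(tempfile.mkdtemp(), 'strings.dat')
offsets = []
with open(path, 'wb') as f:
    for s in ['alpha', 'beta', 'gamma']:
        offsets.append(f.tell())           # where this string starts
        f.write(s.encode('utf-8') + b'\n')

with open(path, 'rb') as f:
    f.seek(offsets[1])                     # jump straight to the second string
    print(f.readline().rstrip().decode('utf-8'))  # -> beta
```

The index costs one integer per string in memory, while the (potentially huge) string data stays on disk.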
If you want dictionary behaviour then look at the db modules - dbm, gdbm, bsddb, etc.
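A minimal sketch of that route, using the modern dbm spelling of the module (the key, value, and file name are illustrative; note that values come back as bytes):

```python
import dbm
import os
import tempfile

# Dictionary-like, disk-backed storage via the standard dbm module.
path = os.path.join(tempfile.mkdtemp(), 'demo_db')
with dbm.open(path, 'c') as db:    # 'c': create the database if needed
    db['answer'] = '42'            # str keys/values are encoded to bytes
with dbm.open(path, 'r') as db:
    print(db['answer'])            # -> b'42'
```

Both keys and values must be bytes (or str, which dbm encodes for you), so anything richer has to be serialized first; that is exactly the gap shelve fills.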
If you want random access writing then maybe a SQL database may be better.
Whatever you do, going to disk is going to be orders of magnitude slower than in-memory, but without knowing how the data is going to be used it is impossible to be more specific.
Edit: From your updated requirements, I would go with a flat file and keep an in-memory buffer of the last N elements.
Answer by Shane Holloway
Well, if you are looking for speed and your data is numerical in nature, you could consider using numpy and PyTables or h5py. From what I remember, the interface is not as nice as simple lists, but the scalability is fantastic!
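If numpy is available, the simplest taste of this approach is a disk-backed array via numpy.memmap (the file name, dtype, and size below are illustrative, not from the answer):

```python
import os
import tempfile
import numpy as np

# A million uint32 values living in a file, accessed like an in-memory array.
path = os.path.join(tempfile.mkdtemp(), 'demo.mm')
m = np.memmap(path, dtype='uint32', mode='w+', shape=(1000000,))
m[:] = np.arange(1000000, dtype='uint32')
m.flush()          # push pending changes out to the file on disk
print(int(m[-1]))  # -> 999999
```

The OS pages data in and out on demand, so the array can be far larger than physical RAM; PyTables and h5py add structured, compressed storage on top of the same idea.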
Answer by Luka Rahne
Did you check the shelve Python module, which is based on pickle?
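For reference, a tiny sketch of shelve in action (the key and file name are arbitrary examples of mine):

```python
import os
import tempfile
import shelve

# A persistent, pickle-backed dictionary from the standard library.
path = os.path.join(tempfile.mkdtemp(), 'demo_shelf')
with shelve.open(path) as s:
    s['nums'] = [1, 2, 3]  # any picklable value works
with shelve.open(path) as s:
    print(s['nums'])       # -> [1, 2, 3]
```

Unlike the raw dbm modules, shelve pickles values for you, so lists and other Python objects round-trip transparently.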
Answer by RyanWilcox
You might want to consider a different kind of structure: not a list, but figuring out how to do (your task) with a generator or a custom iterator.
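A sketch of the shift in thinking (the squared-number computation is just a stand-in for whatever your task computes): produce and consume values lazily instead of materializing a list.

```python
# One value at a time, O(1) memory, no list ever built.
def squares(limit):
    n = 0
    while n < limit:
        yield n * n   # illustrative computation
        n += 1

total = sum(squares(1000))  # consumes the stream without storing it
print(total)                # -> 332833500
```

This only works when the task can be expressed as a single pass over the data, but when it can, the memory problem disappears entirely.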
Answer by Bjorn
Modern operating systems will handle this for you without you having to worry about it. It's called virtual memory.
Answer by northtree
You can try blist: https://pypi.python.org/pypi/blist/
The blist is a drop-in replacement for the Python list that provides better performance when modifying large lists.
Answer by rob
What about a document oriented database?
There are several alternatives; I think the most known one currently is CouchDB, but you can also go for Tokyo Cabinet or MongoDB. The last one has the advantage of Python bindings maintained directly by the main project, without requiring any additional module.