Warning: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me), and note the original source: http://stackoverflow.com/questions/226693/

Date: 2020-11-03 19:41:05 | Source: igfitidea

Python Disk-Based Dictionary

Tags: python, database, dictionary, disk-based

Asked by Claudiu

I was running some dynamic-programming code (trying to brute-force disprove the Collatz conjecture =P) and was using a dict to store the lengths of the chains I had already computed. Obviously, it ran out of memory at some point. Is there any easy way to use some variant of a dict which will page parts of itself out to disk when it runs out of room? Obviously it will be slower than an in-memory dict, and it will probably end up eating my hard drive space, but this could apply to other problems that are not so futile.

I realized that a disk-based dictionary is pretty much a database, so I manually implemented one using sqlite3, but I didn't do it in any smart way and had it look up every element in the DB one at a time... it was about 300x slower.

Is the smartest approach just to create my own set of dicts, keep only one in memory at a time, and page them out in some efficient manner?

Accepted answer by Parand

Hash-on-disk is generally addressed with Berkeley DB or something similar - several options are listed in the Python Data Persistence documentation. You can front it with an in-memory cache, but I'd test against native performance first; with operating system caching in place it might come out about the same.

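The standard library's dbm module is one of the options listed in that documentation, and it already behaves like a disk-backed dict. A minimal sketch (the file path here is made up; dbm keys and values must be str or bytes, so numbers are stored as strings):

```python
import dbm
import os
import tempfile

# Open (or create, with the "c" flag) an on-disk hash file.
path = os.path.join(tempfile.mkdtemp(), "chains")

with dbm.open(path, "c") as store:
    store["27"] = "111"          # e.g. a precomputed chain length
    store["97"] = "118"
    length = int(store["27"])    # values come back as bytes
```

The data survives between runs: reopening the same path later gives you the same contents back.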
Answered by Matthew Trevor

The third-party shove module is also worth a look. It's very similar to shelve in that it is a simple dict-like object, but it can store to various backends (such as file, SVN, and S3), provides optional compression, and is even thread-safe. It's a very handy module.

from shove import Shove

mem_store = Shove()
file_store = Shove('file://mystore')

file_store['key'] = value

Answered by John Fouhy

The shelve module may do it; at any rate, it should be simple to test. Instead of:

self.lengths = {}

do:

import shelve
self.lengths = shelve.open('lengths.shelf')

The only catch is that keys to shelves must be strings, so you'll have to replace

self.lengths[indx]

with

self.lengths[str(indx)]

(I'm assuming your keys are just integers, as per your comment to Charles Duffy's post)

There's no built-in caching in memory, but your operating system may do that for you anyway.

[actually, that's not quite true: you can pass the argument 'writeback=True' on creation. The intent of this is to make sure storing lists and other mutable things in the shelf works correctly. But a side-effect is that the whole dictionary is cached in memory. Since this caused problems for you, it's probably not a good idea :-) ]

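Putting the pieces together, the swap might look like this minimal sketch (the shelf filename is made up, and a temp directory is used so the example is self-contained):

```python
import os
import shelve
import tempfile

# shelve pickles values to a file on disk; only the keys must be strings.
path = os.path.join(tempfile.mkdtemp(), "lengths.shelf")

lengths = shelve.open(path)
try:
    lengths[str(27)] = 111        # stringify the integer key
    cached = lengths[str(27)]     # reads go back through the disk store
finally:
    lengths.close()               # flushes any pending writes
```

Without writeback=True, each lookup goes to disk, so memory use stays bounded by what you touch, not by the whole table.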
Answered by Charles Duffy

Last time I was facing a problem like this, I rewrote it to use SQLite rather than a dict and got a massive performance increase. That performance increase was at least partially on account of the database's indexing capabilities; depending on your algorithms, YMMV.

A thin wrapper that does SQLite queries in __getitem__ and __setitem__ isn't much code to write.

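Such a wrapper might look like the following sketch (the class name, table name, and schema are all made up; a real on-disk path would replace :memory:):

```python
import sqlite3

class SQLiteDict:
    """A thin dict-like wrapper over a single SQLite table."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        # An integer primary key gives indexed lookups for free.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (k INTEGER PRIMARY KEY, v INTEGER)"
        )

    def __setitem__(self, key, value):
        self.conn.execute("REPLACE INTO kv (k, v) VALUES (?, ?)", (key, value))
        self.conn.commit()   # persist immediately (simple, but slow per write)

    def __getitem__(self, key):
        row = self.conn.execute(
            "SELECT v FROM kv WHERE k = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

d = SQLiteDict()
d[27] = 111
```

Committing on every write is the naive choice; batching writes into larger transactions is where most of the speedup over a one-query-per-element approach comes from.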
Answered by Dustin Wyatt

With a little bit of thought it seems like you could get the shelve module to do what you want.

Answered by e-satis

I've read that you think shelve is too slow and that you tried to hack together your own dict using sqlite.

Someone else did this too:

http://sebsauvage.net/python/snyppets/index.html#dbdict

It seems pretty efficient (and sebsauvage is a pretty good coder). Maybe you could give it a try?

Answered by Raymond Peng

For simple use cases sqlitedict can help. However, when you have much more complex databases you might want to try one of the more upvoted answers.

Answered by Vinko Vrsalovic

You should bring in more than one item at a time if there's some heuristic for knowing which items are most likely to be retrieved next, and don't forget the indexes, as Charles mentions.

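With the SQLite approach, for instance, a batch of likely-next keys can be fetched in a single query instead of one query per key (a sketch; the table layout and the prefetch function are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (k INTEGER PRIMARY KEY, v INTEGER)")
conn.executemany("INSERT INTO kv VALUES (?, ?)",
                 [(27, 111), (54, 112), (108, 113)])

def prefetch(conn, keys):
    # One round trip for the whole batch instead of one query per key.
    marks = ",".join("?" * len(keys))
    rows = conn.execute(
        "SELECT k, v FROM kv WHERE k IN (%s)" % marks, list(keys)
    )
    return dict(rows)

cache = prefetch(conn, [27, 54, 999])   # 999 is simply absent from the result
```

The returned dict then serves as the in-memory front for the next stretch of lookups.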