Memory Efficient Alternatives to Python Dictionaries

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/327223/

Memory Efficient Alternatives to Python Dictionaries

Tags: python, memory, data-structures

Asked by ricree

In one of my current side projects, I am scanning through some text looking at the frequency of word triplets. In my first go at it, I used the default dictionary three levels deep. In other words, topDict[word1][word2][word3] returns the number of times these words appear in the text, topDict[word1][word2] returns a dictionary with all the words that appeared following words 1 and 2, etc.

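For reference, a minimal sketch of the nested structure described above might look like the following (triples stands in for some iterable of word triples; this is not the original code):

from collections import defaultdict

# Each missing level is created on demand; the innermost level holds counts.
topDict = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

for word1, word2, word3 in triples:   # triples: assumed iterable of word triples
    topDict[word1][word2][word3] += 1

# topDict[word1][word2] is then a plain mapping of third words to counts
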
This functions correctly, but it is very memory intensive. In my initial tests it used something like 20 times the memory of just storing the triplets in a text file, which seems like an overly large amount of memory overhead.

My suspicion is that many of these dictionaries are being created with many more slots than are actually being used, so I want to replace the dictionaries with something else that is more memory efficient when used in this manner. I would strongly prefer a solution that allows key lookups along the lines of the dictionaries.

From what I know of data structures, a balanced binary search tree using something like red-black or AVL would probably be ideal, but I would really prefer not to implement them myself. If possible, I'd prefer to stick with standard python libraries, but I'm definitely open to other alternatives if they would work best.

So, does anyone have any suggestions for me?

Edited to add:

Thanks for the responses so far. A few of the answers have suggested using tuples, which didn't really do much for me when I condensed the first two words into a tuple. I am hesitant to use all three as a key since I want it to be easy to look up all third words given the first two (i.e. I want something like the result of topDict[word1, word2].keys()).

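A minimal sketch of that access pattern, keying on the first two words (again assuming triples is some iterable of word triples):

from collections import defaultdict

# One dict keyed by the (word1, word2) pair; each value is a small
# histogram of the third words that followed that pair.
topDict = defaultdict(lambda: defaultdict(int))

for word1, word2, word3 in triples:
    topDict[word1, word2][word3] += 1

# topDict[word1, word2].keys() then lists the possible third words
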
The current dataset I am playing around with is the most recent version of Wikipedia For Schools. The result of parsing the first thousand pages, for example, is something like an 11MB text file where each line is the three words and the count, all tab separated. Storing the text in the dictionary format I am now using takes around 185MB. I know that there will be some additional overhead for pointers and whatnot, but the difference seems excessive.

Accepted answer by Darius Bacon

Some measurements. I took 10MB of free e-book text and computed trigram frequencies, producing a 24MB file. Storing it in different simple Python data structures took this much space in kB, measured as RSS from running ps, where d is a dict, keys and freqs are lists, and a,b,c,freq are the fields of a trigram record:

295760     S. Lott's answer
237984     S. Lott's with keys interned before passing in
203172 [*] d[(a,b,c)] = int(freq)
203156     d[a][b][c] = int(freq)
189132     keys.append((a,b,c)); freqs.append(int(freq))
146132     d[intern(a),intern(b)][intern(c)] = int(freq)
145408     d[intern(a)][intern(b)][intern(c)] = int(freq)
 83888 [*] d[a+' '+b+' '+c] = int(freq)
 82776 [*] d[(intern(a),intern(b),intern(c))] = int(freq)
 68756     keys.append((intern(a),intern(b),intern(c))); freqs.append(int(freq))
 60320     keys.append(a+' '+b+' '+c); freqs.append(int(freq))
 50556     pair array
 48320     squeezed pair array
 33024     squeezed single array

The entries marked [*] have no efficient way to look up a pair (a,b); they're listed only because others have suggested them (or variants of them). (I was sort of irked into making this because the top-voted answers were not helpful, as the table shows.)

'Pair array' is the scheme below in my original answer ("I'd start with the array with keys being the first two words..."), where the value table for each pair is represented as a single string. 'Squeezed pair array' is the same, leaving out the frequency values that are equal to 1 (the most common case). 'Squeezed single array' is like squeezed pair array, but gloms key and value together as one string (with a separator character). The squeezed single array code:

import collections

def build(file):
    pairs = collections.defaultdict(list)
    for line in file:  # N.B. file assumed to be already sorted
        a, b, c, freq = line.split()
        key = ' '.join((a, b))
        pairs[key].append(c + ':' + freq if freq != '1' else c)
    out = open('squeezedsinglearrayfile', 'w')
    for key in sorted(pairs.keys()):
        out.write('%s|%s\n' % (key, ' '.join(pairs[key])))
    out.close()

def load():
    return open('squeezedsinglearrayfile').readlines()

if __name__ == '__main__':
    build(open('freqs'))

I haven't written the code to look up values from this structure (use bisect, as mentioned below), or implemented the fancier compressed structures also described below.

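As a rough illustration only, a bisect-based lookup over the file produced by build() above might look like this (it assumes words contain neither spaces nor '|'; the lines are re-sorted on load so that a plain string bisect is valid even when one key is a prefix of another):

import bisect

def load_sorted(path='squeezedsinglearrayfile'):
    lines = open(path).readlines()
    lines.sort()   # guarantee full-string order for bisect
    return lines

def lookup(lines, word1, word2):
    # Return {word3: freq} for the pair (word1, word2), or {} if absent.
    prefix = word1 + ' ' + word2 + '|'
    i = bisect.bisect_left(lines, prefix)
    if i == len(lines) or not lines[i].startswith(prefix):
        return {}
    freqs = {}
    for entry in lines[i][len(prefix):].split():
        word3, _, freq = entry.partition(':')
        freqs[word3] = int(freq) if freq else 1   # an omitted count means 1
    return freqs
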
Original answer: A simple sorted array of strings, each string being a space-separated concatenation of words, searched using the bisect module, should be worth trying for a start. This saves space on pointers, etc. It still wastes space due to the repetition of words; there's a standard trick to strip out common prefixes, with another level of index to get them back, but that's rather more complex and slower. (The idea is to store successive chunks of the array in a compressed form that must be scanned sequentially, along with a random-access index to each chunk. Chunks are big enough to compress, but small enough for reasonable access time. The particular compression scheme applicable here: if successive entries are 'hello george' and 'hello world', make the second entry be '6world' instead. (6 being the length of the prefix in common.) Or maybe you could get away with using zlib? Anyway, you can find out more in this vein by looking up dictionary structures used in full-text search.) So specifically, I'd start with the array with keys being the first two words, with a parallel array whose entries list the possible third words and their frequencies. It might still suck, though -- I think you may be out of luck as far as batteries-included memory-efficient options.

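A toy illustration of that prefix-stripping idea (sometimes called front coding); this is not from the original answer and assumes the unshared tails never begin with a digit:

def front_encode(sorted_entries):
    # Replace each entry with <shared-prefix length><unshared tail>,
    # e.g. 'hello world' after 'hello george' becomes '6world'.
    prev, out = '', []
    for s in sorted_entries:
        n = 0
        while n < min(len(prev), len(s)) and prev[n] == s[n]:
            n += 1
        out.append('%d%s' % (n, s[n:]))
        prev = s
    return out

def front_decode(encoded):
    prev, out = '', []
    for item in encoded:
        i = 0
        while i < len(item) and item[i].isdigit():
            i += 1                      # leading digits = shared prefix length
        prev = prev[:int(item[:i])] + item[i:]
        out.append(prev)
    return out
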
Also, binary tree structures are not recommended for memory efficiency here. E.g., this paper tests a variety of data structures on a similar problem (unigrams instead of trigrams though) and finds a hashtable to beat all of the tree structures by that measure.

I should have mentioned, as someone else did, that the sorted array could be used just for the wordlist, not bigrams or trigrams; then for your 'real' data structure, whatever it is, you use integer keys instead of strings -- indices into the wordlist. (But this keeps you from exploiting common prefixes except in the wordlist itself. Maybe I shouldn't suggest this after all.)

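A minimal sketch of that word-list idea (all_words and triples are assumed inputs, not names from the question):

# One shared list of unique words; everything else stores small integers.
wordlist = sorted(set(all_words))
word_id = dict((w, i) for i, w in enumerate(wordlist))

counts = {}
for a, b, c in triples:
    key = (word_id[a], word_id[b], word_id[c])
    counts[key] = counts.get(key, 0) + 1
# wordlist[i] turns an index back into its word when needed
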
Answer by hasen

Use tuples.
Tuples can be keys to dictionaries, so you don't need to nest dictionaries.

d = {}
d[ word1, word2, word3 ] = 1

Also as a plus, you could use defaultdict

  • so that elements that don't have entries always return 0
  • and so that you can say d[w1,w2,w3] += 1 without checking whether the key already exists

example:

from collections import defaultdict
d = defaultdict(int)
d["first","word","tuple"] += 1

If you need to find all words "word3" that are tupled with (word1,word2), then search for them in dictionary.keys() using a list comprehension

If you have a tuple, t, you can get the first two items using slices:

>>> a = (1,2,3)
>>> a[:2]
(1, 2)

A small example of searching tuples with list comprehensions:

>>> b = [(1,2,3),(1,2,5),(3,4,6)]
>>> search = (1,2)
>>> [a[2] for a in b if a[:2] == search]
[3, 5]

As you can see here, we got a list of all the items that appear as the third item in the tuples that start with (1,2).

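Applied to the trigram dictionary, the same idea might read as follows (note this is a linear scan over every key, so it can be slow on a large dict):

third_words = [w3 for (w1, w2, w3) in d if (w1, w2) == (word1, word2)]
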
Answer by tzot

In this case, ZODB1 BTrees might be helpful, since they are much less memory-hungry. Use a BTrees.OOBTree (Object keys to Object values) or BTrees.OIBTree (Object keys to Integer values), and use 3-word tuples as your key.

Something like:

from BTrees.OOBTree import OOBTree as BTree

The interface is, more or less, dict-like, with the added bonus (for you) that .keys, .items, .iterkeys and .iteritems take two optional arguments, min and max:

>>> t = BTree()
>>> t['a', 'b', 'c'] = 10
>>> t['a', 'b', 'z'] = 11
>>> t['a', 'a', 'z'] = 12
>>> t['a', 'd', 'z'] = 13
>>> print list(t.keys(('a', 'b'), ('a', 'c')))
[('a', 'b', 'c'), ('a', 'b', 'z')]

1 Note that if you are on Windows and work with Python >2.4, I know there are packages for more recent python versions, but I can't recollect where.

PS They exist in the CheeseShop?

Answer by Dustin

A couple of attempts:

I figure you're doing something similar to this:

from __future__ import with_statement

import time
from collections import deque, defaultdict

# Just used to generate some triples of words
def triplegen(words="/usr/share/dict/words"):
    d=deque()
    with open(words) as f:
        for i in range(3):
            d.append(f.readline().strip())

        while d[-1] != '':
            yield tuple(d)
            d.popleft()
            d.append(f.readline().strip())

if __name__ == '__main__':
    class D(dict):
        def __missing__(self, key):
            self[key] = D()
            return self[key]
    h=D()
    for a, b, c in triplegen():
        h[a][b][c] = 1
    time.sleep(60)

That gives me ~88MB.

Changing the storage to

h[a, b, c] = 1

takes ~25MB

Interning a, b, and c makes it take about 31MB. My case is a bit special because my words never repeat in the input. You might try some variations yourself and see if one of these helps you.

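Presumably the interned variant was something along these lines (intern is a builtin in Python 2; it moved to sys.intern in Python 3):

h = {}
for a, b, c in triplegen():
    h[intern(a), intern(b), intern(c)] = 1
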
Answer by orip

Are you implementing Markovian text generation?

If your chains map 2 words to the probabilities of the third, I'd use a dictionary mapping K-tuples to the 3rd-word histogram. A trivial (but memory-hungry) way to implement the histogram would be to use a list with repeats, and then random.choice gives you a word with the proper probability.

Here's an implementation with the K-tuple as a parameter:

import random

# can change these functions to use a dict-based histogram
# instead of a list with repeats
def default_histogram():          return []
def add_to_histogram(item, hist): hist.append(item)
def choose_from_histogram(hist):  return random.choice(hist)

K=2 # look 2 words back
words = ...
d = {}

# build histograms
for i in xrange(len(words)-K-1):
  key = tuple(words[i:i+K])  # tuple, so the slice can be used as a dict key
  word = words[i+K]

  d.setdefault(key, default_histogram())
  add_to_histogram(word, d[key])

# generate text
start = random.randrange(len(words)-K-1)
key = tuple(words[start:start+K])
for i in xrange(NUM_WORDS_TO_GENERATE):
  word = choose_from_histogram(d[key])
  print word,
  key = key[1:] + (word,)
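
The comment at the top of that script hints at a dict-based histogram; one possible drop-in variant (a guess at what was meant, not the answerer's code) keeps counts instead of repeats and samples proportionally:

import random

def default_histogram():          return {}
def add_to_histogram(item, hist): hist[item] = hist.get(item, 0) + 1
def choose_from_histogram(hist):
    # pick a key with probability proportional to its count
    target = random.randrange(sum(hist.values()))
    for item, count in hist.items():
        target -= count
        if target < 0:
            return item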

Answer by user39307

You could try using the same dictionary, only one level deep.

topDictionary[word1+delimiter+word2+delimiter+word3]

The delimiter could be a plain " " (or you could use (word1,word2,word3)).

This would be the easiest to implement. I believe you will see a little improvement; if it is not enough... I'll think of something...

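A minimal sketch under those assumptions (a space delimiter and a defaultdict for counting; triples is an assumed input):

from collections import defaultdict

topDictionary = defaultdict(int)
for word1, word2, word3 in triples:
    topDictionary[word1 + ' ' + word2 + ' ' + word3] += 1
# finding every third word for (word1, word2) then needs a scan over the keys
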
Answer by user39307

Scipy has sparse matrices, so if you can make the first two words a tuple, you can do something like this:

import numpy as N
from scipy import sparse

word_index = {}
# word_count is assumed to be the number of distinct words, known in advance
count = sparse.lil_matrix((word_count*word_count, word_count), dtype=N.int)

for word1, word2, word3 in triple_list:
    w1 = word_index.setdefault(word1, len(word_index))
    w2 = word_index.setdefault(word2, len(word_index))
    w3 = word_index.setdefault(word3, len(word_index))
    w1_w2 = w1 * word_count + w2
    count[w1_w2,w3] += 1
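
A hedged sketch of reading the counts back out: for a lil_matrix, count.rows[i] holds the column indices with nonzero entries in row i and count.data[i] holds the matching values, so the third words following a given (word1, word2) could be recovered roughly like this (id_to_word is an assumed inverse of word_index):

id_to_word = dict((i, w) for w, i in word_index.items())
w1_w2 = word_index[word1] * word_count + word_index[word2]
third_words = [(id_to_word[w3], freq)
               for w3, freq in zip(count.rows[w1_w2], count.data[w1_w2])]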

Answer by Stephan Eggermont

Ok, so you are basically trying to store a sparse 3D space. The kind of access pattern you want for this space is crucial for the choice of algorithm and data structure. Considering your data source, do you want to feed this to a grid? If you don't need O(1) access:

To get memory efficiency, you want to subdivide that space into subspaces with a similar number of entries (like a BTree). So, a data structure with the following fields (a rough sketch follows the list):

  • firstWordRange
  • secondWordRange
  • thirdWordRange
  • numberOfEntries
  • a sorted block of entries.
  • next and previous blocks in all 3 dimensions
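
A rough sketch of one such block record (purely illustrative; the field layout and neighbor links are assumptions, not a worked-out design):

class Block(object):
    def __init__(self, first_word_range, second_word_range, third_word_range):
        self.first_word_range = first_word_range     # (low, high) bounds on word 1
        self.second_word_range = second_word_range   # (low, high) bounds on word 2
        self.third_word_range = third_word_range     # (low, high) bounds on word 3
        self.entries = []                            # sorted (w1, w2, w3, count) tuples
        self.neighbors = {}                          # dimension -> (previous block, next block)

    def number_of_entries(self):
        return len(self.entries)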

Answer by orip

If memory is simply not big enough, pybsddb can help store a disk-persistent map.

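As a rough stand-in (pybsddb itself is a third-party package), the standard library's shelve module gives a similar disk-backed, dict-like map; keys must be strings, so the trigram can be joined into one:

import shelve

db = shelve.open('trigrams.db')           # hypothetical filename
key = ' '.join((word1, word2, word3))     # word1..word3 assumed given
db[key] = db.get(key, 0) + 1
db.close()
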
Answer by orip

You could use a numpy multidimensional array. You'll need to use numbers rather than strings to index into the array, but that can be solved by using a single dict to map words to numbers.

import numpy
w = {'word1':1, 'word2':2, 'word3':3, 'word4':4}
a = numpy.zeros( (4,4,4) )

Then to index into your array, you'd do something like:

a[w[word1], w[word2], w[word3]] += 1

That syntax is not beautiful, but numpy arrays are about as efficient as anything you're likely to find. Note also that I haven't tried this code out, so I may be off in some of the details. Just going from memory here.
