Improving performance of very large dictionary in Python
Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license, link to the original, and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/16256913/
Asked by szli
I find that if I initialize an empty dictionary at the beginning and then add elements to it in a for loop (about 110,000 keys; the value for each key is a list that also grows inside the loop), the speed drops as the loop progresses.
I suspect the problem is that the dictionary does not know the number of keys at init time and is not doing anything very smart about it, so perhaps storage collisions become quite frequent and slow it down.
If I know the number of keys and exactly what those keys are, is there any way in Python to make a dict (or a hashtable) work more efficiently? I vaguely remember that if you know the keys, you can design the hash function smartly (perfect hash?) and allocate the space beforehand.
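For reference, a minimal sketch of the kind of loop described above (the key names and values here are hypothetical):

import random

d = {}                                  # the dict cannot know the final size up front
for i in range(110000):                 # roughly the number of keys mentioned above
    key = 'item_%d' % i                 # hypothetical string keys
    d.setdefault(key, []).append(random.random())   # per-key lists also grow in the loop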
Accepted answer by Raymond Hettinger
If I know the number of keys and exactly what those keys are, is there any way in Python to make a dict (or a hashtable) work more efficiently? I vaguely remember that if you know the keys, you can design the hash function smartly (perfect hash?) and allocate the space beforehand.
Python doesn't expose a pre-sizing option to speed up the "growth phase" of a dictionary, nor does it provide any direct controls over "placement" in the dictionary.
That said, if the keys are always known in advance, you can store them in a set and build your dictionaries from the set using dict.fromkeys(). That classmethod is optimized to pre-size the dictionary based on the set size, and it can populate the dictionary without any new calls to __hash__():
>>> keys = {'red', 'green', 'blue', 'yellow', 'orange', 'pink', 'black'}
>>> d = dict.fromkeys(keys) # dict is pre-sized to 32 empty slots
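Since the values in the question are lists, one detail worth noting: passing a single list as the second argument to dict.fromkeys() would make every key share the same list object. A sketch of one way to attach per-key lists after pre-sizing (not part of the original answer):

>>> keys = {'red', 'green', 'blue', 'yellow', 'orange', 'pink', 'black'}
>>> d = dict.fromkeys(keys)     # keys hashed once, table pre-sized
>>> for k in d:
...     d[k] = []               # fresh list per key; no resize, the keys already exist
...
>>> d['red'].append(1)
>>> d['green']
[]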
If reducing collisions is your goal, you can run experiments on the insertion order in the dictionary to minimize pile-ups. (Take a look at Brent's variation on Algorithm D in Knuth's TAOCP to get an idea of how this is done.)
By instrumenting a pure Python model for dictionaries (such as this one), it is possible to count the weighted-average number of probes for an alternative insertion order. For example, inserting dict.fromkeys([11100, 22200, 44400, 33300]) averages 1.75 probes per lookup. That beats the 2.25 average probes per lookup for dict.fromkeys([33300, 22200, 11100, 44400]).
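A sketch of such a probe count, assuming CPython's classic probe sequence (start at hash & mask, then i = 5*i + perturb + 1, with perturb initialized to the hash and shifted right by 5 after each step); the real implementation differs in details across versions:

def avg_probes(keys, table_size=8):
    """Average probes per key when the keys are inserted, in order, into an
    open-addressed table using a simplified CPython-style probe sequence."""
    mask = table_size - 1
    table = [None] * table_size
    probes = 0
    for key in keys:
        h = hash(key)
        i = h & mask
        perturb = h
        probes += 1
        while table[i] is not None:          # collision: keep probing
            i = (5 * i + perturb + 1) & mask
            perturb >>= 5
            probes += 1
        table[i] = key
    return probes / len(keys)

print(avg_probes([11100, 22200, 44400, 33300]))  # 1.75 with this simplified model
print(avg_probes([33300, 22200, 11100, 44400]))  # 2.25 with this simplified model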
Another "trick" is to increase spareness in a fully populated dictionary by fooling it into increasing its size without adding new keys:
另一个“技巧”是通过在不添加新键的情况下欺骗它增加其大小来增加完全填充的字典的备用性:
d = dict.fromkeys(['red', 'green', 'blue', 'yellow', 'orange'])
d.update(dict(d)) # This makes room for additional keys
# and makes the set collision-free.
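To check that the update call really did rebuild the table larger, you can compare sys.getsizeof() before and after; the exact byte counts (and how much the trick helps) depend on the Python version:

import sys

d = dict.fromkeys(['red', 'green', 'blue', 'yellow', 'orange'])
print(sys.getsizeof(d))   # bytes used with the original table
d.update(dict(d))         # same keys, but update over-allocates the table
print(sys.getsizeof(d))   # typically larger, reflecting the extra empty slots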
Lastly, you can introduce your own custom __hash__() for your keys with the goal of eliminating all collisions (perhaps using a perfect hash generator such as gperf).
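As a toy illustration of that idea (hand-written, not generated by gperf, and assuming the key set really is fixed), a wrapper class can give every known key a distinct small hash so the keys land in distinct slots:

class PerfectKey:
    """Key wrapper whose hash is a precomputed, collision-free slot number."""
    _slot = {name: i for i, name in enumerate(
        ['red', 'green', 'blue', 'yellow', 'orange', 'pink', 'black'])}

    def __init__(self, name):
        self.name = name

    def __hash__(self):
        return self._slot[self.name]    # distinct values 0..n-1: no two keys collide

    def __eq__(self, other):
        return isinstance(other, PerfectKey) and self.name == other.name

d = {PerfectKey(name): [] for name in PerfectKey._slot}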

