Key-ordered dict in Python

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it you must do so under the same license and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/1319763/

Time: 2020-11-03 21:56:44  Source: igfitidea

Key-ordered dict in Python

python, data-structures, collections, dictionary

Asked by LeMiz

I am looking for a solid implementation of an ordered associative array, that is, an ordered dictionary. I want the ordering in terms of keys, not of insertion order.

More precisely, I am looking for a space-efficient implementation of an int-to-float (or string-to-float for another use case) mapping structure for which:

  • Ordered iteration is O(n)
  • Random access is O(1)

The best I came up with was gluing a dict and a list of keys, keeping the latter ordered with bisect and insert.

Any better ideas?

Accepted answer by Alex Martelli

"Random access O(1)" is an extremely exacting requirement which basically imposes an underlying hash table -- and I hope you do mean random READS only, because I think it can be mathematically proven that it's impossible in the general case to have O(1) writes as well as O(N) ordered iteration.

I don't think you will find a pre-packaged container suited to your needs because they are so extreme -- O(log N) access would of course make all the difference in the world. To get the big-O behavior you want for reads and iterations you'll need to glue two data structures, essentially a dict and a heap (or sorted list or tree), and keep them in sync. Although you don't specify, I think you'll only get amortized behavior of the kind you want -- unless you're truly willing to pay any performance hits for inserts and deletes, which is the literal implication of the specs you express but does seem a pretty unlikely real-life requirement.

For O(1) read and amortized O(N) ordered iteration, just keep a list of all keys on the side of a dict. E.g.:

class Crazy(object):
  def __init__(self):
    self.d = {}           # key -> value: O(1) random access
    self.L = []           # all keys, sorted lazily
    self.sorted = True
  def __getitem__(self, k):
    return self.d[k]
  def __setitem__(self, k, v):
    if k not in self.d:
      self.L.append(k)    # defer sorting until iteration
      self.sorted = False
    self.d[k] = v
  def __delitem__(self, k):
    del self.d[k]
    self.L.remove(k)      # list removal is O(N)
  def __iter__(self):
    if not self.sorted:   # sort only when an iteration needs it
      self.L.sort()
      self.sorted = True
    return iter(self.L)

If you don't like the "amortized O(N) order" you can remove self.sorted and just call self.L.sort() in __setitem__ itself. That makes writes O(N log N), of course (while the version above keeps writes at O(1)). Either approach is viable and it's hard to think of one as intrinsically superior to the other. If you tend to do a bunch of writes then a bunch of iterations, the approach in the code above is best; if it's typically one write, one iteration, another write, another iteration, then it's just about a wash.

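That eager variant could be sketched like this (the class name CrazyEager is mine, not from the original answer):

```python
class CrazyEager(object):
  """Variant without the `sorted` flag: re-sort on every new-key
  write instead of lazily at iteration time."""
  def __init__(self):
    self.d = {}
    self.L = []
  def __getitem__(self, k):
    return self.d[k]
  def __setitem__(self, k, v):
    if k not in self.d:
      self.L.append(k)
      self.L.sort()       # timsort is near-O(N) when mostly sorted
    self.d[k] = v
  def __delitem__(self, k):
    del self.d[k]
    self.L.remove(k)
  def __iter__(self):
    return iter(self.L)   # already in key order
```
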

BTW, this takes shameless advantage of the unusual (and wonderful;-) performance characteristics of Python's sort (aka "timsort"): among them, sorting a list that's mostly sorted but with a few extra items tacked on at the end is basically O(N) (if the tacked-on items are few enough compared to the sorted prefix part). I hear Java's gaining this sort soon, as Josh Bloch was so impressed by a tech talk on Python's sort that he started coding it for the JVM on his laptop then and there. Most systems (including, I believe, Jython as of today and IronPython too) basically have sorting as an O(N log N) operation, not taking advantage of "mostly ordered" inputs; "natural mergesort", which Tim Peters fashioned into Python's timsort of today, is a wonder in this respect.

Answered by GrantJ

The sortedcontainers module provides a SortedDict type that meets your requirements. It basically glues a SortedList and a dict together. The dict provides O(1) lookup and the SortedList provides O(N) iteration (it's extremely fast). The whole module is pure Python and has benchmark graphs to back up the performance claims (as fast as C implementations). SortedDict is also fully tested with 100% coverage and hours of stress testing. It's compatible with Python 2.6 through 3.4.

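A minimal usage sketch, assuming the package is installed (e.g. `pip install sortedcontainers`); keys iterate in key order regardless of insertion order:

```python
from sortedcontainers import SortedDict

sd = SortedDict()
sd[3] = 3.3
sd[1] = 1.0
sd[2] = 2.2

print(sd[2])     # O(1)-style lookup through the underlying dict
print(list(sd))  # keys in sorted order: [1, 2, 3]
```
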

Answered by LeMiz

Here is my own implementation:

import bisect
class KeyOrderedDict(object):
   __slots__ = ['d', 'l']
   def __init__(self, *args, **kwargs):
      self.l = sorted(kwargs)
      self.d = kwargs

   def __setitem__(self, k, v):
      if k not in self.d:
         idx = bisect.bisect(self.l, k)
         self.l.insert(idx, k)
      self.d[k] = v

   def __getitem__(self, k):
      return self.d[k]

   def __delitem__(self, k):
      idx = bisect.bisect_left(self.l, k)
      del self.l[idx]
      del self.d[k]

   def __iter__(self):
      return iter(self.l)

   def __contains__(self, k):
      return k in self.d

The use of bisect keeps self.l ordered, and insertion is O(n) (because of the insert, but that's not a killer in my case, because I append far more often than I truly insert, so the usual case is amortized O(1)). Access is O(1), and iteration O(n). But maybe someone has invented (in C) something with a cleverer structure?

Answered by fortran

An ordered tree is usually better for these cases, but random access is going to be log(n). You should also take into account insertion and removal costs...

Answered by John Fouhy

You could build a dict that allows traversal by storing a pair (value, next_key) in each position.

Random access:

my_dict[k][0]   # for a key k

Traversal:

k = start_key   # stored somewhere
while k is not None:     # next_key is None at the end of the list
    v, k = my_dict[k]
    yield v

Keep a pointer to start and end and you'll have an efficient update for those cases where you just need to add onto the end of the list.

Inserting in the middle is obviously O(n). Possibly you could build a skip list on top of it if you need more speed.

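A minimal sketch of this linked-dict idea (all names are mine), supporting only append-at-end, O(1) random access by key, and O(n) traversal in link order:

```python
class LinkedDict(object):
    """Each entry stores [value, next_key]; `start`/`end` point at the
    first and last keys so appends at the end are O(1)."""
    def __init__(self):
        self.d = {}
        self.start = None
        self.end = None

    def append(self, k, v):
        self.d[k] = [v, None]
        if self.start is None:
            self.start = k           # first entry
        else:
            self.d[self.end][1] = k  # link after the current tail
        self.end = k

    def __getitem__(self, k):
        return self.d[k][0]          # O(1) random access

    def values_in_order(self):
        k = self.start
        while k is not None:         # next_key is None at the end
            v, k = self.d[k]
            yield v
```
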

Answered by Santi

I'm not sure which Python version you are working in, but in case you like to experiment, Python 3.1 includes an official implementation of ordered dictionaries: http://www.python.org/dev/peps/pep-0372/ and http://docs.python.org/3.1/whatsnew/3.1.html#pep-372-ordered-dictionaries

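A caveat worth noting (mine, not Santi's): OrderedDict remembers insertion order, not key order, so for this question it only helps as a static, pre-sorted snapshot:

```python
from collections import OrderedDict

d = {3: 3.3, 1: 1.0, 2: 2.2}
# Sort once at construction; keys inserted later would land at the
# end of the order, not at their sorted position.
od = OrderedDict(sorted(d.items()))
print(list(od))   # [1, 2, 3]
```
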

Answered by Anthon

The ordereddict package (http://anthon.home.xs4all.nl/Python/ordereddict/) that I implemented back in 2007 includes sorteddict. sorteddict is a KSO (Key Sorted Order) dictionary. It is implemented in C, very space efficient, and several times faster than a pure Python implementation. The downside is that it only works with CPython.

>>> from _ordereddict import sorteddict
>>> x = sorteddict()
>>> x[1] = 1.0
>>> x[3] = 3.3
>>> x[2] = 2.2
>>> print x
sorteddict([(1, 1.0), (2, 2.2), (3, 3.3)])
>>> for i in x:
...    print i, x[i]
... 
1 1.0
2 2.2
3 3.3
>>> 

Sorry for the late reply, maybe this answer can help others find that library.

Answered by Mikhail Korobov

For the "string-to-float" problem you can use a Trie - it provides O(1) access time and O(n) sorted iteration. By "sorted" I mean "sorted alphabetically by key" - it seems that the question implies the same.

Some implementations (each with its own strong and weak points):

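As a rough illustration of the idea (the packaged implementations are C-backed and far more space-efficient), here is a minimal pure-Python string-to-float trie with alphabetical iteration; all names are mine:

```python
class Trie(object):
    """String-keyed trie: nodes are plain dicts mapping a character to
    a child node; a sentinel key marks 'a value is stored here'."""
    _LEAF = object()

    def __init__(self):
        self.root = {}

    def __setitem__(self, key, value):
        node = self.root
        for ch in key:
            node = node.setdefault(ch, {})
        node[Trie._LEAF] = value

    def __getitem__(self, key):
        node = self.root
        for ch in key:
            node = node[ch]      # KeyError if the prefix is absent
        return node[Trie._LEAF]  # KeyError if no value stored here

    def items(self):
        # DFS over children in character order, so keys come out
        # alphabetically sorted.
        def walk(node, prefix):
            if Trie._LEAF in node:
                yield prefix, node[Trie._LEAF]
            for ch in sorted(k for k in node if k is not Trie._LEAF):
                yield from walk(node[ch], prefix + ch)
        return walk(self.root, "")
```
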

Answered by SingleNegationElimination

Here's a pastie: I had a need for something similar. Note however that this specific implementation is immutable; there are no inserts once the instance is created. The exact performance doesn't quite match what you're asking for, however: lookup is O(log n) and a full scan is O(n). This works using the bisect module on a tuple of (key, value) pairs. Even if you can't use this precisely, you might have some success adapting it to your needs.

import bisect

class dictuple(object):
    """
        >>> h0 = dictuple()
        >>> h1 = dictuple({"apples": 1, "bananas":2})
        >>> h2 = dictuple({"bananas": 3, "mangoes": 5})
        >>> h1+h2
        ('apples':1, 'bananas':3, 'mangoes':5)
        >>> h1 > h2
        False
        >>> h1 > 6
        False
        >>> 'apples' in h1
        True
        >>> 'apples' in h2
        False
        >>> d1 = {}
        >>> d1[h1] = "salad"
        >>> d1[h1]
        'salad'
        >>> d1[h2]
        Traceback (most recent call last):
        ...
        KeyError: ('bananas':3, 'mangoes':5)
   """


    def __new__(cls, *args, **kwargs):
        initial = {}
        args = [] if args is None else args
        for arg in args:
            initial.update(arg)
        initial.update(kwargs)

        instance = object.__new__(cls)
        instance.__items = tuple(sorted(initial.items(),key=lambda i:i[0]))
        return instance

    def __init__(self,*args, **kwargs):
        pass

    def __find(self, key):
        # (key,) sorts before any (key, value) pair with the same key,
        # so bisect returns the index where such a pair would sit.
        return bisect.bisect(self.__items, (key,))

    def __getitem__(self, key):
        ind = self.__find(key)
        if ind < len(self.__items) and self.__items[ind][0] == key:
            return self.__items[ind][1]
        raise KeyError(key)
    def __repr__(self):
        return "({0})".format(", ".join(
                        "{0}:{1}".format(repr(item[0]),repr(item[1]))
                          for item in self.__items))
    def __contains__(self, key):
        ind = self.__find(key)
        return ind < len(self.__items) and self.__items[ind][0] == key
    def __cmp__(self,other):

        return cmp(self.__class__.__name__, other.__class__.__name__
                  ) or cmp(self.__items, other.__items)
    def __eq__(self,other):
        return self.__items == other.__items
    def __format__(self,key):
        pass
    #def __ge__(self,key):
    #    pass
    #def __getattribute__(self,key):
    #    pass
    #def __gt__(self,key):
    #    pass
    __seed = hash("dictuple")
    def __hash__(self):
        return dictuple.__seed^hash(self.__items)
    def __iter__(self):
        return self.iterkeys()
    def __len__(self):
        return len(self.__items)
    #def __reduce__(self,key):
    #    pass
    #def __reduce_ex__(self,key):
    #    pass
    #def __sizeof__(self,key):
    #    pass

    @classmethod
    def fromkeys(cls, key, v=None):
        return cls(dict.fromkeys(key, v))

    def get(self, key, default=None):
        ind = self.__find(key)
        if ind < len(self.__items) and self.__items[ind][0] == key:
            return self.__items[ind][1]
        return default

    def has_key(self, key):
        return key in self

    def items(self):
        return list(self.iteritems())

    def iteritems(self):
        return iter(self.__items)

    def iterkeys(self):
        return (i[0] for i in self.__items)

    def itervalues(self):
        return (i[1] for i in self.__items)

    def keys(self):
        return list(self.iterkeys())

    def values(self):
        return list(self.itervalues())
    def __add__(self, other):
        _sum = dict(self.__items)
        _sum.update(other.__items)
        return self.__class__(_sum)

if __name__ == "__main__":
    import doctest
    doctest.testmod()

Answered by KT.

Here's one option that has not been mentioned in other answers, I think:

  • Use a binary search tree (Treap/AVL/RB) to keep the mapping.
  • Also use a hashmap (aka dictionary) to keep the same mapping (again).

This will provide O(n) ordered traversal (via the tree), O(1) random access (via the hashmap) and O(log n) insertion/deletion (because you need to update both the tree and the hash).

The drawback is the need to keep all the data twice; however, the alternatives which suggest keeping a list of keys alongside a hashmap are not much better in this sense.

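A minimal sketch of the dual-structure idea (unbalanced BST for brevity; a real implementation would use a Treap/AVL/red-black tree to actually guarantee the O(log n) update bound; all names are mine):

```python
class TreeDict(object):
    """A BST over the keys for ordered traversal, plus a hashmap for
    O(1) reads; both structures are updated on insert."""
    def __init__(self):
        self.map = {}     # key -> value: O(1) random access
        self.root = None  # BST node: [key, left, right]

    def __setitem__(self, key, value):
        if key not in self.map:
            self.root = self._insert(self.root, key)
        self.map[key] = value

    def _insert(self, node, key):
        if node is None:
            return [key, None, None]
        if key < node[0]:
            node[1] = self._insert(node[1], key)
        else:
            node[2] = self._insert(node[2], key)
        return node

    def __getitem__(self, key):
        return self.map[key]          # hashmap path: O(1)

    def __iter__(self):
        # Iterative in-order walk: O(n) ordered traversal.
        stack, node = [], self.root
        while stack or node is not None:
            while node is not None:
                stack.append(node)
                node = node[1]
            node = stack.pop()
            yield node[0]
            node = node[2]
```
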