Python set() 是如何实现的?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/3949310/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-18 13:29:55  来源:igfitidea点击:

How is set() implemented?

pythondata-structuressetcpython

提问by Daenyth

I've seen people say that setobjects in python have O(1) membership-checking. How are they implemented internally to allow this? What sort of data structure does it use? What other implications does that implementation have?

我见过有人说setpython中的对象有 O(1) 成员资格检查。它们如何在内部实施以允许这样做?它使用什么样的数据结构?该实施还有哪些其他含义?

Every answer here was really enlightening, but I can only accept one, so I'll go with the closest answer to my original question. Thanks all for the info!

这里的每一个答案都非常有启发性,但我只能接受一个,所以我会选择最接近我原来问题的答案。谢谢你的信息!

采纳答案by Justin Ethier

According to this thread:

根据这个线程

Indeed, CPython's sets are implemented as something like dictionaries with dummy values (the keys being the members of the set), with some optimization(s) that exploit this lack of values

实际上,CPython 的集合被实现为类似于具有虚拟值的字典(键是集合的成员),并进行了一些优化来利用这种缺少值的情况

So basically a setuses a hashtable as its underlying data structure. This explains the O(1) membership checking, since looking up an item in a hashtable is an O(1) operation, on average.

所以基本上 aset使用哈希表作为其底层数据结构。这解释了 O(1) 成员资格检查,因为在哈希表中查找项目平均是 O(1) 操作。

If you are so inclined you can even browse the CPython source code for setwhich, according to Achim Domma, is mostly a cut-and-paste from the dictimplementation.

如果您愿意,您甚至可以浏览setCPython 源代码,根据Achim Domma 的说法,它主要是从dict实现中剪切和粘贴的。

回答by Shay Erlichmen

I think its a common mistake, setlookup (or hashtable for that matter) are not O(1).
from the Wikipedia

我认为这是一个常见的错误,set查找(或哈希表)不是 O(1)。
来自维基百科

In the simplest model, the hash function is completely unspecified and the table does not resize. For the best possible choice of hash function, a table of size n with open addressing has no collisions and holds up to n elements, with a single comparison for successful lookup, and a table of size n with chaining and k keys has the minimum max(0, k-n) collisions and O(1 + k/n)comparisons for lookup. For the worst choice of hash function, every insertion causes a collision, and hash tables degenerate to linear search, with Ω(k) amortized comparisons per insertion and up to k comparisons for a successful lookup.

在最简单的模型中,哈希函数是完全未指定的,并且表不会调整大小。对于散列函数的最佳选择,具有开放寻址的大小为 n 的表没有冲突,最多可容纳 n 个元素,只需进行一次比较即可成功查找,大小为 n 的具有链接和 k 个键的表具有最小最大值(0, kn) 碰撞和O(1 + k/n)比较以进行查找。对于哈希函数的最差选择,每次插入都会导致冲突,并且哈希表退化为线性搜索,每次插入都有 Ω(k) 次摊销比较,成功查找最多可进行 k 次比较。

Related: Is a Java hashmap really O(1)?

相关:Java 哈希映射真的是 O(1) 吗?

回答by gimel

We all have easy access to the source, where the comment preceding set_lookkey()says:

我们都可以轻松访问,其中前面的评论set_lookkey()说:

/* set object implementation
 Written and maintained by Raymond D. Hettinger <[email protected]>
 Derived from Lib/sets.py and Objects/dictobject.c.
 The basic lookup function used by all operations.
 This is based on Algorithm D from Knuth Vol. 3, Sec. 6.4.
 The initial probe index is computed as hash mod the table size.
 Subsequent probe indices are computed as explained in Objects/dictobject.c.
 To improve cache locality, each probe inspects a series of consecutive
 nearby entries before moving on to probes elsewhere in memory.  This leaves
 us with a hybrid of linear probing and open addressing.  The linear probing
 reduces the cost of hash collisions because consecutive memory accesses
 tend to be much cheaper than scattered probes.  After LINEAR_PROBES steps,
 we then use open addressing with the upper bits from the hash value.  This
 helps break-up long chains of collisions.
 All arithmetic on hash should ignore overflow.
 Unlike the dictionary implementation, the lookkey function can return
 NULL if the rich comparison returns an error.
*/


...
#ifndef LINEAR_PROBES
#define LINEAR_PROBES 9
#endif

/* This must be >= 1 */
#define PERTURB_SHIFT 5

static setentry *
set_lookkey(PySetObject *so, PyObject *key, Py_hash_t hash)  
{
...

回答by unutbu

When people say sets have O(1) membership-checking, they are talking about the averagecase. In the worstcase (when all hashed values collide) membership-checking is O(n). See the Python wiki on time complexity.

当人们说集合具有 O(1) 成员资格检查时,他们指的是平均情况。在最坏的情况下(当所有散列值发生冲突时)成员资格检查是 O(n)。请参阅有关时间复杂度Python wiki

The Wikipedia articlesays the best casetime complexity for a hash table that does not resize is O(1 + k/n). This result does not directly apply to Python sets since Python sets use a hash table that resizes.

维基百科的文章说,最好的情况下为一个哈希表,不调整大小的时间复杂度O(1 + k/n)。这个结果并不直接适用于 Python 集,因为 Python 集使用了一个调整大小的哈希表。

A little further on the Wikipedia article says that for the averagecase, and assuming a simple uniform hashing function, the time complexity is O(1/(1-k/n)), where k/ncan be bounded by a constant c<1.

维基百科文章的进一步说明,对于一般情况,并假设一个简单的统一散列函数,时间复杂度为O(1/(1-k/n)),其中k/n可以以常数 为界c<1

Big-O refers only to asymptotic behavior as n → ∞. Since k/n can be bounded by a constant, c<1, independent of n,

Big-O 仅将渐近行为称为 n → ∞。由于 k/n 可以由一个常数 c<1 限定,与 n 无关

O(1/(1-k/n))is no bigger than O(1/(1-c))which is equivalent to O(constant)= O(1).

O(1/(1-k/n))不大于O(1/(1-c))等于O(constant)= O(1)

So assuming uniform simple hashing, on average, membership-checking for Python sets is O(1).

因此,假设统一的简单散列,Python 集的成员资格检查平均O(1).

回答by user1767754

To emphasize a little more the difference between set'sand dict's, here is an excerpt from the setobject.ccomment sections, which clarify's the main difference of set's against dicts.

为了更加强调set's和之间的区别dict's,这里是setobject.c评论部分的摘录,它阐明了 set 与 dicts 的主要区别。

Use cases for sets differ considerably from dictionaries where looked-up keys are more likely to be present. In contrast, sets are primarily about membership testing where the presence of an element is not known in advance. Accordingly, the set implementation needs to optimize for both the found and not-found case.

集合的用例与更可能出现查找键的字典有很大不同。相比之下,集合主要是关于成员资格测试,其中元素的存在是事先不知道的。因此,集合实现需要针对找到和未找到的情况进行优化。

source on github

github上的源码