C - How to implement Set data structure?
Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/2630738/
Asked by psihodelia
Is there any tricky way to implement a set data structure (a collection of unique values) in C? All elements in a set will be of the same type, and there is a huge amount of RAM available.
As far as I know, for integers it can be done really fast'N'easy using value-indexed arrays. But I'd like to have a very general Set data type. And it would be nice if a set could include itself.
Accepted answer by vladr
There are multiple ways of implementing set (and map) functionality, for example:
- tree-based approach (ordered traversal)
- hash-based approach (unordered traversal)
Since you mentioned value-indexed arrays, let's try the hash-based approach which builds naturally on top of the value-indexed array technique.
Beware of the advantages and disadvantages of hash-based vs. tree-based approaches.
You can design a hash-set (a special case of hash-tables) of pointers to hashable PODs, with chaining, internally represented as a fixed-size array of buckets of hashables, where:
- all hashables in a bucket have the same hash value
- a bucket can be implemented as a dynamic array or linked list of hashables
- a hashable's hash value is used to index into the array of buckets (hash-value-indexed array)
- one or more of the hashables contained in the hash-set could be (a pointer to) another hash-set, or even to the hash-set itself (i.e. self-inclusion is possible)
With large amounts of memory at your disposal, you can size your array of buckets generously and, in combination with a good hash method, drastically reduce the probability of collision, achieving virtually constant-time performance.
You would have to implement:
- the hash function for the type being hashed
- an equality function for the type, used to test whether two hashables are equal or not
- the hash-set contains/insert/remove functionality
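The pieces listed above could fit together roughly as follows. This is only a minimal sketch of a chained hash-set of pointers, not the answerer's code: every name in it (hashset, hash_fn, equal_fn, str_hash, ptr_hash and so on) is hypothetical, the bucket array never grows, and the djb2 string hash is just one common choice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* All names here are hypothetical; none come from the answer. */
typedef size_t (*hash_fn)(const void *item);
typedef bool (*equal_fn)(const void *a, const void *b);

typedef struct node {
    const void *item;       /* the set stores pointers, not copies */
    struct node *next;      /* next hashable in the same bucket */
} node;

typedef struct {
    node **buckets;         /* fixed-size array of chain heads */
    size_t nbuckets;
    hash_fn hash;
    equal_fn equal;
} hashset;

hashset *hashset_new(size_t nbuckets, hash_fn hash, equal_fn equal)
{
    assert(nbuckets > 0 && hash != NULL && equal != NULL);
    hashset *s = malloc(sizeof *s);
    if (s == NULL) return NULL;
    s->buckets = calloc(nbuckets, sizeof *s->buckets);
    if (s->buckets == NULL) { free(s); return NULL; }
    s->nbuckets = nbuckets;
    s->hash = hash;
    s->equal = equal;
    return s;
}

/* Returns the link to the matching node, or the NULL link ending the chain. */
static node **find_link(const hashset *s, const void *item)
{
    node **link = &s->buckets[s->hash(item) % s->nbuckets];
    while (*link != NULL && !s->equal((*link)->item, item))
        link = &(*link)->next;
    return link;
}

bool hashset_contains(const hashset *s, const void *item)
{
    return *find_link(s, item) != NULL;
}

/* True if added; false if already present or out of memory. */
bool hashset_insert(hashset *s, const void *item)
{
    node **link = find_link(s, item);
    if (*link != NULL) return false;
    node *n = malloc(sizeof *n);
    if (n == NULL) return false;
    n->item = item;
    n->next = NULL;
    *link = n;
    return true;
}

bool hashset_remove(hashset *s, const void *item)
{
    node **link = find_link(s, item);
    node *n = *link;
    if (n == NULL) return false;
    *link = n->next;
    free(n);
    return true;
}

/* Frees the nodes and the table, never the items themselves. */
void hashset_free(hashset *s)
{
    for (size_t i = 0; i < s->nbuckets; i++)
        for (node *n = s->buckets[i], *next; n != NULL; n = next) {
            next = n->next;
            free(n);
        }
    free(s->buckets);
    free(s);
}

/* Content-based callbacks for NUL-terminated strings (djb2 hash). */
size_t str_hash(const void *item)
{
    size_t h = 5381;
    for (const unsigned char *p = item; *p != '\0'; p++)
        h = h * 33 + *p;
    return h;
}
bool str_equal(const void *a, const void *b) { return strcmp(a, b) == 0; }

/* Identity-based callbacks: equal only if the same object. */
size_t ptr_hash(const void *item) { return (size_t)((uintptr_t)item >> 4); }
bool ptr_equal(const void *a, const void *b) { return a == b; }
```

With the identity callbacks, a set created by hashset_new(8, ptr_hash, ptr_equal) can be passed to hashset_insert as its own item, which is one way to get the self-inclusion mentioned above. With the string callbacks, two separate buffers holding the same text count as one element.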
You can also use open addressing as an alternative to maintaining and managing buckets.
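For comparison, here is a rough sketch of open addressing with linear probing. It is an illustration under several assumptions that are not in the answer: keys are plain uint32_t values, the table has a fixed hypothetical capacity of 1024 slots and never resizes, and removed slots are marked with tombstones. The names oa_set, oa_insert and friends are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define OA_BITS 10u                       /* hypothetical table size: 2^10 slots */
#define OA_CAPACITY (1u << OA_BITS)

enum { SLOT_EMPTY = 0, SLOT_USED, SLOT_DELETED };

typedef struct {
    uint32_t keys[OA_CAPACITY];
    unsigned char state[OA_CAPACITY];     /* SLOT_* value for each slot */
    size_t used;                          /* number of SLOT_USED slots */
} oa_set;                                 /* zero-initialise before use */

/* Multiplicative hash: keep the top OA_BITS bits of the product. */
static size_t oa_home(uint32_t key)
{
    return (uint32_t)(key * 2654435761u) >> (32u - OA_BITS);
}

/* Returns the slot holding key, or OA_CAPACITY if it is absent. */
static size_t oa_find(const oa_set *s, uint32_t key)
{
    size_t i = oa_home(key);
    for (size_t n = 0; n < OA_CAPACITY; n++) {
        if (s->state[i] == SLOT_EMPTY)
            break;                        /* the key cannot lie further along */
        if (s->state[i] == SLOT_USED && s->keys[i] == key)
            return i;
        i = (i + 1) & (OA_CAPACITY - 1);  /* linear probe with wrap-around */
    }
    return OA_CAPACITY;
}

bool oa_contains(const oa_set *s, uint32_t key)
{
    assert(s != NULL);
    return oa_find(s, key) != OA_CAPACITY;
}

/* True if added; false if already present or the table is full. */
bool oa_insert(oa_set *s, uint32_t key)
{
    assert(s != NULL);
    if (oa_find(s, key) != OA_CAPACITY)
        return false;
    size_t i = oa_home(key);
    for (size_t n = 0; n < OA_CAPACITY; n++) {
        if (s->state[i] != SLOT_USED) {   /* empty or deleted slot: reuse it */
            s->keys[i] = key;
            s->state[i] = SLOT_USED;
            s->used++;
            return true;
        }
        i = (i + 1) & (OA_CAPACITY - 1);
    }
    return false;
}

bool oa_remove(oa_set *s, uint32_t key)
{
    assert(s != NULL);
    size_t i = oa_find(s, key);
    if (i == OA_CAPACITY)
        return false;
    s->state[i] = SLOT_DELETED;           /* tombstone keeps later probes reachable */
    s->used--;
    return true;
}
```

There are no per-element allocations here, which is the main attraction over chaining. The cost is that the table has to stay well below full for probes to remain short, and a real implementation would resize and rehash instead of failing when the table fills.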
Answer by andand
Sets are usually implemented as some variety of a binary tree. Red-black trees have good worst-case performance.
These can also be used to build a map to allow key / value lookups.
This approach requires some sort of ordering on the elements of the set and the key values in a map.
I'm not sure how you would manage a set that could possibly contain itself using binary trees if you limit set membership to well defined types in C ... comparison between such constructs could be problematic. You could do it easily enough in C++, though.
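To make the ordering requirement concrete, here is a small sketch of a tree-based set driven by a comparison callback. Note that it is a plain unbalanced binary search tree, not a red-black tree, so it degrades to a linked list on sorted input; a balanced tree would add rotations on top of the same idea. All names (treeset, compare_fn, compare_cstr, ...) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Must return <0, 0 or >0, like strcmp(). */
typedef int (*compare_fn)(const void *a, const void *b);

typedef struct tnode {
    const void *item;                 /* stored pointer, not a copy */
    struct tnode *left, *right;
} tnode;

typedef struct {
    tnode *root;
    compare_fn compare;
    size_t count;
} treeset;

void treeset_init(treeset *s, compare_fn compare)
{
    assert(s != NULL && compare != NULL);
    s->root = NULL;
    s->compare = compare;
    s->count = 0;
}

bool treeset_contains(const treeset *s, const void *item)
{
    const tnode *n = s->root;
    while (n != NULL) {
        int c = s->compare(item, n->item);
        if (c == 0) return true;
        n = (c < 0) ? n->left : n->right;
    }
    return false;
}

/* True if added; false if an equal item exists or memory ran out. */
bool treeset_insert(treeset *s, const void *item)
{
    tnode **link = &s->root;
    while (*link != NULL) {
        int c = s->compare(item, (*link)->item);
        if (c == 0) return false;
        link = (c < 0) ? &(*link)->left : &(*link)->right;
    }
    tnode *n = malloc(sizeof *n);
    if (n == NULL) return false;
    n->item = item;
    n->left = n->right = NULL;
    *link = n;
    s->count++;
    return true;
}

/* In-order walk: calls visit on every item in ascending order. */
static void walk_nodes(const tnode *n,
                       void (*visit)(const void *item, void *ctx), void *ctx)
{
    if (n == NULL) return;
    walk_nodes(n->left, visit, ctx);
    visit(n->item, ctx);
    walk_nodes(n->right, visit, ctx);
}

void treeset_walk(const treeset *s,
                  void (*visit)(const void *item, void *ctx), void *ctx)
{
    walk_nodes(s->root, visit, ctx);
}

static void free_nodes(tnode *n)
{
    if (n == NULL) return;
    free_nodes(n->left);
    free_nodes(n->right);
    free(n);
}

/* Frees the nodes only; the items belong to the caller. */
void treeset_clear(treeset *s)
{
    free_nodes(s->root);
    s->root = NULL;
    s->count = 0;
}

/* Example ordering for NUL-terminated strings. */
int compare_cstr(const void *a, const void *b) { return strcmp(a, b); }
```

The comparison callback is where the difficulty raised above shows up: for a set that may contain other sets (or itself), someone has to define what "less than" means between two such elements. Comparing raw pointer values is one pragmatic answer, but relational comparison of pointers to unrelated objects is not well defined in standard C.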
Answer by High Performance Mark
If the maximum number of elements in the set (the cardinality of the underlying data type) is small enough, you might want to consider using a plain old array of bits (or whatever you call them in your favourite language).
Then you have a simple set membership check: bit n is 1 if element n is in the set. You could even count 'ordinary' members from 1, and only make bit 0 equal to 1 if the set contains itself.
This approach will probably require some other data structure (or function) to translate from the member data type to the position in the bit array (and back), but it makes basic set operations (union, intersection, membership test, difference, insertion, removal, complement) very easy. It is only suitable for relatively small sets, though; I don't suppose you would want to use it for sets of 32-bit integers.
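A bit-array set along these lines might look like the sketch below. The universe size of 256 elements and all the names (bitset, bitset_add, bitset_union, ...) are assumptions made for the example; the translation from real member values to bit positions is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define UNIVERSE 256u                     /* hypothetical number of possible members */
#define WORD_BITS 64u
#define NWORDS ((UNIVERSE + WORD_BITS - 1) / WORD_BITS)

typedef struct { uint64_t w[NWORDS]; } bitset;   /* all zero = empty set */

void bitset_add(bitset *s, unsigned n)
{
    assert(n < UNIVERSE);
    s->w[n / WORD_BITS] |= UINT64_C(1) << (n % WORD_BITS);
}

void bitset_remove(bitset *s, unsigned n)
{
    assert(n < UNIVERSE);
    s->w[n / WORD_BITS] &= ~(UINT64_C(1) << (n % WORD_BITS));
}

/* Membership test: bit n is 1 if element n is in the set. */
bool bitset_has(const bitset *s, unsigned n)
{
    assert(n < UNIVERSE);
    return (s->w[n / WORD_BITS] >> (n % WORD_BITS)) & 1u;
}

bitset bitset_union(const bitset *a, const bitset *b)
{
    bitset r;
    for (size_t i = 0; i < NWORDS; i++) r.w[i] = a->w[i] | b->w[i];
    return r;
}

bitset bitset_intersection(const bitset *a, const bitset *b)
{
    bitset r;
    for (size_t i = 0; i < NWORDS; i++) r.w[i] = a->w[i] & b->w[i];
    return r;
}

/* Elements of a that are not in b. */
bitset bitset_difference(const bitset *a, const bitset *b)
{
    bitset r;
    for (size_t i = 0; i < NWORDS; i++) r.w[i] = a->w[i] & ~b->w[i];
    return r;
}

bitset bitset_complement(const bitset *a)
{
    bitset r;
    for (size_t i = 0; i < NWORDS; i++) r.w[i] = ~a->w[i];
    if (UNIVERSE % WORD_BITS != 0)        /* clear bits beyond the universe */
        r.w[NWORDS - 1] &= (UINT64_C(1) << (UNIVERSE % WORD_BITS)) - 1;
    return r;
}
```

Each set operation is a handful of word-wide bitwise instructions, which is why this representation is so cheap for small universes. Under the convention suggested above, bit 0 could be reserved to mean "this set contains itself" and ordinary members numbered from 1.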
Answer by David Thornley
The way to get genericity in C is by void *, so you're going to be using pointers anyway, and pointers to different objects are unique. This means you need a hash map or binary tree containing pointers, and this will work for all data objects.
The downside of this is that you can't enter rvalues independently. You can't have a set containing the value 5; you have to assign 5 to a variable, which means it won't match a random 5. You could enter it as (void *) 5, and for practical purposes this is likely to work with small integers, but if your integers can get into large enough sizes to compete with pointers this has a very small probability of failing.
Nor does this work with string values. Given char a[] = "Hello, World!"; char b[] = "Hello, World!";, a set of pointers would find a and b to be different. You would probably want to hash the values, but if you're concerned about hash collisions you should save the string in the set and do a strncmp() to compare the stored string with the probing string.
(There are similar problems with floating-point numbers, but trying to represent floating-point numbers in sets is a bad idea in the first place.)
Therefore, you'd probably want a tagged value, one tag for any sort of object, one for integer value, and one for string value, and possibly more for different sorts of values. It's complicated, but doable.
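One possible shape for such a tagged value is sketched below. The three tags follow the description above, but the type and function names (value, value_equal, value_hash, ...) are invented for this example, strings are assumed to be NUL-terminated so strcmp() stands in for the strncmp() mentioned earlier, and the djb2 hash is just one reasonable choice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum { TAG_OBJECT, TAG_INT, TAG_STRING } value_tag;

typedef struct {
    value_tag tag;
    union {
        const void *object;   /* any object: compared by address */
        long integer;         /* integer: compared by value */
        const char *string;   /* string: compared by content */
    } as;
} value;

value value_object(const void *p)
{
    return (value){ .tag = TAG_OBJECT, .as.object = p };
}

value value_int(long n)
{
    return (value){ .tag = TAG_INT, .as.integer = n };
}

value value_string(const char *s)
{
    assert(s != NULL);
    return (value){ .tag = TAG_STRING, .as.string = s };
}

/* Values with different tags are never equal, so the integer 5
   and the string "5" are distinct set members. */
bool value_equal(value a, value b)
{
    if (a.tag != b.tag) return false;
    switch (a.tag) {
    case TAG_OBJECT: return a.as.object == b.as.object;
    case TAG_INT:    return a.as.integer == b.as.integer;
    case TAG_STRING: return strcmp(a.as.string, b.as.string) == 0;
    }
    return false;
}

/* Equal values must hash equally, so strings are hashed by content. */
size_t value_hash(value v)
{
    switch (v.tag) {
    case TAG_OBJECT:
        return (size_t)(uintptr_t)v.as.object;
    case TAG_INT:
        return (size_t)v.as.integer;
    case TAG_STRING: {
        size_t h = 5381;                  /* djb2 */
        for (const unsigned char *p = (const unsigned char *)v.as.string; *p; p++)
            h = h * 33 + *p;
        return h;
    }
    }
    return 0;
}
```

With this, value_int(5) matches any other 5 without going through a variable's address, and value_string(a) equals value_string(b) for the two "Hello, World!" arrays above, while value_object(a) and value_object(b) stay distinct. A hash map or tree would then store value structs and use value_hash and value_equal as its callbacks.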

